MPEG-G supports several types of selective access to compressed genomic data that are important in genomic analysis. The data structure specified by MPEG-G supports partitioning and efficient querying of data according to the characteristics of the encoded genome sequencing reads. For instance, an analyst can select reads from an MPEG-G file according to the following criteria:

  1. For unmapped reads
    • patterns of nucleotides of interest contained in the sequences
    • quality criteria such as those produced by tools like FastQC
  2. For mapped reads
    • all reads mapped in a genomic region specified in terms of sequence (e.g. chromosome) and a start and end mapping position
    • reads perfectly matching the reference genome
    • reads with substitutions only (no indels or clipped bases), with an optional maximum number of substitutions
    • reads with substitutions, indels and clipped bases with no mismatches
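
As an illustration of the mapped-read criteria above, the following is a minimal Python sketch. It assumes a hypothetical in-memory record (`MappedRead`) and hypothetical helper functions; none of these names come from the MPEG-G specification or from any MPEG-G decoder, and the linear scan here only shows the selection logic, not the indexed access that the MPEG-G data structure is designed to provide. Coordinates are assumed to be inclusive, and a read is taken to be "in" a region if it overlaps it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappedRead:
    # Hypothetical record for illustration; not the MPEG-G record layout.
    name: str
    seq_id: str              # reference sequence, e.g. "chr1"
    start: int               # first mapped position (inclusive)
    end: int                 # last mapped position (inclusive)
    substitutions: int = 0   # number of substituted bases
    indels: int = 0          # number of insertions/deletions
    clipped_bases: int = 0   # number of clipped bases


def in_region(read: MappedRead, seq_id: str, start: int, end: int) -> bool:
    """True if the read overlaps [start, end] on the given sequence."""
    return read.seq_id == seq_id and read.start <= end and read.end >= start


def is_perfect_match(read: MappedRead) -> bool:
    """True if the read has no substitutions, indels or clipped bases."""
    return (read.substitutions == 0
            and read.indels == 0
            and read.clipped_bases == 0)


def has_substitutions_only(read: MappedRead,
                           max_substitutions: Optional[int] = None) -> bool:
    """True if the read differs from the reference by substitutions alone.

    Assumption: a read with zero substitutions counts as a perfect match,
    not as a "substitutions only" read.
    """
    if read.substitutions == 0 or read.indels or read.clipped_bases:
        return False
    return max_substitutions is None or read.substitutions <= max_substitutions


reads = [
    MappedRead("r1", "chr1", 100, 199),
    MappedRead("r2", "chr1", 150, 249, substitutions=2),
    MappedRead("r3", "chr1", 5000, 5099, substitutions=1, indels=1),
    MappedRead("r4", "chr2", 120, 219, substitutions=5),
]

region_hits = [r.name for r in reads if in_region(r, "chr1", 100, 300)]
perfect = [r.name for r in reads if is_perfect_match(r)]
few_substitutions = [r.name for r in reads if has_substitutions_only(r, 3)]
```

The criteria can be combined freely, for example by requiring both `in_region(...)` and `is_perfect_match(...)` for the same read.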

The selected regions (potentially scattered along several chromosomes) can be labelled with a single identifier for later retrieval with a single query. This way, the analyst can easily save her work and resume it at a later time by quickly accessing only the sub-regions of interest selected in the previous sessions.
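
The labelling idea can be sketched as a mapping from one identifier to several regions. The Python below is a hypothetical illustration only: `Region`, `LabelStore` and `reads_for_label` are invented names, reads are reduced to `(name, seq_id, start, end)` tuples with inclusive coordinates, and nothing here reflects how MPEG-G actually stores labels.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    seq_id: str   # reference sequence, e.g. "chr1"
    start: int    # inclusive
    end: int      # inclusive


class LabelStore:
    """Hypothetical store mapping one label to many regions."""

    def __init__(self) -> None:
        self._regions = defaultdict(list)

    def add(self, label: str, region: Region) -> None:
        self._regions[label].append(region)

    def regions(self, label: str) -> List[Region]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._regions.get(label, []))


def reads_for_label(store: LabelStore, label: str,
                    reads: Iterable[Tuple[str, str, int, int]]) -> List[str]:
    """Names of reads overlapping any region attached to the label."""
    regions = store.regions(label)
    selected = []
    for name, seq_id, start, end in reads:
        if any(r.seq_id == seq_id and start <= r.end and end >= r.start
               for r in regions):
            selected.append(name)
    return selected


store = LabelStore()
# Two regions on different chromosomes share one label.
store.add("session-1", Region("chr1", 100, 300))
store.add("session-1", Region("chr7", 1000, 2000))

reads = [
    ("r1", "chr1", 150, 249),
    ("r2", "chr2", 150, 249),
    ("r3", "chr7", 1990, 2089),
    ("r4", "chr7", 2001, 2100),
]

selected = reads_for_label(store, "session-1", reads)
```

A single call with the label then returns the reads from all regions saved under it, which is the "single query" behaviour described above.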

Furthermore, MPEG-G allows the same selective access operations to be performed directly on remote content, for instance to efficiently carry out large population studies over genomic databases on the web.
