MPEG-G supports several types of selective access to compressed genomic data that are important in genomic analysis. The data structure specified by MPEG-G supports partitioning and efficient querying of data according to the characteristics of the encoded genome sequencing reads. For instance, an analyst can select reads from an MPEG-G file according to the following criteria:
- For unmapped reads:
  - patterns of nucleotides of interest contained in the sequences
  - quality criteria such as those produced by tools like FastQC
- For mapped reads:
  - all reads mapped in a genomic region specified in terms of sequence (e.g. chromosome) and a start and end mapping position
  - reads perfectly matching the reference genome
  - reads with substitutions only (no indels or clipped bases), with an optional maximum number of substitutions
  - reads with substitutions, indels and clipped bases with no mismatches
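Several of the criteria above can be read as simple predicates over reads. The following Python sketch is only an illustration of those predicates. It is not the MPEG-G API or file layout: the `MappedRead` class, its field names, the 0-based half-open coordinates and the choice to treat "mapped in a region" as "overlapping the region" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappedRead:
    """Hypothetical in-memory view of a mapped read (not an MPEG-G structure)."""
    sequence_name: str       # reference sequence, e.g. "chr1"
    start: int               # assumed 0-based, inclusive
    end: int                 # assumed 0-based, exclusive
    substitutions: int = 0   # number of substituted bases
    indels: int = 0          # number of insertions and deletions
    clipped_bases: int = 0   # number of clipped bases


def in_region(read: MappedRead, sequence_name: str, start: int, end: int) -> bool:
    """True if the read overlaps [start, end) on the given sequence."""
    return (read.sequence_name == sequence_name
            and read.start < end
            and read.end > start)


def is_perfect_match(read: MappedRead) -> bool:
    """True if the read matches the reference with no differences at all."""
    return read.substitutions == 0 and read.indels == 0 and read.clipped_bases == 0


def has_substitutions_only(read: MappedRead,
                           max_substitutions: Optional[int] = None) -> bool:
    """True if the read differs from the reference by substitutions only,
    optionally capped at max_substitutions."""
    if read.substitutions == 0 or read.indels or read.clipped_bases:
        return False
    return max_substitutions is None or read.substitutions <= max_substitutions


def contains_pattern(sequence: str, pattern: str) -> bool:
    """True if a nucleotide pattern occurs in an (unmapped) read sequence."""
    return pattern in sequence
```

A query such as "reads on chr1 between two positions with at most two substitutions" then becomes `[r for r in reads if in_region(r, "chr1", 180, 300) and has_substitutions_only(r, 2)]`. The sketch scans a plain Python list, so it shows what is selected, not how an MPEG-G file makes that selection efficient.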
The selected regions (potentially scattered along several chromosomes) can be labelled with a single identifier for further retrieval with a single query. This way, the analyst can easily save her work and resume it at a later time by quickly accessing only the sub-regions of interest selected in the previous sessions.
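The idea of one identifier covering several scattered regions can be sketched as a mapping from a label to a list of regions. This is a minimal illustration under stated assumptions: the `RegionLabels` class, the `(sequence name, start, end)` tuple layout and the JSON serialisation are hypothetical and do not describe how MPEG-G stores labels.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical region layout: (sequence name, start, end).
Region = Tuple[str, int, int]


class RegionLabels:
    """Maps one label to many regions, possibly on different chromosomes."""

    def __init__(self) -> None:
        self._by_label: Dict[str, List[Region]] = defaultdict(list)

    def add(self, label: str, sequence_name: str, start: int, end: int) -> None:
        """Attach one more region to the given label."""
        self._by_label[label].append((sequence_name, start, end))

    def regions(self, label: str) -> List[Region]:
        """Return every region saved under the label with a single lookup."""
        return list(self._by_label.get(label, []))

    def dumps(self) -> str:
        """Serialise all labels so a session can be resumed later."""
        return json.dumps(self._by_label)

    @classmethod
    def loads(cls, text: str) -> "RegionLabels":
        """Rebuild the labels from a string produced by dumps()."""
        labels = cls()
        for label, regions in json.loads(text).items():
            for sequence_name, start, end in regions:
                labels.add(label, sequence_name, start, end)
        return labels
```

In this sketch, an analyst would call `add("session-1", "chr1", 100, 200)` and `add("session-1", "chr7", 5000, 5600)` during one session, save the output of `dumps()`, and in a later session recover both regions with `RegionLabels.loads(saved).regions("session-1")`.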
Furthermore, MPEG-G also allows the same selective access operations to be performed directly on remote content, for instance to efficiently carry out large population studies over genomic databases on the web.