In collaboration with the Carl R. Woese Institute for Genomic Biology at the University of Illinois Urbana-Champaign (UIUC), GenomSys is developing the first version of the Genome Analysis Toolkit (GATK), the popular toolkit developed by the Broad Institute, that can process compressed MPEG-G bitstreams.
We are extending the GATK interfaces to:
- perform BWA-MEM alignment on raw sequence data coded as MPEG-G files (instead of FASTQ)
- perform variant calling using HaplotypeCaller on aligned MPEG-G files (instead of BAM or CRAM)
In this way, we can demonstrate that MPEG-G enables alignment and variant calling on data that is stored remotely and streamed to the processing platform.
A first version of the demonstrator was shown in San Diego at the 4th MPEG Workshop on Genomic Information Representation, held during the 122nd MPEG meeting.
In the demonstrator shown in San Diego:
- Data were stored on servers at UIUC
- GATK ran on Microsoft Azure
- MPEG-G selective access and streaming were implemented to transfer only the required data from the storage servers at UIUC to the Microsoft cloud
- GATK processed MPEG-G records instead of BAM records
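The idea behind selective access can be illustrated with a minimal sketch. It is not the MPEG-G format or the demonstrator's code: the index layout, the `IndexEntry` class, and both helper functions are hypothetical, and the only assumptions are that the remote file is split into independently decodable blocks and that an index maps each block to a genomic interval. Given a requested region, the sketch works out which byte ranges need to be transferred, so the rest of the file never leaves the storage server.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index row: one independently decodable block of records."""
    chrom: str
    start: int   # first covered position, inclusive
    end: int     # last covered position, inclusive
    offset: int  # byte offset of the block in the remote file
    length: int  # block size in bytes


def byte_ranges_for_region(
    index: List[IndexEntry], chrom: str, start: int, end: int
) -> List[Tuple[int, int]]:
    """Return (first_byte, last_byte) ranges of the blocks overlapping a region."""
    hits = sorted(
        (e.offset, e.offset + e.length - 1)
        for e in index
        if e.chrom == chrom and e.start <= end and e.end >= start
    )
    merged: List[Tuple[int, int]] = []
    for first, last in hits:
        if merged and first <= merged[-1][1] + 1:
            # Adjacent or overlapping blocks can be fetched in one request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def range_header(first: int, last: int) -> str:
    """Format a standard HTTP Range header value for one byte range."""
    return f"bytes={first}-{last}"


if __name__ == "__main__":
    # Made-up index for illustration only.
    example_index = [
        IndexEntry("chr1", 1, 1000, 0, 500),
        IndexEntry("chr1", 1001, 2000, 500, 400),
        IndexEntry("chr2", 1, 1000, 900, 300),
    ]
    for first, last in byte_ranges_for_region(example_index, "chr1", 900, 1500):
        print(range_header(first, last))
```

A client could send each resulting value as the `Range` header of an HTTP request, which is one possible transport; the transport used in the demonstrator is not specified here.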
GenomSys is currently working to accelerate the pipeline by exploiting the compressed data structure of the MPEG-G data containers.