The GenCoder project aims to implement the first prototype of an MPEG-G compliant encoder and decoder for efficient compression, storage, transport and analysis of genome sequencing data.
Genomic analysis is poised to become the major generator of big data by 2025, with 2 million genomes already sequenced thanks to Next-Generation Sequencing (NGS) devices. The stakeholders involved in genomic data analysis and management (research and clinical centers, biobanks, genome service providers) face two problems:
- the increasing cost of data storage (on average 850 € per terabyte per year, which amounts to several million € for large data repositories) and
- the lack of system interoperability due to poorly specified interfaces, which prevents the efficient data transport and sharing needed to perform analysis on large heterogeneous datasets.
GenomSys is participating in the standardization of MPEG-G, the new ISO standard for genomic information representation, and is implementing the first MPEG-G compliant encoder and decoder. The main advantages of MPEG-G are:
- Enhanced compression: from 50% to 100x compression, depending on the selected coding mode.
- Processing time reduction: up to 50x faster than current practice for a typical genome analysis, thanks to selective and rapid access to specific blocks of data and metadata.
- Open process of technology specification: the ISO process of standards development offers enterprise-grade technology specifications and long-term support and maintenance.
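To make the storage figures above concrete, the following minimal Python sketch estimates the yearly storage bill of a repository before and after compression. It uses the 850 € per terabyte per year average quoted above. Everything else is an assumption made for illustration: the function name `annual_storage_cost`, the 2 PB repository size (taken as 2000 TB in decimal units), and the reading of "50%" as a 2x size reduction. It is not part of GenCoder or of the MPEG-G specification.

```python
# Rough storage-cost estimate; a sketch under stated assumptions, not a pricing model.

COST_EUR_PER_TB_YEAR = 850.0  # average cost quoted in the text


def annual_storage_cost(size_tb, compression_ratio=1.0):
    """Return the yearly storage cost in euros.

    size_tb: uncompressed repository size in terabytes.
    compression_ratio: uncompressed size divided by compressed size
        (1.0 means no compression). Hypothetical parameter for illustration.
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return size_tb / compression_ratio * COST_EUR_PER_TB_YEAR


if __name__ == "__main__":
    repository_tb = 2000.0  # assumed 2 PB repository, decimal units
    scenarios = [
        ("no compression", 1.0),
        ("50% reduction (assumed to mean 2x)", 2.0),
        ("100x compression", 100.0),
    ]
    for label, ratio in scenarios:
        cost = annual_storage_cost(repository_tb, ratio)
        print(f"{label}: {cost:,.0f} EUR per year")
```

Under these assumptions, an uncompressed 2 PB repository costs about 1.7 million € per year, which is consistent with the "several million €" order of magnitude cited for large repositories.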
SME Phase 1 project
The GenCoder project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 827840.
The technical feasibility study aims to optimize and validate the performance of the GenCoder software for encoding and streaming genomic data in real application scenarios, in order to deliver a software library compliant with the MPEG-G ISO standard specifications. The work will involve two partners with significant genomics datasets (more than 2 PB each). The current GenCoder libraries will be tested on data stored in different formats (gzipped FASTQ, BAM, CRAM) and from different species (human, animal, plant, bacteria), as well as on cancer cell lines and RNAseq data, in order to assess Key Performance
Validation of the business model, the revenue streams, and the pricing strategy, with the aim of setting adequate price points for all customer categories. This will be done by defining business cases that involve real customers in various application scenarios (research centers and hospitals, biotech and pharma companies, data repositories).