Date: July 9th, 2020, 14h
Title: GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data
Presenter: Vinicius Vielmo Cogo
Where: Zoom. If you are not an INESC TEC collaborator, please register by emailing eventos@inesctec.pt by July 8 to receive the link to the Zoom session. The webinar will be recorded.
About: InfoBlender is a periodic seminar organized by HASLab (INESC TEC/UMinho). It brings together people from academia and industry, as well as the general public, to present and discuss interesting ideas, publications, work in progress, rehearsal talks, etc. InfoBlender helps to bridge the gaps between different research areas and groups, especially those of HASLab, by blending their diverse information, hence the name InfoBlender.
Abstract: The vast datasets produced in human genomics must be efficiently stored, transferred, and processed while prioritizing storage space and restore performance. Balancing these two properties becomes challenging when resorting to traditional data compression techniques. In fact, specialized algorithms for compressing sequencing data favor the former, while large genome repositories widely resort to generic compressors (e.g., GZIP) to benefit from the latter. Notably, human beings share approximately 99.9% of their DNA sequence, which presents an excellent opportunity for deduplication and its benefits: leveraging inter-file similarity and achieving higher read performance. However, identity-based deduplication fails to provide a satisfactory reduction in the storage requirements of genomes. In this work, we balance space savings and restore performance by proposing GenoDedup, the first method that integrates efficient similarity-based deduplication and specialized delta-encoding for genome sequencing data. Our solution currently achieves 67.8% of the reduction gains of SPRING (i.e., the best specialized tool in this metric) and restores data 1.62x faster than SeqDB (i.e., the fastest competitor). Additionally, GenoDedup restores data 9.96x faster than SPRING and compresses files 2.05x more than SeqDB. The paper is available here.
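Illustration: The difference between identity-based and similarity-based deduplication mentioned in the abstract can be sketched in a few lines of Python. This is only a toy example under stated assumptions, not GenoDedup's actual algorithm or data format: the names hamming, encode, decode, and index are hypothetical, all sequences are assumed to have the same length, and the brute-force search stands in for whatever scalable index a real system would use.

```python
# Toy sketch of similarity-based deduplication with delta-encoding.
# NOT GenoDedup's implementation: all names and the data layout are
# hypothetical, and every sequence is assumed to have the same length.

def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def encode(read, index):
    """Encode a read as (base_id, delta) against its most similar base entry."""
    # Brute-force nearest-neighbour search over the index (illustrative only).
    base_id = min(range(len(index)), key=lambda i: hamming(read, index[i]))
    base = index[base_id]
    # The delta lists only the substitutions needed to turn the base into the read.
    delta = [(pos, ch) for pos, (ch, ref) in enumerate(zip(read, base)) if ch != ref]
    return base_id, delta

def decode(base_id, delta, index):
    """Restore the original read by applying the substitutions to the base."""
    seq = list(index[base_id])
    for pos, ch in delta:
        seq[pos] = ch
    return "".join(seq)

index = ["ACGTACGTAC", "TTTTGGGGCC"]  # previously stored base sequences
read = "ACGTACCTAC"                   # differs from index[0] in one base

# Identity-based deduplication finds no exact match for this read...
exact_match = read in index

# ...whereas similarity-based deduplication stores a pointer plus a small delta.
base_id, delta = encode(read, index)
restored = decode(base_id, delta, index)
```

In this toy case the read is stored as a reference to the first base entry plus a single substitution, and decoding reproduces it exactly. Restoring requires only a lookup and a few in-place edits, which hints at why the approach can favor restore performance.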
Biography: Vinicius Cogo, LASIGE researcher, is a PhD candidate in Informatics at the Faculty of Sciences (Ciências) of the University of Lisbon (ULisboa, Portugal). He has an MSc in Informatics from Ciências/ULisboa and a BSc in Computer Science from the Federal University of Santa Maria (UFSM, Brazil). He has been a researcher at LASIGE since 2009, has worked on 6 projects, and has authored more than 20 peer-reviewed publications. His research interests include distributed systems, dependability, fault tolerance, storage of critical data, and cloud computing.