A taxonomy of tools and approaches for distributed genomic analyses

Garzón, Wilmer; Benavides, Luis Alberto; Gaignard, Alban; Südholt, Mario

Abstract (English)

The amount of biomedical data collected and stored has grown significantly. Analyzing such volumes of data is no longer feasible for individuals or single organizations, so the scientific community is building global collaborative efforts to analyze them. However, biomedical data are subject to several legal and socio-economic restrictions that hinder research collaboration. In this paper, we argue that researchers require new tools and techniques to address the restrictions and needs of global scientific collaborations over geo-distributed biomedical data. These tools and techniques must support what we call Fully Distributed Collaborations (FDCs): research endeavors that exploit and analyze massive biomedical information collaboratively while respecting legal and socio-economic restrictions. This paper first motivates and discusses the requirements of FDCs in the context of a research collaboration on the development of diagnostic and predictive tools for the risk of intracranial aneurysm formation and rupture (the ICAN project). The paper then presents a taxonomy that classifies current tools and techniques for biomedical analysis with respect to the proposed requirements. The taxonomy considers three key architectural features for supporting FDC scenarios: data and computation placement, privacy and security, and performance and scalability. The review reveals new research opportunities to design tools and techniques for multi-site analyses that encourage scientific collaboration while mitigating technical and legal constraints.

Extent

17 pages

URI: https://repositorio.escuelaing.edu.co/handle/001/3156

Collections

AD - CTG – Informática

