A taxonomy of tools and approaches for distributed genomic analyses

Garzón, Wilmer; Benavides, Luis Alberto; Gaignard, Alban; Südholt, Mario

dc.contributor.author	Garzón, Wilmer
dc.contributor.author	Benavides, Luis Alberto
dc.contributor.author	Gaignard, Alban
dc.contributor.author	Südholt, Mario
dc.date.accessioned	2024-07-11T16:51:03Z
dc.date.available	2024-07-11T16:51:03Z
dc.date.issued	2022
dc.identifier.uri	https://repositorio.escuelaing.edu.co/handle/001/3156
dc.description.abstract	The amount of biomedical data collected and stored has grown significantly. Analyzing these extensive amounts of data can no longer be done by individuals or single organizations. Thus, the scientific community is creating global collaborative efforts to analyze these data. However, biomedical data is subject to several legal and socio-economic restrictions that hinder research collaboration. In this paper, we argue that researchers require new tools and techniques to address the restrictions and needs of global scientific collaborations over geo-distributed biomedical data. These tools and techniques must support what we call Fully Distributed Collaborations (FDC), which are research endeavors that harness means to exploit and analyze massive biomedical information collaboratively while respecting legal and socio-economic restrictions. This paper first motivates and discusses the requirements of FDCs in the context of a research collaboration on the development of diagnostic and predictive tools for the risk of intracranial aneurysm formation and rupture (the ICAN project). The paper then presents a taxonomy classifying the current tools and techniques for biomedical analysis with respect to the proposed requirements. The taxonomy considers three key architectural features to support FDC scenarios: data and computation placement, privacy and security, and performance and scalability. The review reveals new research opportunities to design tools and techniques for multi-site analyses that encourage scientific collaborations while mitigating technical and legal constraints.	eng
dc.description.abstract	La cantidad de datos biomédicos recopilados y almacenados ha aumentado significativamente. El análisis de estas grandes cantidades de datos ya no lo pueden realizar individuos ni organizaciones individuales. Así, la comunidad científica está creando esfuerzos colaborativos globales para analizar estos datos. Sin embargo, los datos biomédicos están sujetos a varias restricciones legales y socioeconómicas que obstaculizan las posibilidades de colaboración en investigación. En este artículo, sostenemos que los investigadores necesitan nuevas herramientas y técnicas para abordar las restricciones y necesidades de las colaboraciones científicas globales sobre datos biomédicos geodistribuidos. Estas herramientas y técnicas deben respaldar lo que llamamos Colaboraciones Totalmente Distribuidas (FDC), que son esfuerzos de investigación que aprovechan los medios para explotar y analizar información biomédica masiva de manera colaborativa respetando las restricciones legales y socioeconómicas. En primer lugar, este artículo motiva y analiza los requisitos de las FDC en el contexto de una colaboración de investigación sobre el desarrollo de herramientas de diagnóstico y predicción del riesgo de formación y rotura de aneurismas intracraneales (el proyecto ICAN). Luego, el artículo presenta una taxonomía que clasifica las herramientas y técnicas actuales para el análisis biomédico con respecto a los requisitos propuestos. La taxonomía considera tres características arquitectónicas clave para admitir escenarios FDC: ubicación de datos y cálculos, privacidad y seguridad, y rendimiento y escalabilidad. La revisión revela nuevas oportunidades de investigación para diseñar herramientas y técnicas para análisis multisitio que fomenten colaboraciones científicas y al mismo tiempo mitiguen las limitaciones técnicas y legales.	spa
dc.format.extent	17 páginas	spa
dc.format.mimetype	application/pdf	spa
dc.language.iso	eng	spa
dc.publisher	Elsevier Ltd	spa
dc.source	www.elsevier.com/locate/imu	spa
dc.title	A taxonomy of tools and approaches for distributed genomic analyses	eng
dc.type	Artículo de revista	spa
dc.type.version	info:eu-repo/semantics/publishedVersion	spa
oaire.accessrights	http://purl.org/coar/access_right/c_abf2	spa
oaire.version	http://purl.org/coar/version/c_970fb48d4fbd8a85	spa
dc.contributor.researchgroup	CTG - Informática	spa
dc.identifier.eissn	2352-9148	spa
dc.identifier.instname	Universidad Escuela Colombiana de Ingeniería Julio Garavito	spa
dc.identifier.reponame	Repositorio Digital	spa
dc.identifier.repourl	https://repositorio.escuelaing.edu.co/	spa
dc.publisher.place	Bogotá (Colombia)	spa
dc.relation.citationedition	Vol. 32 año 2022	spa
dc.relation.citationendpage	17	spa
dc.relation.citationstartpage	1	spa
dc.relation.citationvolume	32	spa
dc.relation.ispartofjournal	Informatics in Medicine Unlocked	eng
dc.relation.references	Abouelhoda M, Issa SA, Ghanem M. Tavaxy: integrating taverna and galaxy workflows with cloud computing support. BMC Bioinfo 2012;13:77. https://doi.org/10.1186/1471-2105-13-77	spa
dc.relation.references	Abu-Doleh A, Catalyurek UV. Spaler: spark and GraphX based de novo genome assembler. In: 2015 IEEE international conference on big data (big data). IEEE; 2015. https://doi.org/10.1109/bigdata.2015.7363853	spa
dc.relation.references	Abuín JM, Pichel JC, Pena TF, Amigo J. SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PLOS ONE 2016;11:e0155461. https://doi.org/10.1371/journal.pone.0155461	spa
dc.relation.references	Al-Zoubi K, Wainer G. Modelling fog & cloud collaboration methods on large scale. In: 2020 winter simulation conference (WSC); 2020. p. 2161–72. https://doi.org/10.1109/WSC48552.2020.9384058	spa
dc.relation.references	Almeida JS, Grüneberg A, Maass W, Vinga S. Fractal MapReduce decomposition of sequence alignment. Algorithm Mol Biol 2012;7. https://doi.org/10.1186/1748-7188-7-12	spa
dc.relation.references	ANR. IntraCranial ANeurysms: from familial forms to pathophysiological mechanisms – I-CAN. 2019. http://www.agence-nationale-recherche.fr/Project-ANR-15-CE17-0008. [Accessed 10 October 2019]	spa
dc.relation.references	Atkinson M, Gesing S, Montagnat J, Taylor I. Scientific workflows: past, present and future. 2017. https://doi.org/10.1016/j.future.2017.05.041	spa
dc.relation.references	Barillot C, Bannier E, Commowick O, Corouge I, Baire A, Fakhfakh I, Guillaumont J, Yao Y, Kain M. Shanoir: applying the software as a service distribution model to manage brain imaging research repositories. Front ICT 2016;3:25. URL: https://www.frontiersin.org/article/10.3389/fict.2016.00025	spa
dc.relation.references	Barseghian D, Altintas I, et al. Workflows and extensions to the kepler scientific workflow system to support environmental sensor data access and analysis. Ecol Inf 2010;5:42–50. https://doi.org/10.1016/j.ecoinf.2009.08.008	spa
dc.relation.references	Bez M, Fornari G, Vardanega T. The scalability challenge of ethereum: an initial quantitative analysis. In: 2019 IEEE international conference on service-oriented system engineering (SOSE). IEEE; 2019. https://doi.org/10.1109/sose.2019.00031	spa
dc.relation.references	Bondiombouy C, Valduriez P. Query processing in multistore systems: an overview. Int J Cloud Comput 2016;5:309–46	spa
dc.relation.references	Boujdad FZ, Südholt M. Constructive privacy for shared genetic data. In: Proceedings of the 8th international conference on cloud computing and services science. SCITEPRESS - Science and Technology Publications; 2018. https://doi.org/10.5220/0006765804890496	spa
dc.relation.references	Boujdad FZ, Gaignard A, et al. On distributed collaboration for biomedical analyses. In: 2019 19th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID). IEEE; 2019. https://doi.org/10.1109/ccgrid.2019.00079	spa
dc.relation.references	Boujdad FZ, Niyitegeka D, Bellafqira R, Gouenou C, Emmanuelle G, Südholt M. A hybrid cloud deployment architecture for privacy-preserving collaborative genome-wide association studies. In: ICDF2C 2021 - 12th EAI international conference on digital forensics & cyber crime; 2021	spa
dc.relation.references	Bourcier R, Chatel S, et al. Understanding the pathophysiology of intracranial aneurysm: the ICAN project. Neurosurgery 2017;80:621–6. https://doi.org/10.1093/neuros/nyw135	spa
dc.relation.references	Bux M, Brandt J, Witt C, Dowling J, Leser U. Hi-way: execution of scientific workflows on hadoop yarn. In: 20th international conference on extending database technology, EDBT 2017, 21 march 2017 through 24 march 2017. OpenProceedings.org; 2017. p. 668–79. https://doi.org/10.5441/002/edbt.2017.87	spa
dc.relation.references	Bux M, Leser U. Parallelization in scientific workflow management systems. 2013. arXiv preprint arXiv:1303.7195	spa
dc.relation.references	Canali C, Lancellotti R, Mione S. Collaboration strategies for fog computing under heterogeneous network-bound scenarios. In: 2020 IEEE 19th international symposium on network computing and applications (NCA); 2020. p. 1–8. https://doi.org/10.1109/NCA51143.2020.9306730	spa
dc.relation.references	Cano I, Weimer M, Mahajan D, Curino C, Fumarola GM. Towards geo-distributed machine learning. 2016. arXiv preprint arXiv:1603.09035	spa
dc.relation.references	de Castro MR, dos Santos Tostes C, et al. SparkBLAST: scalable BLAST processing using in-memory operations. BMC Bioinf 2017;18. https://doi.org/10.1186/s12859-017-1723-8	spa
dc.relation.references	Cattaneo G, Giancarlo R, et al. MapReduce in computational biology - a synopsis. In: Advances in artificial life, evolutionary computation, and systems chemistry. Springer International Publishing; 2017. p. 53–64. https://doi.org/10.1007/978-3-319-57711-1_5	spa
dc.relation.references	Cattaneo G, Petrillo UF, Giancarlo R, Roscigno G. An effective extension of the applicability of alignment-free biological sequence comparison algorithms with hadoop. J Supercomput 2016;73:1467–83. https://doi.org/10.1007/s11227-016-1835-3	spa
dc.relation.references	Chang YJ, Chen CC, Chen CL, Ho JM. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. In: BMC genomics, BioMed central; 2012. S28. https://doi.org/10.1186/1471-2164-13-S7-S28	spa
dc.relation.references	Chen Z, Hu J, Min G, Chen X. Effective data placement for scientific workflows in mobile edge computing using genetic particle swarm optimization. Concurrency Comput: Pract Ex 2019:e5413. https://doi.org/10.1002/cpe.5413	spa
dc.relation.references	Chervenak A, Deelman E, Foster I, Guy L, Hoschek W, Iamnitchi A, Kesselman C, Kunszt P, Ripeanu M, Schwartzkopf B, Stockinger H, Stockinger K, Tierney B. Giggle: a framework for constructing scalable replica location services. In: ACM/IEEE SC 2002 conference (SC’02). IEEE; 2002. https://doi.org/10.1109/sc.2002.10024	spa
dc.relation.references	Claerhout B, DeMoor G. Privacy protection for clinical and genomic data: the use of privacy-enhancing techniques in medicine. Int J Med Inf 2005;74:257–65.	spa
dc.relation.references	Cohen-Boulakia S, Belhajjame K, et al. Scientific workflows for computational reproducibility in the life sciences: status, challenges and opportunities. Future Generat Comput Syst 2017;75:284–98. https://doi.org/10.1016/j.future.2017.01.012	spa
dc.relation.references	Colosimo ME, Peterson MW, Mardis S, Hirschman L. Nephele: genotyping via complete composition vectors and MapReduce. Source Code Biol Med 2011;6. https://doi.org/10.1186/1751-0473-6-13	spa
dc.relation.references	Commission, E., Council. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data. http://data.europa.eu/eli/reg/2016/679/2016-05-04; 2016	spa
dc.relation.references	Congress of Colombia. Colombian data protection law. URL: https://www.funcionpublica.gov.co/eva/gestornormativo/norma.php?i=49981. [Accessed 16 September 2021]	spa
dc.relation.references	Consortium DS, Consortium DM, Mahajan A, et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature genetics 2014;46:234. https://doi.org/10.1038/ng.2897	spa
dc.relation.references	Cook CE, Lopez R, et al. The european bioinformatics institute in 2018: tools, infrastructure and training. Nucleic Acids Res 2018;47:D15–22. https://doi.org/10.1093/nar/gky1124	spa
dc.relation.references	Cope JM, Trebon N, Tufo HM, Beckman P. Robust data placement in urgent computing environments. In: 2009 IEEE international symposium on parallel & distributed processing. IEEE; 2009. p. 1–13. https://doi.org/10.1109/IPDPS.2009.5160914	spa
dc.relation.references	Corpas M, Kovalevskaya NV, McMurray A, Nielsen FG. A fair guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol 2018;14:e1005873. https://doi.org/10.1371/journal.pcbi.1005873	spa
dc.relation.references	De Moor G, Claerhout B, De Meyer F. Privacy enhancing techniques. Method Inf Med 2003;42:148–53	spa
dc.relation.references	De Roure D, Belhajjame K, Missier P, Gómez-Pérez JM, Palma R, Ruiz JE, Hettne K, Roos M, Klyne G, Goble C. Towards the preservation of scientific workflows. In: iPRES 2011-8th international conference on preservation of digital objects. National Library Board Singapore and Nanyang Technology University; 2011. p. 228–31	spa
dc.relation.references	De Wit P, Pespeni MH, et al. The simple fool’s guide to population genomics via rna-seq: an introduction to high-throughput sequencing data analysis. Mol Eco Res 2012;12:1058–67. https://doi.org/10.1111/1755-0998.12003	spa
dc.relation.references	Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade: scalable sequence analysis with MapReduce. Bioinformatics 2015;31:2482–8. https://doi.org/10.1093/bioinformatics/btv179	spa
dc.relation.references	Deelman E, Gannon D, et al. Workflows and e-science: an overview of workflow system features and capabilities. Future Generat Comput Syst 2009;25:528–40. https://doi.org/10.1016/j.future.2008.06.012	spa
dc.relation.references	Deelman E, Vahi K, et al. Pegasus, a workflow management system for science automation. Future Generat Comput Syst 2015;46:17–35. https://doi.org/10.1016/j.future.2014.10.008	spa
dc.relation.references	Dolev S, Florissi P, et al. A survey on geographically distributed big-data processing using MapReduce. IEEE Transact Big Data 2019;5:60–80. https://doi.org/10.1109/tbdata.2017.2723473	spa
dc.relation.references	Dong G, Fu X, Li H, Pan X. An accurate sequence assembly algorithm for livestock, plants and microorganism based on spark. Int J Pattern Recognit Artif Intell 2017;31:1750024. https://doi.org/10.1142/s0218001417500240	spa
dc.relation.references	Ebrahimi M, Mohan A, Kashlev A, Lu S. Bdap: a big data placement strategy for cloud-based scientific workflows. In: 2015 IEEE first international conference on big data computing service and applications. IEEE; 2015. p. 105–14. https://doi.org/10.1109/BigDataService.2015.70	spa
dc.relation.references	Elmroth E, Hernández F, Tordsson J. Three fundamental dimensions of scientific workflow interoperability: model of computation, language, and execution environment. Future Generat Comput Syst 2010;26:245–56	spa
dc.relation.references	Fakas GJ, Karakostas B. A peer to peer (P2P) architecture for dynamic workflow management. Inf Software Technol 2004;46:423–31	spa
dc.relation.references	Fan J, Han F, Liu H. Challenges of big data analysis. Nat Sci Rev 2014;1:293–314. https://doi.org/10.1093/nsr/nwt032	spa
dc.relation.references	Federer LM, Lu YL, et al. Biomedical data sharing and reuse: attitudes and practices of clinical and scientific research staff. PLOS ONE 2015;10:e0129506. https://doi.org/10.1371/journal.pone.0129506	spa
dc.relation.references	Freire J, Bonnet P, Shasha D. Computational reproducibility: state-of-the-art, challenges, and database research opportunities. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data; 2012. p. 593–6	spa
dc.relation.references	Frye SV, Arkin MR, et al. Tackling reproducibility in academic preclinical drug discovery. Nat Rev Drug Discovery 2015;14:733–4. https://doi.org/10.1038/nrd4737	spa
dc.relation.references	Gil Y, Ratnakar V, et al. Wings: intelligent workflow-based design of computational experiments. IEEE Intell Syst 2011;26:62–72. https://doi.org/10.1109/mis.2010.9	spa
dc.relation.references	Gilbert S, Lynch N. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 2002;33:51–9. https://doi.org/10.1145/564585.564601	spa
dc.relation.references	Goecks J, Nekrutenko A, Taylor J, Team TG. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86. https://doi.org/10.1186/gb-2010-11-8-r86	spa
dc.relation.references	Goodman SN, Fanelli D, Ioannidis JPA. What does research reproducibility mean? Sci Translat Med 2016;8:341ps12. https://doi.org/10.1126/scitranslmed.aaf5027	spa
dc.relation.references	Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nature Rev Genet 2016;17:333	spa
dc.relation.references	Guo R, Zhao Y, Zou Q, et al. Bioinformatics applications on Apache spark. GigaScience 2018. https://doi.org/10.1093/gigascience/giy098	spa
dc.relation.references	National Institutes of Health, et al. Guidance: rigor and reproducibility in grant applications. 2017	spa
dc.relation.references	Huang H, Tata S, Prill RJ. BlueSNP: R package for highly scalable genome-wide association studies using hadoop clusters. Bioinformatics 2012;29:135–6. https://doi.org/10.1093/bioinformatics/bts647	spa
dc.relation.references	Huang L, Krüger J, Sczyrba A. Analyzing large scale genomic data on the cloud with sparkhit. Bioinformatics 2017;34:1457–65. https://doi.org/10.1093/bioinformatics/btx808	spa
dc.relation.references	Huang Y, Gottardo R. Comparability and reproducibility of biomedical data. Briefings Bioinfo 2012;14:391–401. https://doi.org/10.1093/bib/bbs078	spa
dc.relation.references	Hung CL, Lin YL, Hua GJ, Hu YC. CloudTSS: a TagSNP selection approach on cloud computing. In: Communications in computer and information science. Springer Berlin Heidelberg; 2011. p. 525–34. https://doi.org/10.1007/978-3-642-27180-9_64	spa
dc.relation.references	Hutson S. Data handling errors spur debate over clinical trial. Nature Med 2010;16:618. https://doi.org/10.1038/nm0610-618a	spa
dc.relation.references	Karim MR, Michel A, et al. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Briefings Bioinfo 2017;19:1035–50. https://doi.org/10.1093/bib/bbx039	spa
dc.relation.references	Khan A, Kim T, Byun H, Kim Y. Scispace: a scientific collaboration workspace for geo-distributed hpc data centers. Future Generat Comput Syst 2019;101:398–409.	spa
dc.relation.references	Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: a review of best practices and their practical application in cwlprov. GigaScience 2019;8:giz095	spa
dc.relation.references	Kim D, Vouk MA. Assessing run-time overhead of securing kepler. Procedia Comput Sci 2016;80:2281–6. https://doi.org/10.1016/j.procs.2016.05.412	spa
dc.relation.references	Kim JH. Genome data analysis. Springer Singapore; 2019. URL: https://www.springer.com/gp/book/9789811319419	spa
dc.relation.references	Köster J, Rahmann S. Snakemake–a scalable bioinformatics workflow engine. Bioinfo 2012;28:2520–2. https://doi.org/10.1093/bioinformatics/bts480	spa
dc.relation.references	Kuhn K, et al. The cancer biomedical informatics grid (cabig): infrastructure and applications for a worldwide research community. Medinfo 2007;1:330	spa
dc.relation.references	Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with myrna. Genome Biol 2010;11:R83. https://doi.org/10.1186/gb-2010-11-8-r83	spa
dc.relation.references	Langmead B, Schatz MC, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10:R134. https://doi.org/10.1186/gb-2009-10-11-r134	spa
dc.relation.references	Legislature CS. The California consumer privacy act of 2018. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180SB1121	spa
dc.relation.references	Leo S, Santoni F, Zanetti G. Biodoop: bioinformatics on hadoop. In: 2009 international conference on parallel processing workshops. IEEE; 2009. https://doi.org/10.1109/icppw.2009.37	spa
dc.relation.references	Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J. SNP detection for massively parallel whole-genome resequencing. Genome Res 2009;19:1124–32. https://doi.org/10.1101/gr.088013.108	spa
dc.relation.references	Li X, Zhang L, et al. A novel workflow-level data placement strategy for data-sharing scientific cloud workflows. IEEE Transact Serv Comput 2016. https://doi.org/10.1109/TSC.2016.2625247	spa
dc.relation.references	Liu J, Pacitti E, Valduriez P, Mattoso M. Parallelization of scientific workflows in the cloud. 2014	spa
dc.relation.references	Liu J, Pacitti E, Valduriez P, Mattoso M. A survey of data-intensive scientific workflow management. J Grid Comput 2015;13:457–93. https://doi.org/10.1007/s10723-015-9329-8	spa
dc.relation.references	Liu J, Pacitti E, Valduriez P, Mattoso M. Scientific workflow scheduling with provenance data in a multisite cloud. In: Transactions on large-scale data- and knowledge-centered systems XXXIII. Springer; 2017. p. 80–112	spa
dc.relation.references	Liu J, Pineda L, Pacitti E, Costan A, Valduriez P, Antoniu G, Mattoso M. Efficient scheduling of scientific workflows using hot metadata in a multisite cloud. IEEE Transact Knowl Data Eng 2019;31:1940–53. https://doi.org/10.1109/tkde.2018.2867857	spa
dc.relation.references	Liu X, Datta A. Towards intelligent data placement for scientific workflows in collaborative cloud environment. In: 2011 IEEE international symposium on parallel and distributed processing workshops and phd forum. IEEE; 2011. p. 1052–61. https://doi.org/10.1109/IPDPS.2011.259	spa
dc.relation.references	Liu Y, Zhang L, Ge N, Li G. A systematic literature review on federated learning: from a model quality perspective. 2020. arXiv preprint arXiv:2012.01973	spa
dc.relation.references	Lu S, Zhang J. Collaborative scientific workflows supporting collaborative science. Int J Bus Process Integrat Manag 2011;5:185. https://doi.org/10.1504/ijbpim.2011.040209	spa
dc.relation.references	Lu YY, Tang K, et al. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic acids research 2017;45:W554–9. https://doi.org/10.1093/nar/gkx351	spa
dc.relation.references	Malin BA, Emam KE, O’Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. 2013	spa
dc.relation.references	McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303. https://doi.org/10.1101/gr.107524.110	spa
dc.relation.references	McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In: Singh A, Zhu J, editors. Proceedings of the 20th international conference on artificial intelligence and statistics. Fort Lauderdale, FL, USA: PMLR; 2017. p. 1273–82. URL: http://proceedings.mlr.press/v54/mcmahan17a.html	spa
dc.relation.references	Moreau L, Missier P, Cheney J, Soiland-Reyes S. Prov-n: the provenance notation. 2013	spa
dc.relation.references	Nagappan M, Vouk MA. A model for sharing of confidential provenance information in a query based system. In: International provenance and annotation workshop. Springer; 2008. p. 62–9. https://doi.org/10.1007/978-3-540-89965-5_8	spa
dc.relation.references	Nguyen T, Shi W, Ruden D. CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping. BMC Res Notes 2011;4. https://doi.org/10.1186/1756-0500-4-171	spa
dc.relation.references	NHGRI-EBI. GWAS catalog. 2019. https://www.ebi.ac.uk/gwas/. [Accessed 20 September 2019]	spa
dc.relation.references	NIH-BMIC. NIH data sharing repositories. 2019. https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html. [Accessed 20 September 2019]	spa
dc.relation.references	Nordberg H, Bhatia K, Wang K, Wang Z. BioPig: a hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 2013;29:3014–9. https://doi.org/10.1093/bioinformatics/btt528	spa
dc.relation.references	NSF. Chapter XI - Other Post Award Requirements and Consideration. 2019. https://www.nsf.gov/pubs/policydocs/pappg19_1/pappg_11.jsp#XID4. [Accessed 20 June 2019]	spa
dc.relation.references	O’Brien AR, Saunders NFW, et al. VariantSpark: population scale clustering of genotype information. BMC Genom 2015;16. https://doi.org/10.1186/s12864-015-2269-7	spa
dc.relation.references	Pandey RV, Schlötterer C. DistMap: a toolkit for distributed short read mapping on a hadoop cluster. PLoS ONE 2013;8:e72614. https://doi.org/10.1371/journal.pone.0072614	spa
dc.relation.references	Papageorgiou L, Eleni P, et al. Genomic big data hitting the storage bottleneck. EMBnetjournal 2018;24:e910. https://doi.org/10.14806/ej.24.0.910	spa
dc.relation.references	Parks R, Chu CH, Xu H. Healthcare information privacy research: issues, gaps and what next? AMCIS; 2011	spa
dc.relation.references	Peteiro-Barral D, Guijarro-Berdiñas B. A survey of methods for distributed machine learning. Prog Artif Intell 2013;2:1–11	spa
dc.relation.references	Pineda-Morales L, Costan A, Antoniu G. Towards multi-site metadata management for geographically distributed cloud workflows. In: 2015 IEEE international conference on cluster computing. IEEE; 2015. p. 294–303. https://doi.org/10.1109/cluster.2015.49	spa
dc.relation.references	Pineda-Morales L, Liu J, Costan A, Pacitti E, Antoniu G, Valduriez P, Mattoso M. Managing hot metadata for scientific workflows on multisite clouds. In: 2016 IEEE international conference on big data (big data). IEEE; 2016. p. 390–7	spa
dc.relation.references	Pireddu L, Leo S, Zanetti G. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 2011;27:2159–60. https://doi.org/10.1093/bioinformatics/btr325	spa
dc.relation.references	Rasheed Z, Rangwala H. A map-reduce framework for clustering metagenomes. In: 2013 IEEE international symposium on parallel & distributed processing, workshops and phd forum. IEEE; 2013. https://doi.org/10.1109/ipdpsw.2013.100	spa
dc.relation.references	Rodriguez MA, Buyya R. Scientific workflow management system for clouds. In: Software architecture for big data and the cloud. Elsevier; 2017. p. 367–87. https://doi.org/10.1016/b978-0-12-805467-3.00018-1	spa
dc.relation.references	Ross RB, Thakur R, et al. Pvfs: a parallel file system for linux clusters. In: Proceedings of the 4th annual Linux showcase and conference; 2000. p. 391–430	spa
dc.relation.references	Salloum S, Dautov R, et al. Big data analytics on Apache spark. Int J Data Sci Anal 2016;1:145–64. https://doi.org/10.1007/s41060-016-0027-9	spa
dc.relation.references	Santana-Perez I, Pérez-Hernández MS. Towards reproducibility in scientific workflows: an infrastructure-based approach. Scientific Program 2015:1–11. https://doi.org/10.1155/2015/243180	spa
dc.relation.references	Schadt EE, Linderman MD, et al. Computational solutions to large-scale data management and analysis. Nature Rev Genet 2010;11:647–57. https://doi.org/10.1038/nrg2857	spa
dc.relation.references	Schatz MC. BlastReduce: high performance short read mapping with MapReduce. University of Maryland; 2008. http://cgis.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf	spa
dc.relation.references	Schatz MC, Sommer D, Kelley D, Pop M. De novo assembly of large genomes using cloud computing. In: Proceedings of the cold spring harbor biology of genomes conference; 2010	spa
dc.relation.references	Senturk IF, Balakrishnan P, et al. A resource provisioning framework for bioinformatics applications in multi-cloud environments. Future Generat Comput Syst 2018;78:379–91. https://doi.org/10.1016/j.future.2016.06.008	spa
dc.relation.references	Sharov AA, Schlessinger D, Ko MSH. ExAtlas: an interactive online tool for meta-analysis of gene expression data. J Bioinfo Comput Biol 2015;13:1550019. https://doi.org/10.1142/s0219720015500195	spa
dc.relation.references	Soiland-Reyes S, Alper P, Goble C. Tracking workflow execution with tavernaprov. In: PROV: three years later: Provenance Week 2016; 2016	spa
dc.relation.references	Stephens ZD, Lee SY, et al. Big data: astronomical or genomical? PLOS Biology 2015;13:e1002195. https://doi.org/10.1371/journal.pbio.1002195	spa
dc.relation.references	Tannenbaum T, Wright D, Miller K, Livny M. Condor: a distributed job scheduler. In: Beowulf cluster computing with windows; 2001. p. 307–50	spa
dc.relation.references	Taylor I, Shields M, Wang I, Harrison A. The triana workflow environment: architecture and applications. In: Workflows for e-Science. Springer; 2007. p. 320–39. https://doi.org/10.1007/978-1-84628-757-2_20	spa
dc.relation.references	Taylor IJ, Deelman E, et al. Workflows for e-Science: scientific workflows for grids, volume 1. Springer; 2007. https://doi.org/10.1007/978-1-84628-757-2	spa
dc.relation.references	Thain D, Tannenbaum T, Livny M. Distributed computing in practice: the condor experience. Concurr Comput: Pract Exp 2005;17:323–56. https://doi.org/10.1002/cpe.938	spa
dc.relation.references	Tommaso PD, Chatzou M, et al. Nextflow enables reproducible computational workflows. Nature Biotechnol 2017;35:316–9. https://doi.org/10.1038/nbt.3820	spa
dc.relation.references	Turakhia MP, Desai M, Hedlin H, Rajmane A, Talati N, Ferris T, Desai S, Nag D, Patel M, Kowey P, Rumsfeld JS, Russo AM, Hills MT, Granger CB, Mahaffey KW, Perez MV. Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: the apple heart study. Am Heart J 2019;207:66–75. https://doi.org/10.1016/j.ahj.2018.09.002. URL: https://www.sciencedirect.com/science/article/pii/S0002870318302710	spa
dc.relation.references	Union I. Communication from the commission to the european parliament, the council, the european economic and social committee and the committee of the regions. A new skills agenda for europe. 2014 [Brussels].	spa
dc.relation.references	Valduriez P, Mattoso M, Akbarinia R, Borges H, Camata J, Coutinho A, Gaspar D, Lemus N, Liu J, Lustosa H, et al. Scientific data analysis using data-intensive scalable computing: the scidisc project. In: LADaS: Latin America data science workshop. CEUR-WS.org; 2018	spa
dc.relation.references	Van Hung T, Chuanhe H. An effective data placement strategy in main-memory database cluster. In: 2011 second international conference on networking and distributed computing. IEEE; 2011. p. 93–8. https://doi.org/10.1109/ICNDC.2011.27	spa
dc.relation.references	Verbraeken J, Wolting M, Katzy J, Kloppenburg J, Verbelen T, Rellermeyer JS. A survey on distributed machine learning. ACM Comput Surv (CSUR) 2020;53:1–33	spa
dc.relation.references	Wang J, Crawl D, Altintas I. Kepler + hadoop. In: Proceedings of the 4th workshop on workflows in support of large-scale science - WORKS ’09. ACM Press; 2009. https://doi.org/10.1145/1645164.1645176	spa
dc.relation.references	Wang R, Li M, Peng L, Hu Y, Hassan MM, Alelaiwi A. Cognitive multi-agent empowering mobile edge computing for resource caching and collaboration. Future Generat Comput Syst 2020;102:66–74. https://doi.org/10.1016/j.future.2019.08.001. URL: https://www.sciencedirect.com/science/article/pii/S0167739X19318783	spa
dc.relation.references	Wang Y. Automating experimentation with distributed systems using generative techniques. Ph.D. thesis. University of Colorado at Boulder; 2006	spa
dc.relation.references	Wang Y, Carzaniga A, Wolf AL. Four enhancements to automated distributed system experimentation methods. In: Proceedings of the 30th international conference on Software engineering; 2008. p. 491–500	spa
dc.relation.references	Wiewiórka MS, Messina A, et al. SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2014;30:2652–3. https://doi.org/10.1093/bioinformatics/btu343	spa
dc.relation.references	Wilde M, Hategan M, et al. Swift: a language for distributed parallel scripting. Parallel Comput 2011;37:633–52. https://doi.org/10.1016/j.parco.2011.05.005.	spa
dc.relation.references	Wolstencroft K, Haines R, et al. The taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res 2013;41:W557–61. https://doi.org/10.1093/nar/gkt328	spa
dc.relation.references	Xiao Y, Zhou AC, Yang X, He B. Privacy-preserving workflow scheduling in geo-distributed data centers. Future Generat Comput Syst 2022;130:46–58	spa
dc.relation.references	Xie J, Yin S, et al. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In: 2010 IEEE international symposium on parallel & distributed processing, workshops and phd forum (IPDPSW). IEEE; 2010. p. 1–9. https://doi.org/10.1109/IPDPSW.2010.547088	spa
dc.relation.references	Xie T. Sea: a striping-based energy-aware strategy for data placement in raid-structured storage systems. IEEE Transact Comput 2008;57:748–61. https://doi.org/10.1109/TC.2008.27	spa
dc.relation.references	Xing EP, Ho Q, Dai W, Kim JK, Wei J, Lee S, Zheng X, Xie P, Kumar A, Yu Y. Petuum: a new platform for distributed machine learning on big data. IEEE Transact Big Data 2015;1:49–67. https://doi.org/10.1109/tbdata.2015.2472014	spa
dc.relation.references	Xu B, Gao J, Li C. An efficient algorithm for DNA fragment assembly in MapReduce. Biochem Biophys Res Commun 2012;426:395–8. https://doi.org/10.1016/j.bbrc.2012.08.101	spa
dc.relation.references	Xu B, Li C, Zhuang H, et al. DSA: scalable distributed sequence alignment system using SIMD instructions. In: 2017 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID), IEEE; 2017. https://doi.org/10.1109/ccgrid.2017.74	spa
dc.relation.references	Xu B, Li C, Zhuang H, et al. Efficient distributed smith-waterman algorithm based on Apache spark. In: 2017 IEEE 10th international conference on cloud computing (CLOUD). IEEE; 2017. https://doi.org/10.1109/cloud.2017.83	spa
dc.relation.references	Yu HF, Hsieh CJ, Chang KW, Lin CJ. Large linear classification when data cannot fit in memory. In: ACM Transactions on Knowledge Discovery from Data (TKDD); 2012. p. 1–23. 5	spa
dc.relation.references	Yu J, Buyya R. A taxonomy of workflow management systems for grid computing. J Grid Comput 2005;3:171–200. https://doi.org/10.1007/s10723-005-9010-8.	spa
dc.relation.references	Yuan D, Yang Y, Liu X, Chen J. A data placement strategy in scientific cloud workflows. Future Generat Comput Syst 2010;26:1200–14. https://doi.org/10.1016/j.future.2010.02.004	spa
dc.relation.references	Zhang D, Zhao L, Li B, et al. SEQSpark: a complete analysis tool for large-scale rare variant association studies using whole-genome and exome sequence data. The American J Human Genet 2017;101:115–22. https://doi.org/10.1016/j.ajhg.2017.05.017	spa
dc.relation.references	Zhang L, Gu S, Liu Y, Wang B, Azuaje F. Gene set analysis in the cloud. Bioinformatics 2011;28:294–5. https://doi.org/10.1093/bioinformatics/btr630	spa
dc.relation.references	Zhao G, Ling C, Sun D. SparkSW: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, IEEE; 2015. https://doi.org/10.1109/ccgrid.2015.55	spa
dc.relation.references	Zhao J, Gomez-Perez JM, Belhajjame K, Klyne G, Garcia-Cuesta E, Garrido A, Hettne K, Roos M, De Roure D, Goble C. Why workflows break—understanding and combating decay in taverna workflows. In: 2012 ieee 8th international conference on e-science. IEEE; 2012. p. 1–9	spa
dc.relation.references	Zhao Q, Xiong, et al. A new energy-aware task scheduling method for data-intensive applications in the cloud. J Network Comput Appl 2016;59:14–27. https://doi.org/10.1016/j.jnca.2015.05.001	spa
dc.relation.references	Zhao Y, Li Y, Raicu I, Lu S, Tian W, Liu H. Enabling scalable scientific workflow management in the cloud. Future Generat Comput Syst 2015;46:3–16. https://doi.org/10.1016/j.future.2014.10.023.	spa
dc.relation.references	Zhou W, Li R, Yuan S, et al. MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes. Bioinformatics 2017. https://doi.org/10.1093/bioinformatics/btw750. btw750	spa
dc.relation.references	Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18. https://doi.org/10.1186/s13059-017-1319-7	spa
dc.relation.references	Zytnicki M, Quesneville H. S-MART, a software toolbox to aid RNA-seq data analysis. PLoS ONE 2011;6:e25988. https://doi.org/10.1371/journal.pone.0025988.	spa
dc.rights.accessrights	info:eu-repo/semantics/openAccess	spa
dc.subject.armarc	Biometría
dc.subject.armarc	Biometry
dc.subject.armarc	Análisis de la información
dc.subject.armarc	Information analysis
dc.subject.armarc	Investigación biomédica
dc.subject.armarc	Biomedical research
dc.subject.armarc	Tecnología médica
dc.subject.armarc	Medical technology
dc.subject.proposal	Distributed biomedical analyses	eng
dc.subject.proposal	Análisis biomédicos distribuidos	spa
dc.subject.proposal	Fully distributed collaborations	eng
dc.subject.proposal	Colaboraciones totalmente distribuidas	spa
dc.subject.proposal	Reproducibility	eng
dc.subject.proposal	Reproducibilidad	spa
dc.subject.proposal	Scalability Multi-site analyses	eng
dc.subject.proposal	Análisis de escalabilidad multisitio	spa
dc.subject.proposal	Distributed workflow analyses	eng
dc.subject.proposal	Análisis de flujo de trabajo distribuido	spa
dc.type.coar	http://purl.org/coar/resource_type/c_6501	spa
dc.type.content	Text	spa
dc.type.driver	info:eu-repo/semantics/article	spa

Files in this item

Name: A taxonomy of tools and approaches ...
Size: 1.784Mb
Format: PDF

This item appears in the following collection(s)

AD - CTG – Informática [89]
Clasificación B- Convocatoria 2018
