M - Doctorado en Ingeniería
https://repositorio.escuelaing.edu.co/handle/001/1096
2024-03-28T08:34:51Z

Secure distributed workflows for biomedical data analysis
Garzón Alfonso, Wilmer
https://repositorio.escuelaing.edu.co/handle/001/2608
2024-03-04T21:22:31Z
2023-01-01T00:00:00Z
Abstract: In recent years, the amount of biomedical data collected and stored has grown significantly. Analysis of these large amounts of data can no longer be performed by single individuals or organizations. Therefore, the scientific community is creating global collaborative efforts to analyze this data. However, biomedical data is subject to several legal and socioeconomic restrictions that hinder the possibilities of research collaboration.
In this thesis, we first investigate and show that researchers require new tools and techniques to address the constraints and needs of global scientific collaborations on geo-distributed biomedical data. In particular, we identify three types of constraints related to global collaborations, namely technical, legal, and socioeconomic constraints. We also investigate the state of the art of current tools for distributed global biomedical analyses, including tools using machine learning techniques, and show their limitations.
Based on these findings, we propose Fully Distributed Collaborations (FDC): research efforts that leverage the means to exploit and analyze biomedical information in a massive and collaborative way while respecting legal and socioeconomic restrictions. We investigate the concept, properties, and characteristics of FDC systems, as well as their architectural requirements and security and privacy needs. As a first example of designing FDC-based tools, we propose a fully distributed machine learning strategy. The strategy considers a random forest training algorithm in which several geographically distributed sites, each keeping its own private data, collaboratively train a global model without sharing private information. The proposed algorithm, called MuSiForest, improves on other existing multi-site forest approaches by reducing computation time and the amount of data shared, while achieving a training accuracy close to that of centralized random forest techniques.
Finally, we investigate how workflow systems have been widely used to specify biomedical data analyses and show the current limitations of those tools. We show that they offer limited means to define, deploy, and run multi-site studies on today's distributed infrastructure while respecting data ownership and privacy restrictions. Next, we propose FeDeRa, a language for specifying, implementing, and executing FDC-compliant multi-site scientific workflows. The language is enriched with abstractions for implementing analyses across geographically distributed cloud infrastructures and with abstractions for defining complex workflow patterns across multi-site boundaries. FeDeRa natively supports dataflow programming and declarative concurrency. We also present the implementation of a runtime engine that supports running FeDeRa workflows and experiments deployed on cloud infrastructure.

Water quality assessment of hot springs and its discharges into the tributaries fed into the Bogotá river
Sánchez Londoño, Yuly Andrea
https://repositorio.escuelaing.edu.co/handle/001/2491
2024-03-04T21:19:57Z
2023-01-01T00:00:00Z
Hot springs are used for recreational, therapeutic, and medicinal purposes. Due to their mineralization and water temperature, they convey a feeling of well-being that makes them a great tourist attraction. However, bathers do not know the quality of the waters, so it is necessary to identify the possible risks to which they are exposed. In Colombia, hot springs are not disinfected before human use, their wastewater is discharged directly into rivers, and it is not customary to monitor their water quality monthly. For this reason, it is important in Colombia to advance research on risk indicators for human health and for the quality of aquatic ecosystems.
In this thesis, two types of hot springs that discharge into tributaries of the Bogotá river were studied. Their waters were monitored for six months to determine the physical, chemical, and microbiological quality of the water and to assess their compliance with drinking water, swimming pool, hot spring, and wastewater discharge standards. The results showed that each hot spring had its own unique water characteristics, with hot spring 1 being bicarbonate and hot spring 2 being ferruginous.
The values of the heavy metals analyzed (arsenic (As), chromium (Cr), mercury (Hg), lead (Pb), aluminum (Al), copper (Cu), iron (Fe), magnesium (Mg), manganese (Mn), nickel (Ni), zinc (Zn), strontium (Sr), and calcium (Ca)) in the samples from hot springs 1 and 2 were compared with regulations for drinking water and swimming pools. These values were within the limits established by the regulations, with the exception of high iron values in hot spring 2, which are mainly an aesthetic problem for the water.
Parameter analysis revealed that hot spring 1 had high conductivities and pH values higher than 7.0, while hot spring 2 had lower conductivities and pH values between 6.0 and 7.7. Dissolved oxygen levels in both hot springs remained within acceptable ranges. Hot spring 2 exhibited high turbidity and color values due to the ferruginous properties of the water. Oils and fats were not quantifiable in either hot spring. Hot spring 1 presented calcium carbonate values above the regulatory limits for drinking water and swimming pools.
The results also indicated the presence of fecal coliforms (E. coli), total coliforms, Enterococci, L. pneumophila, P. aeruginosa, molds, and yeasts in all samples, exceeding the permissible limits.
UV disinfection of hot springs, which requires no chemicals, showed promising results. Laboratory tests demonstrated high removal efficiencies, reaching up to 99.9% for bacteria such as E. coli. This method preserves the physical and chemical properties of the water, ensuring that it remains in its natural state. However, it is important to note that UV disinfection can be costly, particularly for smaller hot springs.
Overall, this study provides valuable insights into the water quality of the analyzed hot springs and highlights the need for appropriate regulations and treatment measures to ensure the safety and compliance of hot springs with respect to human health and environmental protection. Further research and implementation of effective management strategies are necessary to address the specific water characteristics and contamination issues associated with hot springs in order to safeguard public health and preserve the integrity of aquatic ecosystems.