Revelando patrones arquitectónicos implícitos en Infraestructura como código a través de la transferencia de conocimiento de repositorio de código
dc.contributor.advisor | Garzón A, Wilmer | |
dc.contributor.advisor | Benavides Navarro, Luis Daniel | |
dc.contributor.author | Díaz Chica, Luis Felipe | |
dc.date.accessioned | 2023-10-03T20:32:29Z | |
dc.date.available | 2023 | |
dc.date.available | 2023-10-03T20:32:29Z | |
dc.date.issued | 2023 | |
dc.identifier.uri | https://repositorio.escuelaing.edu.co/handle/001/2623 | |
dc.description | We introduce the concept of "implicit architectural patterns," which we define as knowledge related to architectural patterns that is not explicitly expressed in the code. We build a biased labeled dataset of 14,000 files with modern cloud architectural patterns. We used the dataset and fine-tuning techniques to train CodeBERT, UniXcoder, CodeT5, and RoBERTa, LLMs pre-trained on code. The trained models achieved an F1-score of 96% on average. We generated a second, unknown dataset for testing the fine-tuned models, revealing consistent predictions across the models. Notably, in their original state, the pre-trained models could not accurately identify and classify patterns. However, after fine-tuning, the models substantially improved their accuracy in classifying modern architectural patterns. We found that the most common patterns present in GitHub repositories are event-driven (34%), serverless (30%), object storage (16%), and microservices (10%). We used the analysis results to investigate further relationships between IaC components and cloud architectural patterns | eng |
dc.description.abstract | La infraestructura como código, o IaC por sus siglas en inglés (Infrastructure as Code), es un modelo de gestión de recursos en la nube por medio de especificaciones de código. En nuestra investigación buscamos extraer conocimiento implícito de los proyectos de IaC relacionado con los patrones de arquitectura que están siendo utilizados en la comunidad de código libre. Para esto hemos realizado un análisis del estado del arte en temas relacionados con el análisis estático de código con modelos de lenguaje de gran envergadura, también conocidos como Large Language Models (LLM), para posteriormente aplicar técnicas de transferencia de conocimiento a un conjunto de modelos pre-entrenados y categorizar los patrones de arquitectura encontrados en los proyectos de IaC. La transferencia de conocimiento es aplicada usando refinamiento (fine-tuning) y supervisión débil. Definimos un sistema de reglas que, según los componentes de infraestructura presentes en el proyecto, asigna un posible patrón de arquitectura. Este sistema de reglas es usado para construir un dataset inicial de 13200 archivos en 4 lenguajes de programación con sus respectivas etiquetas en 11 categorías de patrones de arquitectura. Hemos logrado encontrar una mejora significativa en la categorización de los patrones de arquitectura después de aplicar transferencia de conocimiento a los modelos pre-entrenados en código. UniXcoder y CodeBERT lograron alcanzar un F1-score de 0.96 durante el entrenamiento. Después de aplicar los modelos a un dataset desconocido encontramos que los patrones más usados son event-driven, serverless, microservicios y object storage dentro de la comunidad open source (GitHub). También el lenguaje de programación predominante en Cloud Development Kit (CDK) es TypeScript, seguido por Python.
Logramos evidenciar un buen rendimiento en la clasificación de los patrones usando seq2seq como la técnica de representación del código y modelos pre-entrenados basados en RoBERTa. | spa |
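The rule system the abstract describes, which maps infrastructure components found in a project file to a candidate architectural-pattern label for weak supervision, can be sketched as follows. This is a hypothetical illustration: the component keywords and the four pattern categories shown are assumptions for the sketch, not the thesis's actual rule set.

```python
# Hypothetical weak-labeling rule system: each architectural pattern is
# associated with component keywords; a file is labeled with the pattern
# whose keywords appear most often in its source text.
PATTERN_RULES = {
    "event-driven":   {"sns", "sqs", "eventbridge", "kinesis"},
    "serverless":     {"lambda", "cloudfunction", "faas"},
    "object-storage": {"s3", "bucket", "blobstore"},
    "microservices":  {"ecs", "kubernetes", "service_mesh"},
}

def label_iac_file(source: str) -> str:
    """Assign the pattern whose component keywords occur most frequently."""
    text = source.lower()
    scores = {
        pattern: sum(text.count(kw) for kw in keywords)
        for pattern, keywords in PATTERN_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(label_iac_file("resource aws_lambda_function handler { }"))  # → serverless
```

Labels produced this way are noisy by construction (hence "biased labeled dataset"); the thesis's fine-tuned models are trained on such labels and then evaluated on an unseen dataset.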
dc.format.extent | 101 páginas | spa |
dc.format.mimetype | application/pdf | spa |
dc.language.iso | spa | spa |
dc.publisher | Escuela Colombiana de Ingeniería | spa |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | spa |
dc.title | Revelando patrones arquitectónicos implícitos en Infraestructura como código a través de la transferencia de conocimiento de repositorio de código | spa |
dc.type | Trabajo de grado - Maestría | spa |
dc.type.version | info:eu-repo/semantics/publishedVersion | spa |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
oaire.awardtitle | Revelando patrones arquitectónicos implícitos en Infraestructura como Código (IaC) a través de la transferencia de conocimientos del repositorio de código | spa |
oaire.version | http://purl.org/coar/version/c_970fb48d4fbd8a85 | spa |
dc.description.degreelevel | Maestría | spa |
dc.description.degreename | Magíster en Informática | spa |
dc.identifier.url | https://catalogo.escuelaing.edu.co/cgi-bin/koha/opac-detail.pl?biblionumber=23583 | |
dc.publisher.faculty | Ingeniería de Sistemas | spa |
dc.publisher.place | Bogotá | spa |
dc.publisher.program | Maestría en Informática | spa |
dc.relation.indexed | N/A | spa |
dc.relation.references | Ahmad, A., Jamshidi, P., Pahl, C., 2013. A framework for acquisition and application of software architecture evolution knowledge. | spa |
dc.relation.references | Alexander, C., Ishikawa, S., Silverstein, M., 1977. A Pattern Language: Towns, Buildings, Construction. Center for Environmental Structure Berkeley, Calif.: Center for Environmental Structure series, OUP USA | spa |
dc.relation.references | Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M.S., Hasan, M., Van Essen, B.C., Awwal, A.A.S., Asari, V.K., 2019. A state-of-the-art survey on deep learning theory and architectures. Electronics 8. | spa |
dc.relation.references | Alon, U., Brody, S., Levy, O., Yahav, E., 2019. code2seq: Generating sequences from structured representations of code. | spa |
dc.relation.references | Alon, U., Zilberstein, M., Levy, O., Yahav, E., 2018a. code2vec: Learning distributed representations of code. | spa |
dc.relation.references | Alon, U., Zilberstein, M., Levy, O., Yahav, E., 2018b. A general path-based representation for predicting program properties. | spa |
dc.relation.references | Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M., Farhan, L., 2021. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data. | spa |
dc.relation.references | The future of cloud development - Ampt. https://www.getampt.com/blog/introducing-ampt/. | spa |
dc.relation.references | Aviv, I., Gafni, R., Sherman, S., Aviv, B., Sterkin, A., Bega, E., 2023. Infrastructure from code: The next generation of cloud lifecycle automation. IEEE Software 40, 42–49. | spa |
dc.relation.references | Babar, M., Gorton, I., Jeffery, R., 2005. Capturing and using software architecture knowledge for architecture-based software development, in: Fifth International Conference on Quality Software (QSIC'05), pp. 169–176. | spa |
dc.relation.references | Becker, M., Liang, S., Frank, A., 2021. Reconstructing implicit knowledge with language models, in: Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out. | spa |
dc.relation.references | Borovits, N., Kumara, I., Krishnan, P., Palma, S.D., Di Nucci, D., Palomba, F., Tamburri, D.A., van den Heuvel, W.J., 2020. Deepiac: Deep learning-based linguistic anti-pattern detection in IaC, in: Proceedings of the 4th ACM SIGSOFT International Workshop on Machine-Learning Techniques for Software-Quality Evaluation. | spa |
dc.relation.references | Briem, J.A., Smit, J., Sellik, H., Rapoport, P., 2019. Using distributed representation of code for bug detection. | spa |
dc.relation.references | Brock, A., Lim, T., Ritchie, J.M., Weston, N., 2017. Freezeout: Accelerate training by progressively freezing layers | spa |
dc.relation.references | McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language models are few-shot learners. | spa |
dc.relation.references | Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications. | spa |
dc.relation.references | Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Transactions on Software Engineering 48, 2086–2104 | spa |
dc.relation.references | Dalla Palma, S., Di Nucci, D., Tamburri, D.A., 2020. Ansiblemetrics: A python library for measuring infrastructure-as-code blueprints in ansible. | spa |
dc.relation.references | De Lauretis, L., 2019. From monolithic architecture to microservices architecture, in: 2019 IEEE International Symposium on Software Reliability Engineering Workshops. | spa |
dc.relation.references | Du, X., Cai, Y., Wang, S., Zhang, L., 2016. Overview of deep learning, in: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation | spa |
dc.relation.references | Fadlullah, Z.M., Tang, F., Mao, B., Kato, N., Akashi, O., Inoue, T., Mizutani, K., 2017. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s intelligent network traffic control systems | spa |
dc.relation.references | Fehling, C., Leymann, F., Retter, R., Schupeck, W., Arbitter, P., 2014. Cloud computing patterns. 2014 ed., Springer, Vienna, Austria. | spa |
dc.relation.references | Feitosa, D., Penca, M.T., Berardi, M., Boza, R.D., Andrikopoulos, V., 2023. Mining for cost awareness in the infrastructure as code artifacts of cloud-based applica- tions: an exploratory study. | spa |
dc.relation.references | Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M., 2020. Codebert: A pre-trained model for programming and natural languages. | spa |
dc.relation.references | Galassi, A., Lippi, M., Torroni, P., 2021. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems | spa |
dc.relation.references | Gamma, E., Helm, R., Larman, C., Johnson, R., Vlissides, J., 2005. Valuepack: Design Patterns: Elements of Reusable Object-Oriented Software with Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. Addison Wesley. | spa |
dc.relation.references | Georgousis, S., Kenning, M.P., Xie, X., 2021. Graph deep learning: State of the art and challenges. IEEE Access 9, 22106–22140 | spa |
dc.relation.references | Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. | spa |
dc.relation.references | Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., Chen, T., 2018. | spa |
dc.relation.references | Guerriero, M., Garriga, M., Tamburri, D.A., Palomba, F., 2019. Adoption, support, and challenges of infrastructure-as-code: Insights from industry, in: 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME) | spa |
dc.relation.references | Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J., 2022. Unixcoder: Unified cross-modal pre-training for code representation | spa |
dc.relation.references | Clement, C., Drain, D., Sundaresan, N., Yin, J., Jiang, D., Zhou, M., 2021. Graphcodebert: Pre-training code representations with data flow. | spa |
dc.relation.references | Hao, W., Bie, R., Guo, J., Meng, X., Wang, S., 2018. Optimized cnn based image recognition through target region selection. | spa |
dc.relation.references | Hasan, M.M., Bhuiyan, F.A., Rahman, A., 2020. Testing practices for infrastructure as code, in: Proceedings of the 1st ACM SIGSOFT International Workshop on Languages and Tools for Next-Generation Testing, Association for Computing Machinery, New York, NY, USA. p. 7–12 | spa |
dc.relation.references | Joshi, A.V., 2020. Amazon’s Machine Learning Toolkit: Sagemaker. Springer In- ternational Publishing, Cham. pp. 233–243. URL | spa |
dc.relation.references | Kagdi, H., Collard, M.L., Maletic, J.I., 2007. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice 19, 77–131. | spa |
dc.relation.references | Kaliyar, R.K., 2020. A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of bert, in: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pp. 336–340. | spa |
dc.relation.references | Karamanolakis, G., Mukherjee, S., Zheng, G., Awadallah, A.H., 2021. Self-training with weak supervision. CoRR abs/2104.05514 | spa |
dc.relation.references | Karras, T., Aila, T., Laine, S., Lehtinen, J., 2017. Progressive growing of GANs for improved quality, stability, and variation. CoRR abs/1710.10196. | spa |
dc.relation.references | Keery, S., Harber, C., Young, M., 2019. Implementing Cloud Design Patterns for AWS: Solutions and design ideas for solving system design problems. Packt Publishing, Limited. | spa |
dc.relation.references | Kovalenko, V., Bogomolov, E., Bryksin, T., Bacchelli, A., 2019. Pathminer: A library for mining of path-based representations of code, in: 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 13– 17 | spa |
dc.relation.references | Land, L., Aurum, A., Handzic, M., 2001. Capturing implicit software engineering knowledge, in: Proceedings 2001 Australian Software Engineering Conference, pp. 108–114. | spa |
dc.relation.references | Linthicum, D.S., 2017. Cloud-native applications and cloud migration: The good, the bad, and the points between. IEEE Cloud Computing 4, 12–14 | spa |
dc.relation.references | Liu, Y., Agarwal, S., Venkataraman, S., 2021. Autofreeze: Automatically freezing model blocks to accelerate fine-tuning | spa |
dc.relation.references | Maffort, C., Valente, M.T., Bigonha, M., Hora, A., Anquetil, N., Menezes, J., 2013. Mining Architectural Patterns Using Association Rules, in: International Conference on Software Engineering and Knowledge Engineering (SEKE'13), Boston, United States. | spa |
dc.relation.references | Mistrik, I., Bahsoon, R., Ali, N., Heisel, M., Maxim, B., 2017. Software architecture for Big Data and the cloud. | spa |
dc.relation.references | Niu, C., Li, C., Ng, V., Ge, J., Huang, L., Luo, B., 2022. Spt-code: Sequence-to-sequence pre-training for learning source code representations, in: Proceedings of the 44th International Conference on Software Engineering, Association for Computing Machinery, New York, NY, USA, pp. 2006–2018. | spa |
dc.relation.references | Opdebeeck, R., Zerouali, A., Velázquez-Rodríguez, C., De Roover, C., 2021. On the practice of semantic versioning for ansible galaxy roles: An empiri- cal study and a change classification model. Journal of Systems and Software 182, 111059 | spa |
dc.relation.references | Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., Ward, R., 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 694–707. | spa |
dc.relation.references | Perez, Q., Le Borgne, A., Urtado, C., Vauttier, S., 2021. Towards profiling runtime architecture code contributors in software projects, in: Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering. | spa |
dc.relation.references | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019a. Language models are unsupervised multitask learners. | spa |
dc.relation.references | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al., 2019b. Language models are unsupervised multitask learners. OpenAI blog 1, 9. | spa |
dc.relation.references | Rahman, A., Mahdavi-Hezaveh, R., Williams, L., 2019. A systematic mapping study of infrastructure as code research. Information and Software Technology 108, 65–77 | spa |
dc.relation.references | The RedMonk Programming Language Rankings: January 2023. redmonk.com. | spa |
dc.relation.references | Rühling Cachay, S., Boecking, B., Dubrawski, A., 2021. End-to-end weak supervision, in: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc., pp. 1845–1857. | spa |
dc.relation.references | Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S., 2018. Recent advances in recurrent neural networks. | spa |
dc.relation.references | Savidis, A., Savvaki, K., 2021. Software architecture mining from source code with dependency graph clustering and visualization | spa |
dc.relation.references | Schmidt, F., MacDonell, S.G., Connor, A.M., 2014. An automatic architecture re- construction and refactoring framework, in: International Conference on Software Engineering Research and Applications. | spa |
dc.relation.references | Schuster, M., Paliwal, K., 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 2673–2681. | spa |
dc.relation.references | Sehovac, L., Grolinger, K., 2020. Deep learning for load forecasting: Sequence to sequence recurrent neural networks with attention. IEEE Access 8, 36411–36426 | spa |
dc.relation.references | Sharma, A., Kumar, M., Agarwal, S., 2015. A complete survey on software archi- tectural styles and patterns. Procedia Computer Science 70, 16–28 | spa |
dc.relation.references | Sharma, S., Sharma, S., Athaiya, A., 2017. Activation functions in neural networks. Towards Data Sci 6, 310–316 | spa |
dc.relation.references | Shin, C., Li, W., Vishwakarma, H., Roberts, N.C., Sala, F., 2021. Universalizing weak supervision. CoRR abs/2112.03865 | spa |
dc.relation.references | Shrestha, A., Mahmood, A., 2019. Review of deep learning algorithms and archi- tectures. IEEE Access 7, 53040–53065. | spa |
dc.relation.references | Siow, J.K., Liu, S., Xie, X., Meng, G., Liu, Y., 2022. Learning program semantics with code representations: An empirical study, in: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). | spa |
dc.relation.references | Smite, D., Moe, N.B., Levinta, G., Floryan, M., 2019. Spotify guilds: How to succeed with knowledge sharing in large-scale agile organizations. IEEE Software 36, 51–57. | spa |
dc.relation.references | Sriram, A., Jun, H., Satheesh, S., Coates, A., 2017. Cold fusion: Training seq2seq models together with language models. | spa |
dc.relation.references | Sundararaman, D., Subramanian, V., Wang, G., Si, S., Shen, D., Wang, D., Carin, L., 2019. Syntax-infused transformer and bert models for machine translation and natural language understanding. | spa |
dc.relation.references | Taibi, D., El Ioini, N., Pahl, C., Niederkofler, J.R.S., 2020. Serverless cloud computing (function-as-a-service) patterns: A multivocal literature review, in: Proceedings of the 10th International Conference on Cloud Computing and Services Science (CLOSER'20). | spa |
dc.relation.references | Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need | spa |
dc.relation.references | Wan Mohd Isa, W.A.R., Suhaimi, A.I.H., Noordin, N., Harun, A., Ismail, J., Teh, R., 2019. Cloud computing adoption reference model. Indonesian Journal of Electrical Engineering and Computer Science 16, 395. | spa |
dc.relation.references | Wang, Y., Wang, W., Joty, S., Hoi, S.C.H., 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. | spa |
dc.relation.references | Washizaki, H., Ogata, S., Hazeyama, A., Okubo, T., Fernandez, E.B., Yoshioka, N., 2020. Landscape of architecture and design patterns for iot systems. IEEE Internet of Things Journal 7, 10091–10101 | spa |
dc.relation.references | Yussupov, V., Soldani, J., Breitenbücher, U., Brogi, A., Leymann, F., 2021. From serverful to serverless: A spectrum of patterns for hosting application components, pp. 268–279 | spa |
dc.relation.references | Zeng, C., Yu, Y., Li, S., Xia, X., Wang, Z., Geng, M., Xiao, B., Dong, W., Liao, X., 2021. degraphcs: Embedding variable-based flow graph for neural code search | spa |
dc.relation.references | Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., Liu, X., 2019. A novel neural source code representation based on abstract syntax tree, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794 | spa |
dc.relation.references | Zhang, X., Fan, J., Hei, M., 2022. Compressing bert for binary text classification via adaptive truncation before fine-tuning. Applied Sciences 12. | spa |
dc.relation.references | Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2021. A comprehensive survey on transfer learning. Proceedings of the IEEE 109, 43–76. | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
dc.rights.creativecommons | Atribución 4.0 Internacional (CC BY 4.0) | spa |
dc.subject.armarc | Infraestructura como código | |
dc.subject.armarc | Conocimiento implícito | |
dc.subject.armarc | Transferencia de conocimiento | |
dc.subject.armarc | Patrones de arquitectura | |
dc.subject.armarc | Modelos de lenguaje | |
dc.subject.proposal | Infraestructura como código | spa |
dc.subject.proposal | Conocimiento implícito | spa |
dc.subject.proposal | Transferencia de conocimiento | spa |
dc.subject.proposal | Patrones de arquitectura | spa |
dc.subject.proposal | Modelos de lenguaje | spa |
dc.subject.proposal | Infrastructure as code | eng |
dc.subject.proposal | Implicit knowledge | eng |
dc.subject.proposal | Knowledge transfer | eng |
dc.subject.proposal | Architecture patterns | eng |
dc.subject.proposal | Language models | eng |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | spa |
dc.type.content | Text | spa |
dc.type.driver | info:eu-repo/semantics/masterThesis | spa |
dc.type.redcol | https://purl.org/redcol/resource_type/TM | spa |