Revelando patrones arquitectónicos implícitos en Infraestructura como código a través de la transferencia de conocimiento de repositorio de código
dc.contributor.advisor | Garzón A, Wilmer | |
dc.contributor.advisor | Benavides Navarro, Luis Daniel | |
dc.contributor.author | Díaz Chica, Luis Felipe | |
dc.date.accessioned | 2023-10-03T20:32:29Z | |
dc.date.available | 2023 | |
dc.date.available | 2023-10-03T20:32:29Z | |
dc.date.issued | 2023 | |
dc.identifier.uri | https://repositorio.escuelaing.edu.co/handle/001/2623 | |
dc.description | We introduce the concept of "implicit architectural patterns," which we define as knowledge related to architectural patterns that is not explicitly expressed in the code. We build a biased labeled dataset of 14,000 files with modern cloud architectural patterns. We used the dataset and fine-tuning techniques to train CodeBERT, UniXcoder, CodeT5, and RoBERTa, LLMs pre-trained on code. The trained models achieved an F1-score of 96% on average. We generated a second, unknown dataset for testing the fine-tuned models, revealing consistent predictions across the models. Notably, in their original state, the pre-trained models could not accurately identify and classify patterns. However, after fine-tuning, the models substantially improved their accuracy in classifying modern architectural patterns. We found that the most common patterns present in GitHub repositories are event-driven (34%), serverless (30%), object storage (16%), and microservices (10%). We used the analysis results to investigate further relationships between IaC components and cloud architectural patterns | eng |
dc.description.abstract | La infraestructura como código, o IaC por sus siglas en inglés (Infrastructure as Code), es un modelo de gestión de recursos en la nube por medio de especificaciones de código. En nuestra investigación buscamos extraer conocimiento implícito de los proyectos de IaC relacionado con los patrones de arquitectura que están siendo utilizados en la comunidad de código libre. Para esto hemos realizado un análisis del estado del arte en temas relacionados con el análisis estático de código con modelos de lenguaje de gran envergadura, también conocidos como Large Language Models (LLM), para posteriormente aplicar técnicas de transferencia de conocimiento a un conjunto de modelos pre-entrenados y categorizar los patrones de arquitectura encontrados en los proyectos de IaC. La transferencia de conocimiento es aplicada usando refinamiento (fine-tuning) y supervisión débil. Definimos un sistema de reglas que, según los componentes de infraestructura presentes en el proyecto, asigna un posible patrón de arquitectura. Este sistema de reglas es usado para construir un dataset inicial de 13200 archivos en 4 lenguajes de programación con sus respectivas etiquetas en 11 categorías de patrones de arquitectura. Hemos logrado encontrar una mejora significativa en la categorización de los patrones de arquitectura después de aplicar transferencia de conocimiento a los modelos pre-entrenados en código. UniXcoder y CodeBERT lograron alcanzar un F1-score de 0.96 durante el entrenamiento. Después de aplicar los modelos a un dataset desconocido encontramos que los patrones más usados son event-driven, serverless, microservicios y object storage dentro de la comunidad open source (GitHub). También el lenguaje de programación predominante en Cloud Development Kit (CDK) es TypeScript, seguido por Python.
Logramos evidenciar un buen rendimiento en la clasificación de los patrones usando seq2seq como la técnica de representación del código y modelos pre-entrenados basados en RoBERTa. | spa |
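The rule system the abstract describes, which maps infrastructure components found in a project file to a candidate architectural-pattern label for weak supervision, can be sketched as follows. This is a hypothetical illustration: the component keywords and the four pattern categories shown are assumptions for the sketch, not the thesis's actual rule set.

```python
# Hypothetical weak-labeling rule system: each architectural pattern is
# associated with component keywords; a file is labeled with the pattern
# whose keywords appear most often in its source text.
PATTERN_RULES = {
    "event-driven":   {"sns", "sqs", "eventbridge", "kinesis"},
    "serverless":     {"lambda", "cloudfunction", "faas"},
    "object-storage": {"s3", "bucket", "blobstore"},
    "microservices":  {"ecs", "kubernetes", "service_mesh"},
}

def label_iac_file(source: str) -> str:
    """Assign the pattern whose component keywords occur most frequently."""
    text = source.lower()
    scores = {
        pattern: sum(text.count(kw) for kw in keywords)
        for pattern, keywords in PATTERN_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(label_iac_file("resource aws_lambda_function handler { }"))  # → serverless
```

Labels produced this way are noisy by construction (hence "biased labeled dataset"); the thesis's fine-tuned models are trained on such labels and then evaluated on an unseen dataset.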
dc.format.extent | 101 páginas | spa |
dc.format.mimetype | application/pdf | spa |
dc.language.iso | spa | spa |
dc.publisher | Escuela Colombiana de Ingeniería | spa |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | spa |
dc.title | Revelando patrones arquitectónicos implícitos en Infraestructura como código a través de la transferencia de conocimiento de repositorio de código | spa |
dc.type | Trabajo de grado - Maestría | spa |
dc.type.version | info:eu-repo/semantics/publishedVersion | spa |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
oaire.awardtitle | Revelando patrones arquitectónicos implícitos en Infraestructura como Código (IaC) a través de la transferencia de conocimientos del repositorio de código | spa |
oaire.version | http://purl.org/coar/version/c_970fb48d4fbd8a85 | spa |
dc.description.degreelevel | Maestría | spa |
dc.description.degreename | Magíster en Informática | spa |
dc.identifier.url | https://catalogo.escuelaing.edu.co/cgi-bin/koha/opac-detail.pl?biblionumber=23583 | |
dc.publisher.faculty | Ingeniería de Sistemas | spa |
dc.publisher.place | Bogotá | spa |
dc.publisher.program | Maestría en Informática | spa |
dc.relation.indexed | N/A | spa |
dc.relation.references | Ahmad, A., Jamshidi, P., Pahl, C., 2013. A framework for acquisition and application of software architecture evolution knowledge. | spa |
dc.relation.references | Alexander, C., Ishikawa, S., Silverstein, M., 1977. A Pattern Language: Towns, Buildings, Construction. Center for Environmental Structure Berkeley, Calif.: Center for Environmental Structure series, OUP USA | spa |
dc.relation.references | Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M.S., Hasan, M., Van Essen, B.C., Awwal, A.A.S., Asari, V.K., 2019. A state-of-the-art survey on deep learning theory and architectures. Electronics 8. | spa |
dc.relation.references | Alon, U., Brody, S., Levy, O., Yahav, E., 2019. code2seq: Generating sequences from structured representations of code. | spa |
dc.relation.references | Alon, U., Zilberstein, M., Levy, O., Yahav, E., 2018a. code2vec: Learning distributed representations of code. | spa |
dc.relation.references | Alon, U., Zilberstein, M., Levy, O., Yahav, E., 2018b. A general path-based representation for predicting program properties. | spa |
dc.relation.references | Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M., Farhan, L., 2021. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data. | spa |
dc.relation.references | The future of cloud development - Ampt. https://www.getampt.com/blog/introducing-ampt/. | spa |
dc.relation.references | Aviv, I., Gafni, R., Sherman, S., Aviv, B., Sterkin, A., Bega, E., 2023. Infrastructure from code: The next generation of cloud lifecycle automation. IEEE Software 40, 42–49. | spa |
dc.relation.references | Babar, M., Gorton, I., Jeffery, R., 2005. Capturing and using software architecture knowledge for architecture-based software development, in: Fifth International Conference on Quality Software (QSIC'05), pp. 169–176. | spa |
dc.relation.references | Becker, M., Liang, S., Frank, A., 2021. Reconstructing implicit knowledge with language models, in: Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out. | spa |
dc.relation.references | Borovits, N., Kumara, I., Krishnan, P., Palma, S.D., Di Nucci, D., Palomba, F., Tamburri, D.A., van den Heuvel, W.J., 2020. Deepiac: Deep learning-based linguistic anti-pattern detection in IaC, in: Proceedings of the 4th ACM SIGSOFT International Workshop on Machine-Learning Techniques for Software-Quality Evaluation. | spa |
dc.relation.references | Briem, J.A., Smit, J., Sellik, H., Rapoport, P., 2019. Using distributed representation of code for bug detection. | spa |
dc.relation.references | Brock, A., Lim, T., Ritchie, J.M., Weston, N., 2017. Freezeout: Accelerate training by progressively freezing layers | spa |
dc.relation.references | McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language models are few-shot learners. | spa |
dc.relation.references | Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications. | spa |
dc.relation.references | Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Transactions on Software Engineering 48, 2086–2104 | spa |
dc.relation.references | Dalla Palma, S., Di Nucci, D., Tamburri, D.A., 2020. Ansiblemetrics: A python library for measuring infrastructure-as-code blueprints in ansible. | spa |
dc.relation.references | De Lauretis, L., 2019. From monolithic architecture to microservices architecture, in: 2019 IEEE International Symposium on Software Reliability Engineering Workshops. | spa |
dc.relation.references | Du, X., Cai, Y., Wang, S., Zhang, L., 2016. Overview of deep learning, in: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation | spa |
dc.relation.references | Fadlullah, Z.M., Tang, F., Mao, B., Kato, N., Akashi, O., Inoue, T., Mizutani, K., 2017. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s intelligent network traffic control systems | spa |
dc.relation.references | Fehling, C., Leymann, F., Retter, R., Schupeck, W., Arbitter, P., 2014. Cloud computing patterns. 2014 ed., Springer, Vienna, Austria. | spa |
dc.relation.references | Feitosa, D., Penca, M.T., Berardi, M., Boza, R.D., Andrikopoulos, V., 2023. Mining for cost awareness in the infrastructure as code artifacts of cloud-based applica- tions: an exploratory study. | spa |
dc.relation.references | Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M., 2020. Codebert: A pre-trained model for programming and natural languages. | spa |
dc.relation.references | Galassi, A., Lippi, M., Torroni, P., 2021. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems | spa |
dc.relation.references | Gamma, E., Helm, R., Larman, C., Johnson, R., Vlissides, J., 2005. Valuepack: Design Patterns: Elements of Reusable Object-Oriented Software with Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. Addison Wesley. | spa |
dc.relation.references | Georgousis, S., Kenning, M.P., Xie, X., 2021. Graph deep learning: State of the art and challenges. IEEE Access 9, 22106–22140 | spa |
dc.relation.references | Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. | spa |
dc.relation.references | Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., Chen, T., 2018. | spa |
dc.relation.references | Guerriero, M., Garriga, M., Tamburri, D.A., Palomba, F., 2019. Adoption, support, and challenges of infrastructure-as-code: Insights from industry, in: 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME) | spa |
dc.relation.references | Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J., 2022. Unixcoder: Unified cross-modal pre-training for code representation | spa |
dc.relation.references | Clement, C., Drain, D., Sundaresan, N., Yin, J., Jiang, D., Zhou, M., 2021. Graphcodebert: Pre-training code representations with data flow. | spa |
dc.relation.references | Hao, W., Bie, R., Guo, J., Meng, X., Wang, S., 2018. Optimized cnn based image recognition through target region selection. | spa |
dc.relation.references | Hasan, M.M., Bhuiyan, F.A., Rahman, A., 2020. Testing practices for infrastructure as code, in: Proceedings of the 1st ACM SIGSOFT International Workshop on Languages and Tools for Next-Generation Testing, Association for Computing Machinery, New York, NY, USA. p. 7–12 | spa |
dc.relation.references | Joshi, A.V., 2020. Amazon’s Machine Learning Toolkit: Sagemaker. Springer In- ternational Publishing, Cham. pp. 233–243. URL | spa |
dc.relation.references | Kagdi, H., Collard, M.L., Maletic, J.I., 2007. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice 19, 77–131. | spa |
dc.relation.references | Kaliyar, R.K., 2020. A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of bert, in: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pp. 336–340. | spa |
dc.relation.references | Karamanolakis, G., Mukherjee, S., Zheng, G., Awadallah, A.H., 2021. Self-training with weak supervision. CoRR abs/2104.05514 | spa |
dc.relation.references | Karras, T., Aila, T., Laine, S., Lehtinen, J., 2017. Progressive growing of GANs for improved quality, stability, and variation. CoRR abs/1710.10196. | spa |
dc.relation.references | Keery, S., Harber, C., Young, M., 2019. Implementing Cloud Design Patterns for AWS: Solutions and design ideas for solving system design problems. Packt Publishing, Limited. | spa |
dc.relation.references | Kovalenko, V., Bogomolov, E., Bryksin, T., Bacchelli, A., 2019. Pathminer: A library for mining of path-based representations of code, in: 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 13– 17 | spa |
dc.relation.references | Land, L., Aurum, A., Handzic, M., 2001. Capturing implicit software engineering knowledge, in: Proceedings 2001 Australian Software Engineering Conference, pp. 108–114. | spa |
dc.relation.references | Linthicum, D.S., 2017. Cloud-native applications and cloud migration: The good, the bad, and the points between. IEEE Cloud Computing 4, 12–14 | spa |
dc.relation.references | Liu, Y., Agarwal, S., Venkataraman, S., 2021. Autofreeze: Automatically freezing model blocks to accelerate fine-tuning | spa |
dc.relation.references | Maffort, C., Valente, M.T., Bigonha, M., Hora, A., Anquetil, N., Menezes, J., 2013. Mining Architectural Patterns Using Association Rules, in: International Conference on Software Engineering and Knowledge Engineering (SEKE'13), Boston, United States. | spa |
dc.relation.references | Mistrik, I., Bahsoon, R., Ali, N., Heisel, M., Maxim, B., 2017. Software architecture for Big Data and the cloud. | spa |
dc.relation.references | Niu, C., Li, C., Ng, V., Ge, J., Huang, L., Luo, B., 2022. Spt-code: Sequence-to-sequence pre-training for learning source code representations, in: Proceedings of the 44th International Conference on Software Engineering, Association for Computing Machinery, New York, NY, USA, pp. 2006–2018. | spa |
dc.relation.references | Opdebeeck, R., Zerouali, A., Velázquez-Rodríguez, C., De Roover, C., 2021. On the practice of semantic versioning for ansible galaxy roles: An empiri- cal study and a change classification model. Journal of Systems and Software 182, 111059 | spa |
dc.relation.references | Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., Ward, R., 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 694–707. | spa |
dc.relation.references | Perez, Q., Le Borgne, A., Urtado, C., Vauttier, S., 2021. Towards profiling runtime architecture code contributors in software projects, in: Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering. | spa |
dc.relation.references | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019a. Language models are unsupervised multitask learners. | spa |
dc.relation.references | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al., 2019b. Language models are unsupervised multitask learners. OpenAI blog 1, 9. | spa |
dc.relation.references | Rahman, A., Mahdavi-Hezaveh, R., Williams, L., 2019. A systematic mapping study of infrastructure as code research. Information and Software Technology 108, 65–77 | spa |
dc.relation.references | The RedMonk Programming Language Rankings: January 2023. redmonk.com. | spa |
dc.relation.references | Rühling Cachay, S., Boecking, B., Dubrawski, A., 2021. End-to-end weak supervision, in: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc., pp. 1845–1857. | spa |
dc.relation.references | Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S., 2018. Recent advances in recurrent neural networks. | spa |
dc.relation.references | Savidis, A., Savvaki, K., 2021. Software architecture mining from source code with dependency graph clustering and visualization | spa |
dc.relation.references | Schmidt, F., MacDonell, S.G., Connor, A.M., 2014. An automatic architecture re- construction and refactoring framework, in: International Conference on Software Engineering Research and Applications. | spa |
dc.relation.references | Schuster, M., Paliwal, K., 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 2673–2681. | spa |
dc.relation.references | Sehovac, L., Grolinger, K., 2020. Deep learning for load forecasting: Sequence to sequence recurrent neural networks with attention. IEEE Access 8, 36411–36426 | spa |
dc.relation.references | Sharma, A., Kumar, M., Agarwal, S., 2015. A complete survey on software archi- tectural styles and patterns. Procedia Computer Science 70, 16–28 | spa |
dc.relation.references | Sharma, S., Sharma, S., Athaiya, A., 2017. Activation functions in neural networks. Towards Data Sci 6, 310–316 | spa |
dc.relation.references | Shin, C., Li, W., Vishwakarma, H., Roberts, N.C., Sala, F., 2021. Universalizing weak supervision. CoRR abs/2112.03865 | spa |
dc.relation.references | Shrestha, A., Mahmood, A., 2019. Review of deep learning algorithms and archi- tectures. IEEE Access 7, 53040–53065. | spa |
dc.relation.references | Siow, J.K., Liu, S., Xie, X., Meng, G., Liu, Y., 2022. Learning program semantics with code representations: An empirical study, in: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). | spa |
dc.relation.references | Smite, D., Moe, N.B., Levinta, G., Floryan, M., 2019. Spotify guilds: How to succeed with knowledge sharing in large-scale agile organizations. IEEE Software 36, 51–57. | spa |
dc.relation.references | Sriram, A., Jun, H., Satheesh, S., Coates, A., 2017. Cold fusion: Training seq2seq models together with language models. | spa |
dc.relation.references | Sundararaman, D., Subramanian, V., Wang, G., Si, S., Shen, D., Wang, D., Carin, L., 2019. Syntax-infused transformer and bert models for machine translation and natural language understanding. | spa |
dc.relation.references | Taibi, D., El Ioini, N., Pahl, C., Niederkofler, J.R.S., 2020. Serverless cloud computing (function-as-a-service) patterns: A multivocal literature review, in: Proceedings of the 10th International Conference on Cloud Computing and Services Science (CLOSER'20). | spa |
dc.relation.references | Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need | spa |
dc.relation.references | Wan Mohd Isa, W.A.R., Suhaimi, A.I.H., Noordin, N., Harun, A., Ismail, J., Teh, R., 2019. Cloud computing adoption reference model. Indonesian Journal of Electrical Engineering and Computer Science 16, 395. | spa |
dc.relation.references | Wang, Y., Wang, W., Joty, S., Hoi, S.C.H., 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. | spa |
dc.relation.references | Washizaki, H., Ogata, S., Hazeyama, A., Okubo, T., Fernandez, E.B., Yoshioka, N., 2020. Landscape of architecture and design patterns for iot systems. IEEE Internet of Things Journal 7, 10091–10101 | spa |
dc.relation.references | Yussupov, V., Soldani, J., Breitenbücher, U., Brogi, A., Leymann, F., 2021. From serverful to serverless: A spectrum of patterns for hosting application components, pp. 268–279 | spa |
dc.relation.references | Zeng, C., Yu, Y., Li, S., Xia, X., Wang, Z., Geng, M., Xiao, B., Dong, W., Liao, X., 2021. degraphcs: Embedding variable-based flow graph for neural code search | spa |
dc.relation.references | Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., Liu, X., 2019. A novel neural source code representation based on abstract syntax tree, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794 | spa |
dc.relation.references | Zhang, X., Fan, J., Hei, M., 2022. Compressing bert for binary text classification via adaptive truncation before fine-tuning. Applied Sciences 12. | spa |
dc.relation.references | Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2021. A comprehensive survey on transfer learning. Proceedings of the IEEE 109, 43–76. | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
dc.rights.creativecommons | Atribución 4.0 Internacional (CC BY 4.0) | spa |
dc.subject.armarc | Infraestructura como código | |
dc.subject.armarc | Conocimiento implícito | |
dc.subject.armarc | Transferencia de conocimiento | |
dc.subject.armarc | Patrones de arquitectura | |
dc.subject.armarc | Modelos de lenguaje | |
dc.subject.proposal | Infraestructura como código | spa |
dc.subject.proposal | Conocimiento implícito | spa |
dc.subject.proposal | Transferencia de conocimiento | spa |
dc.subject.proposal | Patrones de arquitectura | spa |
dc.subject.proposal | Modelos de lenguaje | spa |
dc.subject.proposal | Infrastructure as code | eng |
dc.subject.proposal | Implicit knowledge | eng |
dc.subject.proposal | Knowledge transfer | eng |
dc.subject.proposal | Architecture patterns | eng |
dc.subject.proposal | Language models | eng |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | spa |
dc.type.content | Text | spa |
dc.type.driver | info:eu-repo/semantics/masterThesis | spa |
dc.type.redcol | https://purl.org/redcol/resource_type/TM | spa |