LLM-Driven CI Failure Diagnosis and Automated Repair: From GitHub Actions Logs to Patch Recommendation
DOI:
https://doi.org/10.51903/jtie.v4i1.484
Keywords:
CI/CD, GitHub Actions, failure diagnosis, automated program repair, retrieval-augmented generation
Abstract
Continuous Integration (CI) pipelines surface regressions early but also produce long, noisy logs. Diagnosing a failing GitHub Actions run and drafting a safe repair patch can be time-consuming, especially when the failure stems from dependency drift or configuration errors. We study a practical CI-repair pipeline decomposed into three measurable tasks: (1) coarse failure-type classification, (2) retrieval-based repair, which uses log similarity to reuse the closest historical fix diff, and (3) constrained patch generation, which emits a unified diff via template-plus-slot filling. The pipeline follows the schema and task framing of JetBrains-Research’s lca-ci-builds-repair dataset from Long Code Arena (212 samples). Because runtime restrictions in our environment prevent downloading the original Hugging Face-hosted parquet files, all quantitative results in this paper are evaluated on a locally generated proxy dataset, CI-Repair-Sim212, which matches the benchmark’s field schema and evaluation protocol. On CI-Repair-Sim212, failure-type classification reaches a ceiling (Macro-F1=1.000), whereas repair-pattern prediction remains harder (Macro-F1=0.796 with log+workflow input). For patch recommendation, retrieval achieves Token-F1@1=0.898 and Pattern@1=0.783 when logs are combined with workflow context, and constrained generation further improves diff similarity to Token-F1=0.923. Across tasks, adding workflow YAML context yields consistent gains, motivating hybrid CI assistants that prioritize retrieval when near-duplicate failures exist and fall back to constrained generation when close matches are absent.
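The retrieval-based repair step and the Token-F1 metric named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the tokenizer, the similarity function, or the exact Token-F1 definition, so the word-level tokenizer, the Jaccard similarity, the multiset-overlap F1, the function names, and the toy history records below are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word-level tokens (assumed tokenizer, not the paper's)."""
    return re.findall(r"\w+", text.lower())


def token_f1(pred, ref):
    """Token-level F1 between two diffs, using multiset token overlap (assumed definition)."""
    p, r = Counter(tokenize(pred)), Counter(tokenize(ref))
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(a, b):
    """Jaccard similarity over token sets (stand-in for the paper's log similarity)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_fix(query_log, query_workflow, history):
    """Return the fix diff of the historical case closest to log + workflow context."""
    query = query_log + "\n" + query_workflow
    best = max(history, key=lambda h: jaccard(query, h["log"] + "\n" + h["workflow"]))
    return best["diff"]


# Made-up historical cases; field names are hypothetical, not the dataset schema.
history = [
    {
        "log": "ModuleNotFoundError: No module named 'requests'",
        "workflow": "run: pip install -r requirements.txt",
        "diff": "+requests",
    },
    {
        "log": "flake8 E501 line too long",
        "workflow": "run: flake8 .",
        "diff": "-max-line-length = 79\n+max-line-length = 100",
    },
]

suggested = retrieve_fix(
    "ModuleNotFoundError: No module named 'numpy'",
    "run: pip install -r requirements.txt",
    history,
)
```

In this toy case the closest historical failure is the other missing-module error, so its diff is reused even though the package name differs. That is the situation in which a hybrid assistant of the kind the abstract motivates would fall back to constrained generation to fill in the correct slot.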
License
Copyright (c) 2025 Hanqi Zhang

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

