LLM-Driven CI Failure Diagnosis and Automated Repair: From GitHub Actions Logs to Patch Recommendation

Hanqi Zhang

doi:10.51903/jtie.v4i1.484

Authors

Hanqi Zhang, Computer Science, University of Michigan at Ann Arbor, MI, USA


Keywords:

CI/CD, GitHub Actions, failure diagnosis, automated program repair, retrieval-augmented generation

Abstract

Continuous Integration (CI) pipelines surface regressions early but also produce long, noisy logs. Diagnosing a failing GitHub Actions run and drafting a safe repair patch can be time-consuming, especially when the cause is dependency drift or a configuration error. We study a practical CI-repair pipeline decomposed into three measurable tasks: (1) coarse failure-type classification, (2) retrieval-based repair, which uses log similarity to find the closest historical failure and reuses its fix diff, and (3) constrained patch generation, which emits a unified diff via template and slot filling. The pipeline follows the schema and task framing of JetBrains-Research's lca-ci-builds-repair dataset from Long Code Arena (212 samples). Because runtime restrictions in our environment prevent downloading the original Hugging Face-hosted parquet files, all quantitative results in this paper are evaluated on a locally generated proxy dataset, CI-Repair-Sim212, which matches the benchmark's field schema and evaluation protocol. On CI-Repair-Sim212, failure-type classification reaches a ceiling (Macro-F1=1.000), whereas repair-pattern prediction remains harder (Macro-F1=0.796 with log+workflow). For patch recommendation, retrieval achieves Token-F1@1=0.898 and Pattern@1=0.783 when combining logs with workflow context, and constrained generation further improves diff similarity to Token-F1=0.923. Across tasks, adding workflow YAML context yields consistent gains, motivating hybrid CI assistants that prioritize retrieval when near-duplicate failures exist and fall back to constrained generation when close matches are absent.
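The retrieval step and the token-level diff score described above can be illustrated with a minimal sketch. It is not the paper's implementation: the tokenizer, the bag-of-words cosine similarity, the Token-F1 definition (multiset token overlap between predicted and reference diffs), the record fields `log`, `workflow`, and `diff`, and the toy history entries are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_fix(log: str, workflow: str, history: list[dict]) -> str:
    # Return the fix diff of the historical failure whose log plus
    # workflow text is most similar to the query (top-1 retrieval).
    query = Counter(tokenize(log + "\n" + workflow))

    def score(record: dict) -> float:
        doc = Counter(tokenize(record["log"] + "\n" + record["workflow"]))
        return cosine(query, doc)

    return max(history, key=score)["diff"]


def token_f1(predicted: str, reference: str) -> float:
    # Assumed Token-F1: harmonic mean of precision and recall over
    # the multiset of tokens shared by the two diffs.
    pred, ref = Counter(tokenize(predicted)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy, hand-written history records (hypothetical, not from the dataset).
HISTORY = [
    {
        "log": "ModuleNotFoundError: No module named 'requests'",
        "workflow": "run: pip install -r requirements.txt",
        "diff": "+requests",
    },
    {
        "log": "Error: Unable to resolve action setup-python",
        "workflow": "uses: actions/setup-python",
        "diff": "-uses: actions/setup-python\n+uses: actions/setup-python@v5",
    },
]

if __name__ == "__main__":
    suggestion = retrieve_fix(
        "ModuleNotFoundError: No module named 'numpy'",
        "run: pip install -r requirements.txt",
        HISTORY,
    )
    print(suggestion)
    print(token_f1(suggestion, "+numpy"))
```

In this toy query the retrieved diff names the wrong package, so its token overlap with the reference is zero. That is the kind of case where the abstract suggests falling back to constrained generation, which would fill the package-name slot from the log instead of copying it from a neighbor.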

References

Beller, M., Gousios, G., & Zaidman, A. (2017). TravisTorrent: A Dataset of Travis CI Build Results. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), 447–450. https://doi.org/10.1109/msr.2017.29

Bogomolov, E., et al. (2024). Long Code Arena: A Set of Benchmarks for Long-Context Code Models. arXiv. https://doi.org/10.48550/arxiv.2406.11612

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models Are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://doi.org/10.48550/arxiv.2005.14165

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171–4186. https://doi.org/10.18653/v1/n19-1423

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., & Zhou, M. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages. Findings of the Association for Computational Linguistics: EMNLP 2020, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139

Fowler, M. (2006). Continuous Integration. MartinFowler.com. https://martinfowler.com/articles/continuousIntegration.html

GitHub. (2026). GitHub Actions Documentation. GitHub Docs. https://docs.github.com/en/actions

Hilton, M., Tunnell, T., Huang, K., Marinov, D., & Dig, D. (2016). Usage, Costs, and Benefits of Continuous Integration in Open Source Projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), 426–437. https://doi.org/10.1145/2970276.2970358

Hugging Face. (2026). Dataset Viewer: /Rows Endpoint (Datasets Server) Documentation. Hugging Face Docs. https://hugging-face.cn/docs/dataset-viewer/rows

Hugging Face. (2026). Xet: Our Storage Backend (Xet Storage Information for Hugging Face Repositories). Hugging Face Docs. https://hugging-face.co/docs/xet-storage

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=VTF8yNQM66

JetBrains Research. (2025). LCA CI Builds Repair [Data set]. Hugging Face. https://huggingface.co/datasets/jetbrains-research/lca-ci-builds-repair

Just, R., Jalali, D., & Ernst, M. D. (2014). Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2014), 437–440. https://doi.org/10.1145/2610384.2628055

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models Are Zero-Shot Reasoners. arXiv. https://doi.org/10.48550/arxiv.2205.11916

Long, F., & Rinard, M. (2016). Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2016), 298–312. https://doi.org/10.1145/2837614.2837617

OpenAI. (2023). GPT-4 Technical Report. arXiv. https://doi.org/10.48550/arxiv.2303.08774

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning With a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1–67. http://jmlr.org/papers/v21/20-074.html

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), 3982–3992. https://doi.org/10.18653/v1/d19-1410

Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019

Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., & Synnaeve, G. (2023). Code Llama: Open Foundation Models for Code. arXiv. https://doi.org/10.48550/arxiv.2308.12950

Silalahi, F. D., Putra, T. W. A., & Siswanto, E. (2022). Machine Learning Technique for Credit Card Scam Detection. Journal of Technology Informatics and Engineering, 1(1), 50–79. https://doi.org/10.51903/jtie.v1i1.143

Tufano, M., Watson, C., Bavota, G., Di Penta, M., White, M., & Poshyvanyk, D. (2019). An Empirical Study on Learning Bug Fixing Patches in the Wild via Neural Machine Translation. ACM Transactions on Software Engineering and Methodology, 28(4), 1–29. https://doi.org/10.1145/3345317

Vasilescu, B., Yu, Y., Wang, H., Devanbu, P., & Filkov, V. (2015). Quality and Productivity Outcomes Relating to Continuous Integration in GitHub. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), 805–816. https://doi.org/10.1145/2786805.2786860

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 5998–6008. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Wang, Y., Le, H., Gotmare, A. D., Bui, N. D. Q., Li, J., & Hoi, S. C. H. (2023). CodeT5+: Open Code Large Language Models for Code Understanding and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 1069–1088. https://doi.org/10.18653/v1/2023.emnlp-main.68

Weimer, W., Nguyen, T., Le Goues, C., & Forrest, S. (2009). Automatically Finding Patches Using Genetic Programming. In Proceedings of the 31st International Conference on Software Engineering (ICSE 2009), 364–374. https://doi.org/10.1109/icse.2009.5070521