LLM-Style DevOps Copilot for Cloud-Native Troubleshooting: Retrieval-Augmented Runbook Generation and Command-Safety Evaluation

Boning Zhang; Xinzhuo Sun; Ge Liu; Binghua Zhou

doi:10.51903/jtie.v5i2.534

Authors

Boning Zhang (Computer Science, Georgetown University, DC, USA)
Xinzhuo Sun (Computer Engineering, Cornell Tech, NY, USA)
Ge Liu (Computer Science, USC, CA, USA)
Binghua Zhou (Computer Science, USC, CA, USA)

Keywords:

AIOps, Cloud-native troubleshooting, Command safety, Retrieval-augmented generation, SRE automation

Abstract

Cloud-native incident response requires engineers to connect symptoms, observability signals, infrastructure state, and safe remediation commands under time pressure. Large language models can draft runbooks, but an ungrounded assistant can invent commands, recommend the wrong diagnostic path, or reproduce destructive operational shortcuts. This paper evaluates an LLM-style, retrieval-augmented DevOps Copilot simulation for cloud-native troubleshooting on the canonical Szaid3680/Devops Arrow export. The experiment indexes all 42,819 rows with the public Response, Instruction, and Prompt schema and evaluates a deterministic 400-query subset with TF-IDF, BM25, a compact dense-semantic baseline, RAG-style answer construction, reranking, command-safety checking, and the combined reranker-plus-checker pipeline. No live LLM inference is used in the executed experiment; the generation and checking components are deterministic so that the safety effects can be reproduced exactly. Results show that retrieval improves answer grounding but does not by itself guarantee safe automation: RAG-only reaches 0.2966 semantic similarity and emits matched unsafe command text at a rate of 0.0324. The command-safety checker reduces the matched unsafe command rate to 0.0000 for the declared rule set and keeps command validity at 0.9922. The full pipeline obtains 0.3051 semantic similarity, 0.4225 root-cause accuracy, 0.5825 root-category accuracy, and 0.0078 hallucinated-command rate. The findings support treating DevOps copilots as retrieval-grounded and policy-checked workflow systems rather than free-form chat agents.
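The deterministic command-safety check described in the abstract can be illustrated with a minimal sketch. The abstract does not list the declared rule set, so the patterns below are hypothetical stand-ins, and the function names (`is_unsafe`, `check_answer`, `unsafe_rate`) are illustrative rather than the authors' implementation. The sketch assumes the checker works line by line on generated answer text and that the unsafe rate is the fraction of answers containing at least one matched line.

```python
import re

# Hypothetical rule set: the paper's declared rules are not given in the
# abstract, so these patterns are assumptions chosen for illustration only.
UNSAFE_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),                 # recursive delete of root
    re.compile(r"\bkubectl\s+delete\s+(ns|namespace)\b"),  # namespace deletion
    re.compile(r"\bchmod\s+(-R\s+)?777\b"),              # world-writable permissions
    re.compile(r"\bgit\s+push\b.*--force\b"),            # forced history rewrite
]


def is_unsafe(line: str) -> bool:
    """Return True if the line matches any pattern in the rule set."""
    return any(p.search(line) for p in UNSAFE_PATTERNS)


def check_answer(answer: str) -> str:
    """Drop every line that matches an unsafe pattern and keep the rest."""
    kept = [ln for ln in answer.splitlines() if not is_unsafe(ln)]
    return "\n".join(kept)


def unsafe_rate(answers: list[str]) -> float:
    """Fraction of answers containing at least one matched unsafe line."""
    if not answers:
        return 0.0
    flagged = sum(any(is_unsafe(ln) for ln in a.splitlines()) for a in answers)
    return flagged / len(answers)


# Made-up example answers, not drawn from the evaluated dataset.
answers = [
    "kubectl get pods -n prod\nkubectl delete namespace prod",
    "kubectl describe pod web-0",
    "sudo rm -rf /",
    "chmod 644 file",
]
checked = [check_answer(a) for a in answers]
print(unsafe_rate(answers))  # 0.5
print(unsafe_rate(checked))  # 0.0
```

Because the check is pure pattern matching, a matched rate of zero after filtering holds only with respect to the patterns in the rule set, which mirrors the abstract's qualification "for the declared rule set".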
