Distilling VMAF into an Edge-Deployable Quality Predictor: A Pilot Shot-Level Proxy with LLM-Ready Quality Tokens
DOI:
https://doi.org/10.51903/jtie.v4i2.522
Keywords:
VMAF distillation, edge video quality prediction, quality guard, regression detection, H.264/AVC
Abstract
This pilot study evaluates whether a compact student model can approximate VMAF well enough to support low-latency release guarding on edge-class CPU environments. The corpus comprises a 62.31-second Big Buck Bunny excerpt at 1280 × 720 and 25 fps, segmented into 13 shots. Twelve distorted variants were generated by crossing H.264/AVC and H.265/HEVC with 180p, 240p, and 360p delivery resolutions and two quality levels per codec-resolution pair, yielding 156 shot-level samples (13 shots × 12 variants). Frame-level VMAF scores were aggregated into shot-level teacher labels, and a student proxy consumed 14 low-cost no-reference features derived from decoded frames and stream metadata. Shot-grouped five-fold cross-validation was used so that all variants of a shot fall in the same fold, preventing content leakage across train-test splits. On this corpus, a 50-tree gradient-boosted decision tree achieved MAE = 6.56 VMAF points, RMSE = 8.32 VMAF points, and Pearson r = 0.913. Relative to simple regressors, the student reduced MAE by approximately 21.5% versus bitrate-only regression and 10.7% versus metadata-only regression. In a single CPU-only benchmark, predictor latency was 0.484 ms per sample, and the full decode-feature-predict chain averaged 42.61 ms versus 1117.41 ms for the teacher, corresponding to a 26.22× end-to-end speed-up. As a thresholded guard, the same student reached F1 = 0.826, 0.893, and 0.900 at thresholds of 60, 70, and 80 VMAF, respectively. These findings support the feasibility of a practical edge proxy on this specific pilot corpus, but they should not be interpreted as evidence of broad generalization across content classes or production ladders. The paper also introduces an LLM-ready token interface intended for downstream reporting rather than for replacing the underlying quality measurement.
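The evaluation protocol summarized in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names `shot_grouped_folds` and `guard_f1` are hypothetical, the round-robin fold assignment is an assumption (the abstract does not say how shots were distributed across folds), and treating "VMAF below the threshold" as the positive class is also an assumption. The toy score lists are illustrative only and are not data from the paper.

```python
from collections import defaultdict


def shot_grouped_folds(shot_ids, n_folds=5):
    """Assign samples to folds by shot, so no shot spans both train and test.

    Assumption: shots are dealt to folds round-robin; the paper's exact
    assignment is not stated in the abstract.
    """
    unique_shots = sorted(set(shot_ids))
    fold_of_shot = {shot: i % n_folds for i, shot in enumerate(unique_shots)}
    folds = defaultdict(list)
    for idx, shot in enumerate(shot_ids):
        folds[fold_of_shot[shot]].append(idx)
    return [folds[k] for k in range(n_folds)]


def guard_f1(teacher, student, threshold):
    """F1 of the student as a guard against the teacher at one VMAF threshold.

    Assumption: a sample is "positive" when its score is below the threshold.
    """
    tp = fp = fn = 0
    for t, s in zip(teacher, student):
        actual, predicted = t < threshold, s < threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Corpus layout from the abstract: 13 shots x 12 distorted variants = 156 samples.
shot_ids = [shot for shot in range(13) for _ in range(12)]
folds = shot_grouped_folds(shot_ids, n_folds=5)

# Toy scores (illustrative only, not from the paper).
toy_teacher = [55.0, 65.0, 75.0, 85.0]
toy_student = [58.0, 72.0, 68.0, 90.0]
toy_f1 = guard_f1(toy_teacher, toy_student, threshold=70.0)

# End-to-end speed-up implied by the two reported latencies (ms).
speed_up = 1117.41 / 42.61
```

Grouping by shot means every held-out fold contains only content the model has never seen in any encoded form, which is the leakage the abstract refers to. The last line reproduces the reported ratio: 1117.41 ms / 42.61 ms rounds to 26.22.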
License
Copyright (c) 2025 Xiaohan Chang, Heyu Wang

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

