Distilling VMAF into an Edge-Deployable Quality Predictor: A Pilot Shot-Level Proxy with LLM-Ready Quality Tokens

Binghua Zhou; Heyu Wang; Xiaohan Chang

doi:10.51903/jtie.v4i2.522

Authors

Binghua Zhou, Computer Science, University of Southern California, CA, USA
Heyu Wang, Electrical and Computer Engineering, Rice University, TX, USA
Xiaohan Chang, Computer Science, University of Connecticut, CT, USA

Keywords:

VMAF distillation, edge video quality prediction, quality guard, regression detection, H.264/AVC

Abstract

This pilot study evaluates whether a compact student model can approximate VMAF well enough to support low-latency release guarding on edge-class CPU environments. The corpus comprises a 62.31-second Big Buck Bunny excerpt at 1280 × 720 and 25 fps, segmented into 13 shots. Twelve distorted variants were generated by crossing H.264/AVC and H.265/HEVC with 180p, 240p, and 360p delivery resolutions and two quality levels per codec-resolution pair, yielding 156 shot-level samples. Frame-level VMAF scores were aggregated into shot-level teacher labels, and a student proxy consumed 14 low-cost no-reference features derived from decoded frames and stream metadata. Shot-grouped five-fold cross-validation was used to prevent content leakage across train-test splits. On this corpus, a 50-tree gradient-boosted decision tree achieved MAE = 6.56 VMAF points, RMSE = 8.32, and Pearson r = 0.913. Relative to simple regressors, the student reduced MAE by approximately 21.5% versus bitrate-only regression and 10.7% versus metadata-only regression. In a single CPU-only benchmark, predictor latency was 0.484 ms per sample and the full decode-feature-predict chain averaged 42.61 ms versus 1117.41 ms for the teacher, corresponding to a 26.22× end-to-end speed-up. As a thresholded guard, the same student reached F1 = 0.826, 0.893, and 0.900 at 60, 70, and 80 VMAF respectively. These findings support the feasibility of a practical edge proxy on this specific pilot corpus, but they should not be interpreted as broad generalization across content classes or production ladders. The paper also introduces an LLM-ready token interface intended for downstream reporting rather than for replacing the underlying quality measurement.
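The corpus layout and the shot-grouped split described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the quality-level labels, the dictionary fields, the helper name `shot_grouped_folds`, and the round-robin assignment of shots to folds are all assumptions made for illustration. It shows only that crossing 2 codecs, 3 resolutions, and 2 quality levels over 13 shots gives 156 samples, and that grouping by shot keeps every shot out of its own training set.

```python
from itertools import product

CODECS = ["h264", "h265"]
RESOLUTIONS = ["180p", "240p", "360p"]
QUALITY_LEVELS = ["q_low", "q_high"]  # hypothetical labels; the abstract only says "two quality levels"
N_SHOTS = 13
N_FOLDS = 5

# One sample per (shot, codec, resolution, quality level).
samples = [
    {"shot": shot, "codec": codec, "resolution": res, "quality": q}
    for shot in range(N_SHOTS)
    for codec, res, q in product(CODECS, RESOLUTIONS, QUALITY_LEVELS)
]


def shot_grouped_folds(samples, n_folds):
    """Yield (train_idx, test_idx) so that each shot is tested in exactly one fold."""
    shots = sorted({s["shot"] for s in samples})
    # Round-robin shot-to-fold assignment: an assumption, not the paper's scheme.
    fold_of_shot = {shot: i % n_folds for i, shot in enumerate(shots)}
    for fold in range(n_folds):
        test = [i for i, s in enumerate(samples) if fold_of_shot[s["shot"]] == fold]
        train = [i for i, s in enumerate(samples) if fold_of_shot[s["shot"]] != fold]
        yield train, test


folds = list(shot_grouped_folds(samples, N_FOLDS))

# No shot may appear on both sides of a split; that is the leakage being prevented.
for train, test in folds:
    train_shots = {samples[i]["shot"] for i in train}
    test_shots = {samples[i]["shot"] for i in test}
    assert train_shots.isdisjoint(test_shots)

print(len(samples), [len(test) for _, test in folds])
```

With 13 shots and five folds the test folds cannot be equal in size; under the round-robin assumption three folds hold three shots each and two folds hold two shots each.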
