Efficient Temporal Segmentation And Classification Of Short-Form Video Content Using Lightweight CNN-LSTM Architecture

Ben Liu Tan; Chstina Angel Liem; Mohamed Amen

doi:10.51903/jtie.v5i1.441

Authors

Ben Liu Tan, Monash University, Melbourne, Victoria, Australia, 3800
Chstina Angel Liem, Monash University, Melbourne, Victoria, Australia, 3800
Mohamed Amen, Monash University, Melbourne, Victoria, Australia, 3800

Keywords:

Lightweight deep learning, Temporal segmentation, Short-form video classification, CNN-LSTM, Multimedia content analysis

Abstract

The rapid rise of short-form video platforms such as TikTok, Instagram Reels, and YouTube Shorts has transformed how digital content is consumed, creating both opportunities and challenges for media analysis. One critical need is to segment these videos temporally and classify the resulting segments efficiently, which supports applications in content moderation, targeted advertising, and audience behavior research. This study proposes a lightweight deep learning architecture that combines a Convolutional Neural Network (CNN) for visual feature extraction with a Long Short-Term Memory (LSTM) network for temporal sequence modeling. The proposed CNN-LSTM framework is optimized for computational efficiency while maintaining high classification accuracy, making it suitable for deployment in resource-constrained environments. Experimental evaluations on a curated short-form video dataset show that the model performs competitively with larger architectures while using substantially less memory and inference time. Furthermore, the temporal segmentation module effectively isolates meaningful visual-audio segments, enabling more precise classification. The results highlight the potential of lightweight architectures to meet the scalability demands of modern video analysis systems without sacrificing accuracy. This research contributes to the growing body of work on efficient multimedia processing by bridging the gap between high-performance models and practical, real-time applications in the evolving short-form video ecosystem.
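The abstract does not say how per-frame predictions are turned into temporal segments. As a purely illustrative sketch, one common post-processing step is to merge runs of identical frame labels into time-stamped segments. The function name `frames_to_segments`, the label strings, and the `fps` parameter below are assumptions made for this example and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, label)


def frames_to_segments(frame_labels: List[str], fps: float = 1.0) -> List[Segment]:
    """Merge consecutive identical frame labels into time-stamped segments.

    Hypothetical helper: `frame_labels` stands in for the per-frame class
    predictions a sequence model such as an LSTM might emit; `fps` is the
    sampling rate of those frames.
    """
    segments: List[Segment] = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of frames in this run
        segments.append((start / fps, (start + length) / fps, label))
        start += length
    return segments


if __name__ == "__main__":
    # Made-up labels for six frames sampled at 2 frames per second.
    labels = ["intro", "intro", "dance", "dance", "dance", "outro"]
    for seg in frames_to_segments(labels, fps=2.0):
        print(seg)
```

Under these assumptions each segment is a half-open interval in seconds, so adjacent segments share a boundary and together cover the whole clip. A real pipeline would likely also smooth out very short runs before merging, which this sketch omits.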
