From Hand-Drawn Sketches to Interactive Web Prototypes: A Reproducible Vision-Language Approach with Structural and Visual Consistency Evaluation
DOI: https://doi.org/10.51903/jtie.v4i2.490
Keywords: sketch-to-code, web prototype generation, vision-language model, HTML/CSS synthesis, service design
Abstract
Service design workflows often begin with low-fidelity sketches that must be quickly translated into interactive prototypes. This paper studies the Sketch-to-Web problem: generating HTML/CSS prototypes from hand-drawn UI sketches and evaluating fidelity with both structural and visual metrics. Because the original Sketch2Code benchmark is distributed primarily as compressed artifacts that are not executable in our restricted runtime, we construct Sketch2Code-Synth, a size-matched and protocol-matched instantiation containing 731 hand-drawn-style sketches paired with 484 webpage prototypes while preserving the same sketch-to-HTML task interface. We implement a lightweight constrained sketch-to-HTML baseline (ProtoVLM) that combines HOG-based template recognition with template-conditioned HTML/CSS instantiation. We compare ProtoVLM against three baselines (kNN retrieval, heuristic computer vision layout extraction, and majority-template generation) and an oracle upper bound. Evaluation uses (i) DOM tree edit distance computed on a containment-induced layout tree, (ii) element-level IoU with Hungarian matching, and (iii) wireframe SSIM on 200×150 rasterized layouts. On the held-out test split (97 pages, 147 sketches), ProtoVLM achieves a mean tree edit distance of 2.224, mean element IoU of 0.755, and mean SSIM of 0.474. Relative to kNN retrieval, the main gain is in localization stability (IoU 0.755 vs. 0.697), while structural distance is similar (TED 2.224 vs. 2.422). Because the benchmark uses a controlled template library and wireframe renderings, the results should be interpreted as evidence on constrained layout recognition and prototype normalization rather than unconstrained real-world sketch understanding. In this setting, SSIM measures layout resemblance only, not interface realism or usability.
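As an illustration of metric (ii), the following is a minimal sketch of element-level IoU under an optimal one-to-one matching between predicted and reference boxes. It is not the authors' implementation. The function names, the (x0, y0, x1, y1) box format, and the choice to score unmatched elements as zero by dividing by the larger element count are all assumptions. To stay dependency-free, the sketch finds the optimal assignment by brute-force enumeration, which returns the same optimum as the Hungarian method on small inputs but does not scale to large layouts.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_matched_iou(pred, gt):
    """Mean IoU under the best one-to-one matching of pred and gt boxes.

    Assumption: elements left unmatched contribute zero, so the total is
    divided by the larger of the two element counts.
    """
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0
    # IoU is symmetric, so always assign the smaller set into the larger one.
    small, large = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best = 0.0
    # Brute-force stand-in for the Hungarian method (small inputs only).
    for perm in permutations(range(len(large)), len(small)):
        total = sum(box_iou(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


# Hypothetical layout: one box recovered exactly, one recovered at half height.
pred = [(0, 0, 10, 10), (20, 0, 30, 10)]
gt = [(20, 0, 30, 10), (0, 0, 10, 5)]
print(mean_matched_iou(pred, gt))  # 0.75
```

For realistic element counts, a polynomial-time assignment solver would replace the enumeration loop while leaving the IoU computation and the normalization unchanged.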
References
Asmaraloka, A. M., Hermansyah, M. A., Nisa, K., Saputra, F. H. D., & Setiawan, A. (2025). Implementation of the K-Nearest Neighbor (KNN) Algorithm in Handwritten Digit Pattern Recognition Using the Zoning Method. JUISI: Jurnal Ilmiah Sistem Informasi, 4(2), 175–185. https://doi.org/10.51903/kf6s5f56
Beltramelli, T. (2017). Pix2Code: Generating Code from a Graphical User Interface Screenshot. arXiv Preprint arXiv:1705.07962. https://arxiv.org/abs/1705.07962
Brückner, L., Leiva, L. A., & Oulasvirta, A. (2022). Learning GUI Completions With User-Defined Constraints. ACM Transactions on Interactive Intelligent Systems, 12(1), 1–40. https://doi.org/10.1145/3490034
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection With Transformers. In Computer Vision – ECCV 2020: 16th European Conference on Computer Vision, 213–229. https://doi.org/10.1007/978-3-030-58452-8_13
Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 886–893. https://doi.org/10.1109/cvpr.2005.177
Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., & Kumar, R. (2017). Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST), 845–854. https://doi.org/10.1145/3126594.3126651
Eitz, M., Hays, J., & Alexa, M. (2012). How Do Humans Sketch Objects? ACM Transactions on Graphics, 31(4), 44. https://doi.org/10.1145/2185520.2185530
Jaccard, P. (1901). Étude Comparative de la Distribution Florale dans une Portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579. https://www.biodiversitylibrary.org/page/26644004
Kuhn, H. W. (1955). The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1), 83–97. https://doi.org/10.1002/nav.3800020109
Lee, H.-Y., Jiang, L., Essa, I., Le, P. B., Gong, H., Yang, M.-H., & Yang, W. (2020). Neural Design Network: Graphic Layout Generation With Constraints. In Computer Vision – ECCV 2020, 491–506. https://doi.org/10.1007/978-3-030-58598-3_30
Li, R., Zhang, Y., & Yang, D. (2025). Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3921–3955. https://doi.org/10.18653/v1/2025.naacl-long.198
Microsoft. (2018). Sketch2Code: Transform Hand-Drawn Designs into HTML Code. Microsoft AI Lab Project. https://www.microsoft.com/en-us/ai/ai-lab-sketch2code
Munkres, J. (1957). Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38. https://doi.org/10.1137/0105003
Pebadja, G., & Kholifah, S. (2023). The Impact of Brand Image, Pricing Strategies, and Product Quality on Consumer Loyalty in the Coffee Industry: An Empirical Study Using Structural Equation Modeling. Journal of Management and Informatics, 2(2), 1–20. https://doi.org/10.51903/jmi.v2i2.134
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NeurIPS), 91–99. https://doi.org/10.1109/tpami.2016.2577031
Sholekhah, D. Z., & Noviar, D. (2025). Integrative Deep Learning Architecture for High-Accuracy Medical Image Segmentation: Combining U-Net, ResNet, and Transformers. Journal of Technology Informatics and Engineering, 4(1), 115–134. https://doi.org/10.51903/jtie.v4i1.288
Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., & Yang, D. (2025). Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3956–3974. https://doi.org/10.18653/v1/2025.naacl-long.199
Stickdorn, M., & Schneider, J. (2011). This Is Service Design Thinking: Basics, Tools, Cases. BIS Publishers. https://www.worldcat.org/title/this-is-service-design-thinking-basics-tools-cases/849722713
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4), 600–612. https://doi.org/10.1109/tip.2003.819861
Wan, Y., Wang, C., Dong, Y., Wang, W., Li, S., Huo, Y., & Lyu, M. R. (2024). Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach. arXiv Preprint arXiv:2406.16386. https://arxiv.org/abs/2406.16386
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-Training of Text and Layout for Document Image Understanding. In Proceedings of KDD, 1192–1200. https://doi.org/10.1145/3394486.3403183
Yin, P., & Neubig, G. (2017). A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of ACL, 440–450. https://doi.org/10.18653/v1/P17-1041
Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM Journal on Computing, 18(6), 1245–1262. https://doi.org/10.1137/0218062
License
Copyright (c) 2025 Yushan Chen

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

