From Hand-Drawn Sketches to Interactive Web Prototypes: A Reproducible Vision-Language Approach with Structural and Visual Consistency Evaluation

Yushan Chen; Maoxi Li

doi:10.51903/jtie.v4i2.490

Authors

Yushan Chen, Service Design, Savannah College of Art and Design, GA, USA
Maoxi Li, Business Analytics, Fordham University, NY, USA

Keywords:

sketch-to-code, web prototype generation, vision-language model, HTML/CSS synthesis, service design

Abstract

Service design workflows often begin with low-fidelity sketches that must be quickly translated into interactive prototypes. This paper studies the Sketch-to-Web problem: generating HTML/CSS prototypes from hand-drawn UI sketches and evaluating fidelity with both structural and visual metrics. Because the original Sketch2Code benchmark is distributed primarily as compressed artifacts that are not executable in our restricted runtime, we construct Sketch2Code-Synth, a size-matched and protocol-matched instantiation containing 731 hand-drawn-style sketches paired with 484 webpage prototypes while preserving the same sketch-to-HTML task interface. We implement a lightweight constrained sketch-to-HTML baseline (ProtoVLM) that combines HOG-based template recognition with template-conditioned HTML/CSS instantiation. We compare ProtoVLM against three baselines (kNN retrieval, heuristic computer vision layout extraction, and majority-template generation) and an oracle upper bound. Evaluation uses (i) DOM tree edit distance computed on a containment-induced layout tree, (ii) element-level IoU with Hungarian matching, and (iii) wireframe SSIM on 200×150 rasterized layouts. On the held-out test split (97 pages, 147 sketches), ProtoVLM achieves a mean tree edit distance of 2.224, mean element IoU of 0.755, and mean SSIM of 0.474. Relative to kNN retrieval, the main gain is in localization stability (IoU 0.755 vs. 0.697), while structural distance is similar (TED 2.224 vs. 2.422). Because the benchmark uses a controlled template library and wireframe renderings, the results should be interpreted as evidence on constrained layout recognition and prototype normalization rather than unconstrained real-world sketch understanding. In this setting, SSIM measures layout resemblance only, not interface realism or usability.
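
As an illustration of metric (ii), the following is a minimal sketch of element-level IoU under an optimal one-to-one matching. It is not the authors' implementation. The function names (`box_iou`, `matched_mean_iou`), the `(x0, y0, x1, y1)` box format, and the normalization by the larger element count (so that unmatched elements contribute zero) are assumptions made for illustration. To stay dependency-free, the optimal assignment is found by brute-force enumeration rather than by the Hungarian method itself; both optimize the same objective, but enumeration is only practical for small element sets.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_mean_iou(pred, gt):
    """Mean IoU under the best one-to-one matching of predicted and
    ground-truth boxes.

    Assumption: the total is divided by the larger of the two element
    counts, so missing or spurious elements lower the score.
    """
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0
    # Assign every box in the smaller set to a distinct box in the larger set.
    small, large = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best_total = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(box_iou(small[i], large[j]) for i, j in enumerate(perm))
        best_total = max(best_total, total)
    return best_total / len(large)


if __name__ == "__main__":
    predicted = [(0, 0, 10, 10), (20, 0, 30, 10)]
    reference = [(20, 0, 30, 10), (0, 0, 10, 5)]
    print(matched_mean_iou(predicted, reference))
```

For larger layouts, a polynomial-time assignment solver such as the Hungarian method would replace the enumeration loop without changing the score.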

References

Asmaraloka, A. M., Hermansyah, M. A., Nisa, K., Saputra, F. H. D., & Setiawan, A. (2025). Implementation of the K Nearest Neighbor (KNN) Algorithm in Handwritten Digit Pattern Recognition Using the Zoning Method. JUISI: Jurnal Ilmiah Sistem Informasi, 4(2), 175–185. https://doi.org/10.51903/kf6s5f56

Beltramelli, T. (2017). Pix2Code: Generating Code from a Graphical User Interface Screenshot. arXiv Preprint arXiv:1705.07962. https://arxiv.org/abs/1705.07962

Brückner, L., Leiva, L. A., & Oulasvirta, A. (2022). Learning GUI Completions With User-Defined Constraints. ACM Transactions on Interactive Intelligent Systems, 12(1), 1–40. https://doi.org/10.1145/3490034

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection With Transformers. In Computer Vision – ECCV 2020: 16th European Conference on Computer Vision, 213–229. https://doi.org/10.1007/978-3-030-58452-8_13

Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 886–893. https://doi.org/10.1109/cvpr.2005.177

Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., & Kumar, R. (2017). Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST), 845–854. https://doi.org/10.1145/3126594.3126651

Eitz, M., Hays, J., & Alexa, M. (2012). How Do Humans Sketch Objects? ACM Transactions on Graphics, 31(4), 44. https://doi.org/10.1145/2185520.2185530

Jaccard, P. (1901). Étude Comparative de la Distribution Florale dans une Portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579. https://www.biodiversitylibrary.org/page/26644004

Kuhn, H. W. (1955). The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1), 83–97. https://doi.org/10.1002/nav.3800020109

Lee, H.-Y., Jiang, L., Essa, I., Le, P. B., Gong, H., Yang, M.-H., & Yang, W. (2020). Neural Design Network: Graphic Layout Generation With Constraints. In Computer Vision – ECCV 2020, 491–506. https://doi.org/10.1007/978-3-030-58598-3_30

Li, R., Zhang, Y., & Yang, D. (2025). Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3921–3955. https://doi.org/10.18653/v1/2025.naacl-long.198

Microsoft. (2018). Sketch2Code: Transform Hand-Drawn Designs into HTML Code. Microsoft AI Lab Project. https://www.microsoft.com/en-us/ai/ai-lab-sketch2code

Munkres, J. (1957). Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38. https://doi.org/10.1137/0105003

Pebadja, G., & Kholifah, S. (2023). The Impact of Brand Image, Pricing Strategies, and Product Quality on Consumer Loyalty in the Coffee Industry: An Empirical Study Using Structural Equation Modeling. Journal of Management and Informatics, 2(2), 1–20. https://doi.org/10.51903/jmi.v2i2.134

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NeurIPS), 91–99. https://doi.org/10.1109/tpami.2016.2577031

Sholekhah, D. Z., & Noviar, D. (2025). Integrative Deep Learning Architecture for High Accuracy Medical Image Segmentation: Combining U-Net, ResNet, and Transformers. Journal of Technology Informatics and Engineering, 4(1), 115–134. https://doi.org/10.51903/jtie.v4i1.288

Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., & Yang, D. (2025). Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3956–3974. https://doi.org/10.18653/v1/2025.naacl-long.199

Stickdorn, M., & Schneider, J. (2011). This Is Service Design Thinking: Basics, Tools, Cases. BIS Publishers. https://www.worldcat.org/title/this-is-service-design-thinking-basics-tools-cases/849722713

Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4), 600–612. https://doi.org/10.1109/tip.2003.819861

Wan, Y., Wang, C., Dong, Y., Wang, W., Li, S., Huo, Y., & Lyu, M. R. (2024). Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach. arXiv Preprint arXiv:2406.16386. https://arxiv.org/abs/2406.16386

Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-Training of Text and Layout for Document Image Understanding. In Proceedings of KDD, 1192–1200. https://doi.org/10.1145/3394486.3403183

Yin, P., & Neubig, G. (2017). A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of ACL, 440–450. https://doi.org/10.18653/v1/P17-1041

Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM Journal on Computing, 18(6), 1245–1262. https://doi.org/10.1137/0218062