HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML

conference contribution
posted on 2023-11-29, 18:22, authored by Florian Hantke, Ben Stock
With the increased interest in the web in the 90s, everyone wanted to have their own website. However, given the lack of knowledge, such pages contained numerous HTML specification violations. In response, browser vendors came up with a new feature: error tolerance. This feature, part of browsers ever since, makes HTML parsers tolerate violations and fix them on the fly instead of rejecting the page. On the downside, it risks security issues like Mutation XSS and Dangling Markup. In this paper, we ask whether we still need to rely on this error tolerance or can abandon this source of security issues. To answer this question, we study the evolution of HTML violations over the past eight years. To this end, we identify security-relevant violations and leverage Common Crawl to check archived pages for them. Using this framework, we automatically analyze over 23K popular domains over time. This analysis reveals that while the number of violations has decreased over the years, more than 68% of all domains still contain at least one HTML violation today. While this number is too high for browser vendors to tighten the parsing process immediately, we show that automatic approaches could quickly correct up to 46% of today’s violations. Based on our findings, we propose a roadmap for how we could tighten this process to improve the quality of HTML markup in the long run.
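
To make the idea of checking a page for a specification violation concrete, the following is a minimal sketch and not the authors' framework. It assumes one illustrative violation, a start tag that repeats an attribute name (a parse error in the HTML specification); whether this particular violation is among those the paper studies is not stated here. The class and function names are hypothetical, and the input is a plain HTML string rather than a Common Crawl record.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttributeFinder(HTMLParser):
    """Hypothetical checker: records start tags that repeat an attribute name."""

    def __init__(self):
        super().__init__()
        # Each finding is (tag name, attribute name, (line, column)).
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; duplicates are preserved,
        # and names are already lowercased by the parser.
        counts = Counter(name for name, _ in attrs)
        for name, count in counts.items():
            if count > 1:
                self.findings.append((tag, name, self.getpos()))


def find_duplicate_attributes(html):
    """Return all duplicate-attribute findings in the given HTML string."""
    finder = DuplicateAttributeFinder()
    finder.feed(html)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    page = '<p><a href="/safe" href="/other">link</a></p>'
    for tag, name, pos in find_duplicate_attributes(page):
        print(f"<{tag}> repeats attribute {name!r} at {pos}")
```

A browser would silently keep the first `href` and drop the second, which is the kind of tolerant repair the abstract describes; a checker like this only reports that the markup deviated from the specification.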

Preferred Citation

Florian Hantke and Ben Stock. HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML. In: ACM Internet Measurement Conference (IMC). 2022.

Primary Research Area

  • Empirical and Behavioral Security

Name of Conference

ACM Internet Measurement Conference (IMC)

Legacy Posted Date

2022-10-12

Open Access Type

  • Green

BibTeX

@inproceedings{cispa_all_3806,
  title = "HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML",
  author = "Hantke, Florian and Stock, Ben",
  booktitle = "{ACM Internet Measurement Conference (IMC)}",
  year = "2022",
}
