Automatic Detection of Webpages that Share the Same Web Template

Julián Alarte
(Universitat Politècnica de València, Valencia, Spain)
David Insa
(Universitat Politècnica de València, Valencia, Spain)
Josep Silva
(Universitat Politècnica de València, Valencia, Spain)
Salvador Tamarit
(Universidad Politécnica de Madrid, Madrid, Spain)

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages that share the same template without having to load and analyze too many webpages beforehand. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with a hyperlink analysis that yields a very small set with a high level of confidence.

In Maurice H. ter Beek and António Ravara: Proceedings 10th International Workshop on Automated Specification and Verification of Web Systems (WWV 2014), Vienna, Austria, July 18, 2014, Electronic Proceedings in Theoretical Computer Science 163, pp. 2–15.
Published: 8th September 2014.

ArXived at: https://dx.doi.org/10.4204/EPTCS.163.2