Julián Alarte (Universitat Politècnica de València, Valencia, Spain) |
David Insa (Universitat Politècnica de València, Valencia, Spain) |
Josep Silva (Universitat Politècnica de València, Valencia, Spain) |
Salvador Tamarit (Universidad Politécnica de Madrid, Madrid, Spain) |
Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages that share the same template without having to load and analyze too many webpages beforehand. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with a hyperlink analysis that yields a very small set with a high level of confidence. |
ArXived at: https://dx.doi.org/10.4204/EPTCS.163.2 |