Ziv Bar-Yossef & Sridhar Rajagopalan (2002):
Template detection via data mining and its applications.
In: Proceedings of the 11th International Conference on World Wide Web (WWW'02).
ACM,
New York, NY, USA,
pp. 580–591,
doi:10.1145/511446.511522.
Marco Baroni, Francis Chantree, Adam Kilgarriff & Serge Sharoff (2008):
Cleaneval: a Competition for Cleaning Web Pages.
In: Proceedings of the International Conference on Language Resources and Evaluation (LREC'08).
European Language Resources Association,
pp. 638–643.
Available at http://www.lrec-conf.org/proceedings/lrec2008/summaries/162.html.
Radek Burget & Ivana Rudolfova (2009):
Web Page Element Classification Based on Visual Features.
In: Proceedings of the 1st Asian Conference on Intelligent Information and Database Systems (ACIIDS'09).
IEEE Computer Society,
Washington, DC, USA,
pp. 67–72,
doi:10.1109/ACIIDS.2009.71.
Eduardo Cardoso, Iam Jabour, Eduardo Laber, Rogério Rodrigues & Pedro Cardoso (2011):
An efficient language-independent method to extract content from news webpages.
In: Proceedings of the 11th ACM Symposium on Document Engineering (DocEng'11).
ACM,
New York, NY, USA,
pp. 121–128,
doi:10.1145/2034691.2034720.
Soumen Chakrabarti (2001):
Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction.
In: Proceedings of the 10th International Conference on World Wide Web (WWW'01).
ACM,
New York, NY, USA,
pp. 211–220,
doi:10.1145/371920.372054.
World Wide Web Consortium (W3C) (1997):
Document Object Model (DOM).
Available at http://www.w3.org/DOM/.
Adriano Ferraresi, Eros Zanchetta, Marco Baroni & Silvia Bernardini (2008):
Introducing and evaluating ukWaC, a very large web-derived corpus of English.
In: Proceedings of the 4th Web as Corpus Workshop (WAC-4),
pp. 47–54.
David Gibson, Kunal Punera & Andrew Tomkins (2005):
The volume and evolution of web page templates.
In: Allan Ellis & Tatsuya Hagino, editors: Proceedings of the 14th International Conference on World Wide Web (WWW'05).
ACM,
pp. 830–839,
doi:10.1145/1062745.1062763.
Thomas Gottron (2008):
Content Code Blurring: A New Approach to Content Extraction.
In: A. Min Tjoa & Roland R. Wagner, editors: Proceedings of the 19th International Workshop on Database and Expert Systems Applications (DEXA'08).
IEEE Computer Society,
pp. 29–33,
doi:10.1109/DEXA.2008.43.
David Insa, Josep Silva & Salvador Tamarit (2013):
Using the words/leafs ratio in the DOM tree for content extraction.
The Journal of Logic and Algebraic Programming 82(8),
pp. 311–325,
doi:10.1016/j.jlap.2013.01.002.
Christian Kohlschütter (2009):
A densitometric analysis of web template content.
In: Juan Quemada, Gonzalo León, Yoëlle S. Maarek & Wolfgang Nejdl, editors: Proceedings of the 18th International Conference on World Wide Web (WWW'09).
ACM,
pp. 1165–1166,
doi:10.1145/1526709.1526909.
Christian Kohlschütter, Peter Fankhauser & Wolfgang Nejdl (2010):
Boilerplate detection using shallow text features.
In: Brian D. Davison, Torsten Suel, Nick Craswell & Bing Liu, editors: Proceedings of the 3rd International Conference on Web Search and Web Data Mining (WSDM'10).
ACM,
pp. 441–450,
doi:10.1145/1718487.1718542.
Christian Kohlschütter & Wolfgang Nejdl (2008):
A densitometric approach to web page segmentation.
In: James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi & Abdur Chowdhury, editors: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM'08).
ACM,
pp. 1173–1182,
doi:10.1145/1458082.1458237.
Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares Silva & Alberto Henrique Frade Laender (2004):
Automatic web news extraction using tree edit distance.
In: Proceedings of the 13th International Conference on World Wide Web (WWW'04).
ACM,
New York, NY, USA,
pp. 502–511,
doi:10.1145/988672.988740.
Kuo Chung Tai (1979):
The Tree-to-Tree Correction Problem.
Journal of the ACM 26(3),
pp. 422–433,
doi:10.1145/322139.322143.
Karane Vieira, Altigran S. da Silva, Nick Pinto, Edleno S. de Moura, João M. B. Cavalcanti & Juliana Freire (2006):
A fast and robust method for web page template detection and removal.
In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06).
ACM,
New York, NY, USA,
pp. 258–267,
doi:10.1145/1183614.1183654.
Tim Weninger, William Henry Hsu & Jiawei Han (2010):
CETR: Content Extraction via Tag Ratios.
In: Michael Rappa, Paul Jones, Juliana Freire & Soumen Chakrabarti, editors: Proceedings of the 19th International Conference on World Wide Web (WWW'10).
ACM,
pp. 971–980,
doi:10.1145/1772690.1772789.
Lan Yi, Bing Liu & Xiaoli Li (2003):
Eliminating noisy information in Web pages for data mining.
In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03).
ACM,
New York, NY, USA,
pp. 296–305,
doi:10.1145/956750.956785.