References

  1. Ziv Bar-Yossef & Sridhar Rajagopalan (2002): Template detection via data mining and its applications. In: Proceedings of the 11th International Conference on World Wide Web (WWW'02). ACM, New York, NY, USA, pp. 580–591, doi:10.1145/511446.511522.
  2. Marco Baroni, Francis Chantree, Adam Kilgarriff & Serge Sharoff (2008): Cleaneval: a Competition for Cleaning Web Pages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC'08). European Language Resources Association, pp. 638–643. Available at http://www.lrec-conf.org/proceedings/lrec2008/summaries/162.html.
  3. Radek Burget & Ivana Rudolfova (2009): Web Page Element Classification Based on Visual Features. In: Proceedings of the 1st Asian Conference on Intelligent Information and Database Systems (ACIIDS'09). IEEE Computer Society, Washington, DC, USA, pp. 67–72, doi:10.1109/ACIIDS.2009.71.
  4. Eduardo Cardoso, Iam Jabour, Eduardo Laber, Rogério Rodrigues & Pedro Cardoso (2011): An efficient language-independent method to extract content from news webpages. In: Proceedings of the 11th ACM symposium on Document Engineering (DocEng'11). ACM, New York, NY, USA, pp. 121–128, doi:10.1145/2034691.2034720.
  5. Soumen Chakrabarti (2001): Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction. In: Proceedings of the 10th International Conference on World Wide Web (WWW'01). ACM, New York, NY, USA, pp. 211–220, doi:10.1145/371920.372054.
  6. W3C (1997): Document Object Model (DOM). Available at http://www.w3.org/DOM/.
  7. Adriano Ferraresi, Eros Zanchetta, Marco Baroni & Silvia Bernardini (2008): Introducing and evaluating ukWaC, a very large web-derived corpus of English. In: Proceedings of the 4th Web as Corpus Workshop (WAC-4), pp. 47–54.
  8. David Gibson, Kunal Punera & Andrew Tomkins (2005): The volume and evolution of web page templates. In: Allan Ellis & Tatsuya Hagino: Proceedings of the 14th International Conference on World Wide Web (WWW'05). ACM, pp. 830–839, doi:10.1145/1062745.1062763.
  9. Thomas Gottron (2008): Content Code Blurring: A New Approach to Content Extraction. In: A. Min Tjoa & Roland R. Wagner: Proceedings of the 19th International Workshop on Database and Expert Systems Applications (DEXA'08). IEEE Computer Society, pp. 29–33, doi:10.1109/DEXA.2008.43.
  10. David Insa, Josep Silva & Salvador Tamarit (2013): Using the words/leafs ratio in the DOM tree for content extraction. The Journal of Logic and Algebraic Programming 82(8), pp. 311–325, doi:10.1016/j.jlap.2013.01.002.
  11. Christian Kohlschütter (2009): A densitometric analysis of web template content. In: Juan Quemada, Gonzalo León, Yoëlle S. Maarek & Wolfgang Nejdl: Proceedings of the 18th International Conference on World Wide Web (WWW'09). ACM, pp. 1165–1166, doi:10.1145/1526709.1526909.
  12. Christian Kohlschütter, Peter Fankhauser & Wolfgang Nejdl (2010): Boilerplate detection using shallow text features. In: Brian D. Davison, Torsten Suel, Nick Craswell & Bing Liu: Proceedings of the 3rd International Conference on Web Search and Web Data Mining (WSDM'10). ACM, pp. 441–450, doi:10.1145/1718487.1718542.
  13. Christian Kohlschütter & Wolfgang Nejdl (2008): A densitometric approach to web page segmentation. In: James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi & Abdur Chowdhury: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM'08). ACM, pp. 1173–1182, doi:10.1145/1458082.1458237.
  14. Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares Silva & Alberto Henrique Frade Laender (2004): Automatic web news extraction using tree edit distance. In: Proceedings of the 13th International Conference on World Wide Web (WWW'04). ACM, New York, NY, USA, pp. 502–511, doi:10.1145/988672.988740.
  15. Kuo Chung Tai (1979): The Tree-to-Tree Correction Problem. Journal of the ACM 26(3), pp. 422–433, doi:10.1145/322139.322143.
  16. Karane Vieira, Altigran S. da Silva, Nick Pinto, Edleno S. de Moura, João M. B. Cavalcanti & Juliana Freire (2006): A fast and robust method for web page template detection and removal. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, NY, USA, pp. 258–267, doi:10.1145/1183614.1183654.
  17. Tim Weninger, William Henry Hsu & Jiawei Han (2010): CETR: Content Extraction via Tag Ratios. In: Michael Rappa, Paul Jones, Juliana Freire & Soumen Chakrabarti: Proceedings of the 19th International Conference on World Wide Web (WWW'10). ACM, pp. 971–980, doi:10.1145/1772690.1772789.
  18. Lan Yi, Bing Liu & Xiaoli Li (2003): Eliminating noisy information in Web pages for data mining. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD'03). ACM, New York, NY, USA, pp. 296–305, doi:10.1145/956750.956785.
