

Problem

Extracting text content from a web page might seem simple. If you set aside for a few minutes the fact that more and more sites use JavaScript rendering engines like Vue.js or React, parsing HTML is not very complicated. With a few lines of Python, for example, a couple of regular expressions (regexps) and a parsing library like BeautifulSoup4, it's done. If you want to work around the JavaScript issue by taking advantage of our JS crawler in your crawls, I suggest you read "How to crawl a site in JavaScript?".

However, we want to extract text that makes sense, text that is as informative as possible. When you read an article about John Coltrane's last posthumous album, for example, you ignore the menus, the footer, and so on; you obviously aren't reading the whole HTML content. These HTML elements that appear on almost all your pages are called boilerplate. We want to get rid of them and keep only the part that carries relevant information: the text. Through this article, I propose to explore the problem and to discuss some tools and recommendations to achieve this task.
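To illustrate why the naive approach falls short, here is a minimal sketch using BeautifulSoup4 (mentioned above) to dump all the text from a page. The HTML snippet is a made-up example; note how the menu and footer end up mixed into the extracted text.

```python
# Naive HTML-to-text extraction with BeautifulSoup4 (pip install beautifulsoup4).
# The HTML below is a toy example, not a real page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | Albums | Contact</nav>
  <article>
    <h1>John Coltrane</h1>
    <p>A new posthumous album has been released.</p>
  </article>
  <footer>© Example site</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text() concatenates every text node: article, menu and footer alike.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The output contains the article text, but also "Home | Albums | Contact" and the footer: the boilerplate comes along for the ride.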

When we talk about web pages, this includes the HTML, JavaScript, menus, media, header, footer, and more. Automatically and correctly extracting the content is not easy. The first step in this adventure is to extract the text content of the web pages that these machine learning models will use.
