BlueBoard's scraping engine refreshes every page it monitors about 10 times per day on average. Some sites use anti-scraping techniques, and a big part of BlueBoard's job is to find ways around them.
All relevant information (prices, stock levels, reviews…) is extracted from the page by a site-specific piece of code. Whenever a new website is added to BlueBoard, our back-end developers write a new piece of data-extraction code for it.
This code converts the HTML of each product page into structured data in the expected format. Although many sites have elements in common, site-specific adjustments are frequently needed to account for new and changing site structures.
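As a rough illustration of what "site-specific extraction code" can look like, here is a minimal sketch using only the Python standard library. It is not BlueBoard's actual implementation: the host name `shop.example.com`, the CSS classes `product-title` and `price`, and the names `ClassTextParser`, `EXTRACTORS` and `extract` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ClassTextParser(HTMLParser):
    """Captures the first text found inside elements with a wanted CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # maps CSS class -> output field name
        self.current = None       # field whose text we are waiting for
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class in classes:
            field = self.wanted.get(css_class)
            if field and field not in self.fields:
                self.current = field

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None


def extract_example_shop(html):
    # Hypothetical extractor for one site; each site would get its own.
    parser = ClassTextParser({"product-title": "name", "price": "price"})
    parser.feed(html)
    return parser.fields


# Hypothetical registry: one extractor per monitored host.
EXTRACTORS = {"shop.example.com": extract_example_shop}


def extract(url, html):
    host = urlparse(url).hostname
    extractor = EXTRACTORS.get(host)
    if extractor is None:
        raise LookupError(f"no extractor registered for {host}")
    return extractor(html)
```

Keeping one small function per site, selected by host name, means a change in one site's layout only touches that site's extractor.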
For each product page, the model is designed to collect the following information: product name, image, price, availability, user reviews, marketplace and buy box status. All collected data is stored in a database that feeds the various user applications (the dashboard, the notifications, the API).
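The fields listed above could be represented as a single record per scrape. The sketch below is only an assumption about how such a record might be shaped: the class name `ProductSnapshot`, the field names and the types are illustrative, not BlueBoard's real schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ProductSnapshot:
    """One scrape of one product page (illustrative field names and types)."""
    product_name: str
    image_url: Optional[str]
    price: Optional[float]
    available: bool
    review_count: int
    marketplace_seller: Optional[str]   # assumed: name of the third-party seller, if any
    has_buy_box: Optional[bool]         # assumed: whether the tracked offer holds the buy box


snapshot = ProductSnapshot(
    product_name="Blue Kettle",
    image_url=None,
    price=19.99,
    available=True,
    review_count=42,
    marketplace_seller=None,
    has_buy_box=None,
)

# A plain dict is a convenient form for writing to a database or returning from an API.
row = asdict(snapshot)
```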
The team at BlueBoard closely monitors the scraping process to guarantee the freshest and most accurate data. There are several reasons why our engine may fail to scrape a page:
- The page was deleted or moved: we try to find the new page, or mark the product as permanently out of stock.
- The structure of the site changed: we adjust the model to take the changes into account.
- The site introduced new anti-scraping techniques: we switch to a different scraping method.
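The three failure cases above can be sketched as a small triage function. The heuristics here are assumptions for illustration only (HTTP 404/410 treated as a removed page, 403/429 as a likely block, a successful response with missing fields as a structure change); the source does not say how BlueBoard actually detects each case, and `FailureCause`, `REMEDIES` and `classify_failure` are hypothetical names.

```python
from enum import Enum


class FailureCause(Enum):
    PAGE_GONE = "page deleted or moved"
    STRUCTURE_CHANGED = "site structure changed"
    BLOCKED = "new anti-scraping technique"


# Remedies as described in the list above.
REMEDIES = {
    FailureCause.PAGE_GONE: "look for the new page or mark the product as permanently out of stock",
    FailureCause.STRUCTURE_CHANGED: "adjust the extraction model",
    FailureCause.BLOCKED: "switch to a different scraping method",
}


def classify_failure(status_code, fields, required=("name", "price")):
    """Return a FailureCause, or None if the scrape looks healthy (assumed heuristics)."""
    if status_code in (404, 410):
        return FailureCause.PAGE_GONE
    if status_code in (403, 429):
        return FailureCause.BLOCKED
    if any(key not in fields for key in required):
        return FailureCause.STRUCTURE_CHANGED
    return None
```

In practice a human would likely review anything this kind of triage flags, since a missing field can also mean the product itself changed.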
BlueBoard has clients all over the world and has built the expertise to interpret many different languages and site structures.