Web scrapers, programs that retrieve information from web pages, can be tedious to build, mostly because most websites don’t want to be scraped. Site owners treasure their data and don’t want to share it; web scrapers never buy anything or look at ads, and worst of all they generate a lot of web traffic without bringing any benefit. For this project, we needed to build a generic web scraper, one that works on many sites. The goal was to take the base URL of a site, find the search field on that page, execute a query, and return the links to the pages that the query produced.
That left us with three problems. We had to deal with the many different defenses that sites put up against web scrapers, we had to find a generic way of locating the search bar and working out the format of the search query, and we had to find a generic way of separating the URLs of search results from all the other URLs on a web page.
Luckily, none of these challenges proved too difficult, and we were able to quickly develop a proof of concept, which was later turned into a usable API.
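
To make the search-bar problem a bit more concrete, here is a minimal sketch of one way it could be approached. It is not the code from the project, just an illustration under some loud assumptions: the search form is plain HTML (not built by JavaScript), it submits with GET, and its text field either has `type="search"` or uses one of a handful of common names. The names `SearchFormFinder`, `build_search_url` and `COMMON_SEARCH_NAMES` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Assumption: field names that commonly denote a search box (a guess, not a standard).
COMMON_SEARCH_NAMES = {"q", "query", "search", "s", "keyword", "keywords"}


class SearchFormFinder(HTMLParser):
    """Remembers the action and field name of the first form that looks like a search form."""

    def __init__(self):
        super().__init__()
        self._current_action = None  # action of the form currently being parsed
        self.result = None           # (action, field_name) once a candidate is found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current_action = attrs.get("action") or ""
        elif tag == "input" and self._current_action is not None and self.result is None:
            name = attrs.get("name") or ""
            input_type = (attrs.get("type") or "text").lower()
            looks_like_search = input_type == "search" or (
                input_type == "text" and name.lower() in COMMON_SEARCH_NAMES
            )
            if looks_like_search and name:
                self.result = (self._current_action, name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current_action = None


def build_search_url(base_url, html, query):
    """Return a GET search URL for `query`, or None if no search form was recognised."""
    finder = SearchFormFinder()
    finder.feed(html)
    if finder.result is None:
        return None
    action, field = finder.result
    # Simplification: assumes the form action carries no query string of its own.
    return urljoin(base_url, action) + "?" + urlencode({field: query})
```

Given a page containing `<form action="/search"><input type="text" name="q"></form>`, calling `build_search_url("https://example.com/", html, "web scraper")` would produce a URL pointing at `/search` with `q` set to the query. Real sites vary a lot more than this, which is exactly why a generic approach needs fallbacks beyond a simple name list.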


