Bypass missing APIs through scraping

By Mads Opheim

Elevator Pitch

You can’t trust webpages to provide you with their history - but you can keep track yourself

Description

If you want to track how a webpage changes from day to day, you can’t rely on the data provider to give you the history. To make sure you get hold of that information, you’ll probably have to keep track of it yourself.

I’ll go through how I did just that in two different domains, using cloud services and scraping, and show you code, config and a running application. By combining scraping tools in Python with serverless functions and suitable tools from Google Cloud Platform, we can bypass the lack of existing APIs.
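To give a flavour of the scraping step, here is a minimal sketch. The talk itself uses BeautifulSoup; this version swaps in the standard library's `html.parser` so it runs without extra dependencies. The markup, the `event` class name and the `extract_events` helper are assumptions made up for illustration, not the actual pages or code from the talk.

```python
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Collect the text of every <li class="event"> element."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._in_event = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The tag and class name are assumptions about the page markup.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "event" in classes:
            self._in_event = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside an event item.
        if self._in_event:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_event:
            # Join the text fragments and normalise whitespace.
            self.events.append(" ".join("".join(self._buffer).split()))
            self._in_event = False


def extract_events(html):
    """Return the text of each event item found in the HTML string."""
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


# Hypothetical sample page, standing in for a real HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="event">Saturday run, <b>Oslo</b></li>
  <li class="ad">Buy shoes</li>
  <li class="event">Sunday run, Bergen</li>
</ul>
"""

if __name__ == "__main__":
    for event in extract_events(SAMPLE_HTML):
        print(event)
```

Real pages are rarely this tidy, which is where much of the tricky detail comes from.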

We’ll spend a good deal of time on the tricky details and strange errors that took me a long time to master, as the devil’s in the details.

Eventually, we’ll have our own REST API, powered by GCP, providing us with the data we want - the way we want it.

In one of the domains, we’ll scrape a list of events and show how visualizing the data can reveal some interesting twists in it. We’ll see that visualization is a powerful debugging and testing technique. In the other domain, we’ll keep track of how cashback bonuses fluctuate.
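Keeping track of fluctuations can be as simple as storing a new observation only when the scraped value differs from the previous one. The sketch below shows that idea with an in-memory list; the function names, dates and percentages are invented for illustration, and the real applications persist their data in Google Cloud rather than in a Python list.

```python
from datetime import date


def record_observation(history, day, value):
    """Append (day, value) only when the value differs from the latest entry.

    Returns True if a new entry was stored, False if nothing changed.
    """
    if history and history[-1][1] == value:
        return False
    history.append((day, value))
    return True


def changes(history):
    """Return (day, old_value, new_value) for every recorded change."""
    return [
        (day, previous, current)
        for (_, previous), (day, current) in zip(history, history[1:])
    ]


if __name__ == "__main__":
    # Made-up daily cashback percentages for one hypothetical shop.
    history = []
    record_observation(history, date(2020, 1, 1), 2.0)
    record_observation(history, date(2020, 1, 2), 2.0)  # unchanged, skipped
    record_observation(history, date(2020, 1, 3), 4.0)
    record_observation(history, date(2020, 1, 4), 2.0)
    for day, old, new in changes(history):
        print(f"{day}: {old}% -> {new}%")
```

Storing only the changes keeps the history small and makes the interesting days easy to spot when plotting.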

Notes

Hi! Thanks for reviewing my talk! Here are my notes:

Technologies used include:
+ Python
+ Scraping using BeautifulSoup
+ Serverless using Google Cloud Functions
+ Google Cloud Cron jobs
+ Google Cloud Storage REST-API
+ Vue and Axios for frontend
+ Google Cloud Firestore
+ Google Cloud Run with Java and Quarkus (for the backend-for-frontend)

I’ll talk about challenges such as:
+ going from “oh no I can’t do that there’s no API anymore, we need to scrap the entire idea” to “screw it, I can do this”
+ finding what you need from a kind-of-unreadable HTML response
+ handling strange timeout errors
+ what do you do when the page suddenly does not show all the data for one specific host?
+ adaptations needed to run your code locally as well as on Google Cloud Functions
+ programming in languages you don’t know beforehand
+ yes, you can do this. But should you?
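As one example of the timeout challenge, a common way to cope is to retry the request with an increasing delay. The sketch below uses only the standard library; `fetch_with_retry`, its parameters and the injectable `fetcher` and `sleep` arguments are assumptions made for illustration (they make the function easy to try locally without network access) and are not necessarily how the talk solves the problem.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """Download a page as text, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, fetcher=fetch, attempts=3, base_delay=1.0,
                     sleep=time.sleep):
    """Call fetcher(url), retrying on timeouts and URL errors.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetcher(url)
        except (socket.timeout, TimeoutError, urllib.error.URLError) as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Passing in a fake `fetcher` that fails a couple of times before succeeding is a cheap way to check the retry behaviour before deploying.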

Running versions of the applications can be found at https://ecstatic-perlman-c2e3d9.netlify.com/#/ and http://viatrumf.madsopheim.com/