I’m currently contracted to create a web service using some data from a third party Angular application. I worked off a proof of concept codebase that used Chrome’s new Puppeteer API to scrape this site. I strongly regret not starting from scratch.
Puppeteer is a Node API that allows you to control Google’s headless Chrome browser. Imagine a version of your browser that can send and receive requests but has no GUI. It works in the background, performing actions as instructed by an API. This makes Puppeteer great for end to end testing a web application. You can truly simulate the user experience, typing where they type and clicking where they click. Another use case for Puppeteer is web scraping a single page web application. Let’s explore how this might work.
For any Puppeteer project, the first task is to create an instance of the headless browser.
After that’s done, it’s trivial to navigate to and begin interacting with a webpage. Let’s suppose I want to fill out a login form and submit an authentication request to a website. In an ideal situation, it looks like this.
We just successfully filled out a form, submitted an HTTP request containing our form data, and waited for the page to change upon successful login. This is where Puppeteer shines. Let’s look at a more complicated example.
Here is a modified version of some code I wrote for my client:
The first thing to notice is that the code is messy and full of weird exceptions. To make matters worse, it doesn’t always work.
Working with Puppeteer was an exercise in guesswork. Given the same inputs, Puppeteer did not always produce the same outputs1. This flaky behavior made the project unnecessarily challenging and required me to do additional engineering to increase reliability, which was frustrating considering the alternative.
Puppeteer has a API that allows you to execute arbitrary code against the DOM. After scraping form results with this API and getting the same flaky behavior described above, I ditched the approach and started grabbing data from the HTTP response objects themselves. Below is a primitive version of some code I wrote to do this.
The page passed into this function is the same page object created above by the Puppeteer API. Among other things, it is an event emitter that allows me to listen for any HTTP responses. I’ve essentially created a promise that resolves to the response body of a particular ajax request. This allowed me interact with the server API directly and removed the DOM from the data retrieval process, greatly reducing the chance for flaky behavior. But that begs the question, why use Puppeteer at all? Why not simply send http requests to the server API manually and ditch the complicated form submission code above That’s how I should have started all along.
If, however, all you need is data from the server, go the simple route and hit the API with an HTTP library like axios or request. If you are scraping a server side rendered application, you can combine one of the aforementioned tools with Cheerio, giving you a far more user friendly DOM manipulation API than that offered my Puppeteer.
1: You might be curious why Puppeteer exhibited flaky behavior. One issue that I ran into was animations. I might attempt to click a DOM node, but if the animation had not finished, the click would not register. In essence, it would appear as if the click had worked, but once the animation finished, the DOM would reset itself, undoing my click. I think this is simply how Angular’s digest loop reacted to a click at an unexpected time. Unfortunately, Puppeteer provides no functionality to determine when an animation has finished (after all, how would it!). I tried a couple solutions. One entailed a sleep function to wait a couple hundred milliseconds for the animation to finish, but it simply exacerbated the flaky behavior. A second involved executing the click only once the DOM node had a particular class that indicated the animation had finished. At one point, I even tried to disable all animations across the website. All these solutions were half-baked.