Scraping the Web with Puppeteer: Lessons Learned

I’m currently contracted to create a web service using some data from a third-party Angular application. I worked off a proof-of-concept codebase that used Chrome’s new Puppeteer API to scrape this site. I strongly regret not starting from scratch.

What is Puppeteer?

Puppeteer is a Node API that allows you to control Google’s headless Chrome browser. Imagine a version of your browser that can send and receive requests but has no GUI. It works in the background, performing actions as instructed by an API. This makes Puppeteer great for end-to-end testing a web application. You can truly simulate the user experience, typing where they type and clicking where they click. Another use case for Puppeteer is scraping a single-page web application. Let’s explore how this might work.

A Simple Example

For any Puppeteer project, the first task is to create an instance of the headless browser.

const puppeteer = require("puppeteer");

const browser = await puppeteer.launch();
// We will use this page instance and its API frequently
const page = await browser.newPage();

After that’s done, it’s trivial to navigate to and begin interacting with a webpage. Let’s suppose I want to fill out a login form and submit an authentication request to a website. In an ideal situation, it looks like this.

// Navigate to the website
await page.goto("https://website/login");
// Provide the selector of an input box and the content to type
await page.type("input#username", CREDENTIALS.username);
await page.type("input#password", CREDENTIALS.password);
await page.click("button#login"); // Click the login button
// Wait until the screen changes and a node matching
// the selector #logged-in-successfully appears,
// at which point we know the login was successful
await page.waitForSelector("#logged-in-successfully");

We just successfully filled out a form, submitted an HTTP request containing our form data, and waited for the page to change upon successful login. This is where Puppeteer shines. Let’s look at a more complicated example.

A More Complicated Example

Today’s web uses a mix of simple HTML-driven forms and more complicated, JavaScript-driven forms, rich with functionality like autocomplete and dynamic dropdown menus. I’m sure you’ve used a calendar date picker. These components each need to be treated differently.

Here is a modified version of some code I wrote for my client:

// The form inputs I want to fill out
// and their corresponding selectors
const fields = {
  type: 'input[ng-model="type"]',
  origin: 'input[ng-model="origin"]',
  destination: 'input[ng-model="destination"]',
  date: 'input[ng-model="date"]'
};

async function search(searchParams) {
  const page = await browser.newPage();
  await page.goto("https://website/search");
  // We need to click a button to make the search form
  // appear, but first make sure that button has rendered
  await page.waitForSelector(".new-search");
  await page.click(".new-search");
  // Fill out the form
  for (const field of Object.keys(fields)) {
    if (searchParams[field]) {
      const selector = fields[field];
      // We want to make sure each DOM node is rendered
      // before we try to do anything to it.
      await page.waitForSelector(selector);
      // Some inputs need to be focused first for page.type to work.
      // Might as well focus on all of them.
      await page.focus(selector);
      // Some inputs have defaults that need to be
      // erased before typing your own input
      if (field === "date" || field === "type") {
        // This is a helper function I wrote
        await deleteInput(page, selector);
      }
      await page.type(selector, searchParams[field]);
      // This field won't register the typed data
      // until Enter is pressed.
      // This is because the 'type' field is a dropdown
      // where one of a specific set of inputs must be clicked.
      if (field === "type") {
        await page.keyboard.press("Enter");
      }
    }
  }
}

The first thing to notice is that the code is messy and full of weird exceptions. To make matters worse, it doesn’t always work.

Brace Yourself for Flaky Behavior

Working with Puppeteer was an exercise in guesswork. Given the same inputs, Puppeteer did not always produce the same outputs1. This flaky behavior made the project unnecessarily challenging and required me to do additional engineering to increase reliability, which was frustrating considering the alternative.

Puppeteer Was Completely Unnecessary

Puppeteer has an API that allows you to execute arbitrary code against the DOM. After scraping form results with this API and getting the same flaky behavior described above, I ditched the approach and started grabbing data from the HTTP response objects themselves. Below is a primitive version of some code I wrote to do this.

function waitForUrl(page, urlPrefix) {
  return new Promise(resolve => {
    page.on("response", res => {
      if (!res.url.startsWith(urlPrefix)) {
        return;
      }
      resolve(res.json());
    });
  });
}

The page passed into this function is the same page object created above by the Puppeteer API. Among other things, it is an event emitter that allows me to listen for any HTTP responses. I’ve essentially created a promise that resolves to the response body of a particular AJAX request. This allowed me to interact with the server API directly and removed the DOM from the data retrieval process, greatly reducing the chance of flaky behavior. But that raises the question: why use Puppeteer at all? Why not simply send HTTP requests to the server API manually and ditch the complicated form submission code above? That’s how I should have started all along.
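To make the pattern concrete, here is a hedged sketch of how waitForUrl might be combined with the form-submission code. The URL prefix and the button selector are hypothetical placeholders, not the client’s actual API.

// Start listening for the response before triggering the request
// so the response can't slip by unnoticed.
const resultsPromise = waitForUrl(page, "https://website/api/search");
await page.click("button#submit-search");

// The promise resolves with the parsed JSON body of the matching response.
const results = await resultsPromise;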

When Is Puppeteer the Right Solution?

I can only think of a single scenario where using Puppeteer for scraping is superior to the alternative: if the information you want is generated using a combination of API data and JavaScript code. After all, you would have no other way to simulate the JavaScript code without rewriting it.

If, however, all you need is data from the server, go the simple route and hit the API with an HTTP library like axios or request. If you are scraping a server-side rendered application, you can combine one of the aforementioned tools with Cheerio, giving you a far more user-friendly DOM manipulation API than that offered by Puppeteer.
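As a rough sketch of that simpler route, here is what scraping a hypothetical server-rendered page with axios and Cheerio might look like; the URL and the .result selector are made up for illustration.

const axios = require("axios");
const cheerio = require("cheerio");

async function scrapeResults() {
  // Fetch the rendered HTML directly, no headless browser required
  const { data: html } = await axios.get("https://website/search");
  // Load the HTML into Cheerio and query it with a jQuery-like API
  const $ = cheerio.load(html);
  return $(".result")
    .map((i, el) => $(el).text().trim())
    .get();
}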

Footnotes

1: You might be curious why Puppeteer exhibited flaky behavior. One issue that I ran into was animations. I might attempt to click a DOM node, but if the animation had not finished, the click would not register. In essence, it would appear as if the click had worked, but once the animation finished, the DOM would reset itself, undoing my click. I think this is simply how Angular’s digest loop reacted to a click at an unexpected time. Unfortunately, Puppeteer provides no functionality to determine when an animation has finished (after all, how could it?). I tried a couple of solutions. One entailed a sleep function to wait a couple hundred milliseconds for the animation to finish, but it simply exacerbated the flaky behavior. A second involved executing the click only once the DOM node had a particular class that indicated the animation had finished. At one point, I even tried to disable all animations across the website. All these solutions were half-baked.
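For the curious, the class-based workaround looked roughly like the sketch below. The animation-done class is a hypothetical marker standing in for the Angular-specific classes the real site used.

// Wait until the element exists and carries the class that the site
// adds once its entrance animation has completed, then click it.
async function clickAfterAnimation(page, selector) {
  await page.waitForFunction(
    sel => {
      const el = document.querySelector(sel);
      return el && el.classList.contains("animation-done");
    },
    {},
    selector
  );
  await page.click(selector);
}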

Regex And Automated Test Fuzzing

I posted my article Build Your Own Regex Engine on Reddit the other day, and one of the commenters claimed that the implementation should be trivial to break. Since I had already tested my program against a customized suite of tests, the remark got me thinking about how I could further increase my confidence in the correctness of my program. One extremely low-cost yet effective strategy for identifying faults in software is known as fuzzing.

What is Fuzzing?

Fuzzing is an automated testing technique where a program is provided a series of invalid or randomly generated inputs. If we were testing an HTTP API, we might send randomized combinations of query parameters and ensure that our server always returns a 2xx status code. Since JavaScript comes with its own regular expression engine, my fuzzer asserts that, given the same random input, both engines return the same output.

Specifying the Grammar

The first step is to specify the grammar that our regex engine supports.

const lowercase = "abcdefghijklmnopqrstuvwxyz".split("");
const uppercase = lowercase.map(letter => letter.toUpperCase());
const special = ["?", "*", "."];
const regexGrammar = special.concat(lowercase, uppercase);

You might notice we skipped the ^ and $ characters. More on these in a little bit.

Generating Valid Regular Expressions

We want to write a function generateRegex that will select n random characters from the regexGrammar and concatenate them together into a string. This string will be used to create a test regex.

Here are three possible return values of generateRegex:

  1. .AnrQ?QNLQX.syBsOcJlbJZd
  2. .LkuZ?Ynj
  3. .UN?eiyddhXvyNj
function generateRegex(n) {
  let regexString = new Array(n)
    .fill(0)
    .map(chooseOne)
    .join("");
  return regexString;
}

// Pick one element randomly from the grammar and return it
function chooseOne() {
  return regexGrammar[Math.floor(Math.random() * regexGrammar.length)];
}

Removing Invalid Regex Strings

My regex engine only deals with a very small subset of available regex syntax, and furthermore, it does not contain any error handling. What happens if generateRegex returns the pattern ** or ^*? My regex engine was never designed to handle these inputs, though they are possible outputs of generateRegex. We need to make a choice about how to handle these expressions. Since the primary goal of my regex engine is accessibility and simplicity of implementation, I’m not about to begin supporting these edge cases. That means my fuzzer should not generate them either.

One solution to determine if a given regex string is valid is to specify my regex engine’s allowable grammar in BNF. BNF is a formal notation for specifying the syntax of a language. Given this BNF notation, I could ask another program if the randomly generated regex string can be created using my BNF specification. This sounds like a little more work than I want, however, since the invalid cases can simply be manually enumerated and filtered.

function validRegex(regexString) {
  return (
    // None of the following sequences are properly
    // defined by my regex engine
    regexString.indexOf("**") === -1 &&
    regexString.indexOf("??") === -1 &&
    regexString.indexOf("*?") === -1 &&
    regexString.indexOf("?*") === -1 &&
    regexString.indexOf("^?") === -1 &&
    regexString.indexOf("^*") === -1 &&
    !regexString.startsWith("*") &&
    !regexString.startsWith("?")
  );
}

function generateRegex(n) {
  let regexString = new Array(n)
    .fill(0)
    .map(chooseOne)
    .join("");
  // If the generated string is valid, return it
  if (validRegex(regexString)) {
    return regexString;
  // Otherwise generate a new string and return that
  } else {
    return generateRegex(n);
  }
}

One more modification to generateRegex is necessary to support ^ and $, and then we are basically done.

function generateRegex(n) {
  ...
  // We need to ensure that '^' and '$' only go at the beginning
  // and the end of the string, respectively.
  // Give each a 10% probability of appearing in a string
  if (Math.random() < 0.1) regexString = "^" + regexString;
  if (Math.random() < 0.1) regexString = regexString + "$";
  ...
}

Comparing Regex Implementations

All that is required now is to repeatedly invoke generateRegex a fixed number of times and then compare the output of the native JS implementation with the output of my implementation.

// The corpus is the string of text we are matching the pattern against.
// I used a segment of Gulliver's Travels from Project Gutenberg.
function fuzzer(totalTests, corpus) {
  const maxRegexLength = 50; // max will actually be 50 - 1
  let testsRun = 0;
  while (testsRun < totalTests) {
    const regexLength = getRandomInt(1, maxRegexLength);
    const regexString = generateRegex(regexLength);
    const testRegex = new RegExp(regexString);
    try {
      assert.equal(testRegex.test(corpus), myRegexEngine(regexString, corpus));
    } catch (err) {
      console.log(testRegex);
    }
    testsRun++;
  }
}

// Thank you Mozilla :)
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random
function getRandomInt(min, max) {
  min = Math.ceil(min);
  max = Math.floor(max);
  return Math.floor(Math.random() * (max - min)) + min;
}

Results

I ran my fuzzer for a couple million randomly generated cases and ended up learning two things about my regex engine.

  1. My implementation fails extraordinarily with longer texts. I knew recursion would be a problem for any practical regex implementation (at least without tail calls) and would cause stack overflows, but I didn’t expect it to fail with texts that were only a couple thousand words. I think this is because I make liberal use of backtracking algorithms in matchQuestion and matchStar. Since I was forced to test with a relatively short input text, it makes sense to use multiple text inputs to increase the probability of discovering an error.

  2. My implementation treats the . character differently than the native implementation. In the RegExp implementation, . will not match various line terminators (\n, \r, \u2028, or \u2029). My implementation does, as the snippet below illustrates.
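Here myRegexEngine stands in for my engine’s entry point, just as it does in the fuzzer above:

// The native engine's . refuses to match a newline...
new RegExp(".").test("\n"); // false
// ...while my engine happily matches it
myRegexEngine(".", "\n"); // true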

Conclusion

The biggest takeaway is that fuzzing is a simple and inexpensive way to enumerate enormous sets of inputs and identify bugs in your software. This fuzzer took less than an hour to write.

But remember, this fuzzer’s blessing of a couple million input combinations does not verify the correctness of my program. Not even close. A fuzzer is a tool to identify potential errors. Unless you enumerate all possible inputs (completely impossible in this case, where they are infinite), you are not guaranteed that your program is error-free.

Build a Regex Engine in Less than 40 Lines of Code

I stumbled upon an article the other day where Rob Pike implements a rudimentary regular expression engine in C. I converted his code to JavaScript and added test specs so that readers can guide themselves through the creation of the regex engine. The specs and solution can be found in this GitHub repository. This blog post walks through my solution.

The Problem

Our regex engine will support the following syntax:

Syntax | Meaning                                     | Example | Matches
a      | Matches the specified character literal     | q       | q
*      | Matches 0 or more of the previous character | a*      | "", a, aa, aaa
?      | Matches 0 or 1 of the previous character    | a?      | "", a
.      | Matches any character literal               | .       | a, b, c, d, e ...
^      | Matches the start of a string               | ^c      | c, ca, caa, cbb ...
$      | Matches the end of a string                 | a$      | ba, baaa, qwerta ...

The goal is to provide a syntax robust enough to match a large portion of regex use cases with minimal code.

Matching One Character

The first step is to write a function that takes in a one character pattern and a one character text string and returns a boolean indicating if they match. A pattern of . is considered a wildcard and matches against any character literal.

Here are some examples

matchOne('a', 'a') -> true
matchOne('.', 'z') -> true
matchOne('', 'h') -> true
matchOne('a', 'b') -> false
matchOne('p', '') -> false

function matchOne(pattern, text) {
  if (!pattern) return true // Any text matches an empty pattern
  if (!text) return false // If the pattern is defined but the text is empty, there cannot be a match
  if (pattern === ".") return true // Any inputted text matches the wildcard
  return pattern === text
}

Matching Same Length Strings

Now we want to add support for patterns and text strings of greater length. For now, let’s only consider a pattern/text pair of the same length. I happen to know that the solution lends itself very naturally to recursion, so we will use it here. We are going to want to repeatedly invoke matchOne on successive pairs of characters from the pattern/text combination.

function match(pattern, text) {
  if (pattern === "") return true // Our base case - if the pattern is empty, any inputted text is a match
  else return matchOne(pattern[0], text[0]) && match(pattern.slice(1), text.slice(1))
}

The above code advances character by character across the pattern/text pair. It first compares pattern[0] to text[0] and then pattern[1] to text[1] and continues comparing pattern[i] to text[i] until i === pattern.length - 1. If they ever don’t match, then we know that the pattern cannot match the text.

Let’s take an example. Suppose we invoke match('a.c', 'abc'), which returns matchOne('a', 'a') && match('.c', 'bc').

If we continue evaluating these functions, we get matchOne('a', 'a') && matchOne('.', 'b') && matchOne('c', 'c') && match("", ""), which is just equal to true && true && true && true, so we have a match!

The $ Character

Let’s add support for the special pattern character $ that allows us to match the end of a string. The solution simply requires adding an additional base case to the match function.

function match(pattern, text) {
  if (pattern === "") return true
  if (pattern === "$" && text === "") return true
  else return matchOne(pattern[0], text[0]) && match(pattern.slice(1), text.slice(1))
}

The ^ Character

Let’s add support for the special pattern character ^ that allows us to match the beginning of a string. I’m going to introduce a new function called search.

function search(pattern, text) {
  if (pattern[0] === "^") {
    return match(pattern.slice(1), text)
  }
}

This function will be the new entry point to our code. Up till this point, we were only matching patterns that began at the beginning of the text. We are simply making that more clear now by forcing the user to preface the pattern with a ^. But how do we support patterns that appear anywhere within the text?

Matches Starting Anywhere

Currently, the following return true

search("^abc", "abc")
search("^abcd", "abcd")

But search("bc", "abcd") will just return undefined. We want it to return true

If the user does not specify that the pattern matches the beginning of the text, then we want to search for that pattern at every possible starting point within the text. We will default to this behavior if the pattern does not begin with ^1.

function search(pattern, text) {
  if (pattern[0] === "^") {
    return match(pattern.slice(1), text)
  } else {
    // This code will run match(pattern, text.slice(index)) on every index of the text.
    // This means that we test the pattern against every starting point of the text.
    return text.split("").some((_, index) => {
      return match(pattern, text.slice(index))
    })
  }
}

The ? Character

We want to be able to match 0 to 1 of the character before ?.

Here are some examples

search("ab?c", "ac") -> true
search("ab?c", "abc") -> true
search("a?b?c?", "abc") -> true
search("a?b?c?", "") -> true

The first step is to modify match to detect when a ? character is present and then delegate to the matchQuestion function, which we will define shortly.

function match(pattern, text) {
  if (pattern === "") {
    return true
  } else if (pattern === "$" && text === "") {
    return true
  // Notice that we are looking at pattern[1] instead of pattern[0].
  // pattern[0] is the character to match 0 or 1 of.
  } else if (pattern[1] === "?") {
    return matchQuestion(pattern, text)
  } else {
    return matchOne(pattern[0], text[0]) && match(pattern.slice(1), text.slice(1))
  }
}

matchQuestion needs to handle two cases:

  1. Where the character before the ? is not matched but the text matches the remainder of the pattern (everything after the ?).
  2. Where the character before the ? is matched and the rest of the text (minus the 1 matched character) matches the remainder of the pattern.

If either of these cases is truthy, then matchQuestion can return true.

Let’s consider the first case. How do we check if the text matches everything in the pattern except the _? syntax? In other words, how do we check if the character before the ? appears 0 times? We strip 2 characters off the pattern (the first character is the one before the ? and the second is the ? itself) and invoke the match function.

function matchQuestion(pattern, text) {
  return match(pattern.slice(2), text);
}

The second case is a little more challenging, but just like before, it reuses functions we’ve already written:

function matchQuestion(pattern, text) {
  if (matchOne(pattern[0], text[0]) && match(pattern.slice(2), text.slice(1))) {
    return true;
  } else {
    return match(pattern.slice(2), text);
  }
}

If the text[0] matches pattern[0], and the rest of the text (minus the part that is matched by matchOne) matches the remainder of the pattern, then we are golden. Note that we could rewrite the code like this:

function matchQuestion(pattern, text) {
  return (matchOne(pattern[0], text[0]) && match(pattern.slice(2), text.slice(1))) || match(pattern.slice(2), text);
}

The one thing I like about this latter approach is that the boolean OR makes it explicitly clear that there are two cases, either of which may be true.

The * Character

We want to be able to match the character before the * 0 or more times.

All of these should return true.

search("a*", "")
search("a*", "aaaaaaa")
search("a*b", "aaaaaaab")

Similar to what we did when supporting ?, we want to delegate to a matchStar function within our match function:

function match(pattern, text) {
  if (pattern === "") {
    return true
  } else if (pattern === "$" && text === "") {
    return true
  } else if (pattern[1] === "?") {
    return matchQuestion(pattern, text)
  } else if (pattern[1] === "*") {
    return matchStar(pattern, text)
  } else {
    return matchOne(pattern[0], text[0]) && match(pattern.slice(1), text.slice(1))
  }
}

matchStar, like matchQuestion, also needs to handle two cases:

  1. Where the character before the * is not matched but the text matches the remainder of the pattern (everything after the *).
  2. Where the character before the * is matched one or more times and the rest of the text matches the remainder of the pattern.

Since there are two cases that both result in a match (0 matches OR more matches), we know that matchStar can be implemented with a boolean OR. Furthermore, case 1 for matchStar is exactly the same as it was for matchQuestion and can be implemented identically using match(pattern.slice(2), text). That means we only need to formulate an expression that satisfies case 2.

function matchStar(pattern, text) {
  return (matchOne(pattern[0], text[0]) && match(pattern, text.slice(1))) || match(pattern.slice(2), text);
}

Refactoring

We can now go back and cleverly simplify search using a trick I learned in Peter Norvig’s class.

function search(pattern, text) {
  if (pattern[0] === "^") {
    return match(pattern.slice(1), text)
  } else {
    return match(".*" + pattern, text)
  }
}

We use the * character itself to allow for the pattern to appear anywhere in the string. The prepended .* says that any number of any character can appear before the pattern we wish to match.
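For example, a search that used to scan every starting point is now expressed as a single match call:

search("bc", "abcd")
// ...is equivalent to
match(".*bc", "abcd") // true; the .* consumes the leading "a"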

Conclusion

It’s remarkable how simple and elegant the code for such a sophisticated and generalized program can be. The full source is available in this GitHub repository.

A follow-up where I fuzz test the regex engine can be found here.

Footnotes

1: There is a small bug in this code that I’m choosing to ignore. We don’t account for the case that text is an empty string. Currently when text === '', text.split("") will return [] and will not appropriately call match.
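One possible fix, sketched here rather than battle-tested, is to special-case the empty string before splitting:

function search(pattern, text) {
  if (pattern[0] === "^") {
    return match(pattern.slice(1), text)
  } else if (text === "") {
    // An empty text has exactly one starting point: the empty string itself
    return match(pattern, text)
  } else {
    return text.split("").some((_, index) => {
      return match(pattern, text.slice(index))
    })
  }
}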

Write Your Own React-Redux Connect

My inspiration for this blog post came from this video, where Dan Abramov walks through the source code of react-redux.

As frontend web developers, it’s not uncommon that we follow well-specified patterns - often blindly. The frontend landscape is changing rapidly, and sometimes there isn’t time to investigate why we use a specific pattern; we just know we should.

One widely used pattern in react-redux applications looks like this

connect(mapStateToProps, mapDispatchToProps)(MyComponent)

I’ll assume you know how to implement this pattern, but why do we use it and how does it work under the hood?

Why Do We Need React-Redux?

React and Redux are two completely independent tools that have nothing to do with each other. React is a tool for creating user interfaces in the browser. Redux is a tool for managing state. Either tool can be used without the other. We often use them together because they both solve separate but very important and closely related problems. The purpose of react-redux is to get these two tools to talk.

But first, what would we do without react-redux? How would React and Redux talk?

How to Integrate React and Redux Without react-redux

More precisely, how do we ensure that a React component re-renders when the Redux store changes? The answer lies in Redux’s subscribe API.

import store from './store'
import { Component } from 'react'

class MyComponent extends Component {
  constructor() {
    super();
    // One solution is to make each component
    // store the entirety of the redux state.
    this.state = { storeState: store.getState() };
  }
  componentDidMount() {
    // Callbacks passed to store.subscribe will be
    // invoked every time the store's state changes.
    // Our callback can get the state of the
    // store and add it to the component's local state.
    this.unsubscribe = store.subscribe(() => {
      this.setState({ storeState: store.getState() });
    });
  }
  // We need to make sure that we don't accidentally
  // subscribe to the store multiple times in the case
  // where a component mounts, unmounts, and then mounts a second time.
  // Fortunately, Redux makes this easy by returning
  // an unsubscribe function when store.subscribe is invoked.
  componentWillUnmount() {
    this.unsubscribe();
  }
}

If we insert the above boilerplate into every one of our React components, then every component could have access to the store and would be informed through a subscription the moment the store’s state changes. This configuration has three flaws.

  1. The boilerplate of subscribing and unsubscribing to the store is highly error prone and unnecessarily verbose.
  2. All of our React components are dependent upon knowledge of the Redux store. This is a complete failure of separation of concerns.
  3. Every component is dependent upon the entirety of the store’s state tree. This means that whenever an action is dispatched, setState is called on every mounted component, causing each one to re-render, regardless of whether its render function depends on the store state that changed. Woah! Let that sink in for a moment.

Let’s write a rudimentary implementation of connect that resolves the first problem.

Understanding The Syntax of Connect

Typically, we invoke connect like this:

connect(mapStateToProps, mapDispatchToProps)(WrappedComponent);

connect takes in two functions as arguments and returns a function. Yes, you heard me, connect returns a function, not a component. Suppose I invoke connect and neglect to pass in a component.

const connectFunc = connect(mapStateToProps, mapDispatchToProps);
const connectedComponent = connectFunc(WrappedComponent);

connect will return a function. It’s that function that takes in my component. (connect is implemented this way, as opposed to simply taking in three arguments, to support decorator syntax. The Dan Abramov video I linked above explains this.)

Thus, the very first few lines of connect must look like this:

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
  };
}

Higher Order Components

And what does the function we returned above do? This function is implemented as a higher order component (HOC). A HOC is a function that takes in a component as a parameter and returns a new component. The new component is generally a modified or augmented version of the original component.

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    // We are returning a brand new component.
    // Note that this new component does
    // not inherit from WrappedComponent.
    return class WrapperComponent extends Component {
      // All we are doing is returning a new component
      // that renders our original component.
      render() {
        // Notice that we need to pass WrappedComponent
        // WrapperComponent's props.
        // If we didn't do this, then WrappedComponent
        // would never have access to any props.
        return <WrappedComponent {...this.props} />;
      }
    };
  };
}

If we were to run the above connect function on a component, the connected component would behave identically to the original component. Furthermore, we could nest connect as many times as we want

connect(null, null)(connect(null, null)(App))

and still never distort the behavior of the original component. Our current implementation is effectively idempotent.

Eliminating Boilerplate

Our next step is to eliminate some of the boilerplate code. We don’t want to have to subscribe to the store every time we create a new component, so let’s have our new connect function do it instead.

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    return class WrapperComponent extends Component {
      constructor() {
        super();
        this.state = { storeState: store.getState() };
      }
      componentDidMount() {
        this.unsubscribe = store.subscribe(() => {
          this.setState({ storeState: store.getState() });
        });
      }
      componentWillUnmount() {
        this.unsubscribe();
      }
      render() {
        // Since the whole point of this HOC is to get WrappedComponent
        // access to the store, we need to pass that state down as props.
        const storeState = this.state.storeState;
        return <WrappedComponent {...this.props} {...storeState} />;
      }
    };
  };
}

We just made huge progress! Now, whenever we invoke

connect(null, null)(MyComponent)

we get a component that is subscribed to state changes on the store, and this state will be passed down to our component as props.

Implementing Support for mapStateToProps

Our connected components still all depend on the entirety of the store’s state tree. Look at the code above: the entire state is passed down as props to every connected component. To reiterate, this means that if any piece of the store’s state is updated, our component will re-render.

This is where mapStateToProps comes to the rescue. mapStateToProps takes as its argument the store’s state, and it allows us to return the particular pieces of the store’s state that a component depends on. It then passes that state as props to our component instead.
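As a quick refresher before we wire it in, a typical mapStateToProps might look like this; the todos and visibilityFilter slices are made-up examples, not something connect itself defines.

// Only the pieces of state returned here will be passed
// down to the wrapped component as props.
function mapStateToProps(state) {
  return {
    todos: state.todos,
    filter: state.visibilityFilter
  };
}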

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    return class WrapperComponent extends Component {
      constructor() {
        super();
        this.state = { storeState: store.getState() };
      }
      componentDidMount() {
        this.unsubscribe = store.subscribe(() => {
          this.setState({ storeState: store.getState() });
        });
      }
      componentWillUnmount() {
        this.unsubscribe();
      }
      render() {
        // Now, instead of passing down all of the store state,
        // we only pass down the subset of state returned from
        // mapStateToProps
        const storeProps = mapStateToProps(this.state.storeState);
        return <WrappedComponent {...this.props} {...storeProps} />;
      }
    };
  };
}

All we did was insert a call to mapStateToProps, allowing us to make each connected component dependent upon only the state it cares about, as defined by the return value of mapStateToProps. mapStateToProps is a wonderful form of explicit documentation, clearly stating the slices of the state tree each component depends on. Unfortunately, our change does not fix the efficiency problems noted above. More on that below.

mapStateToProps and ownProps

An astute reader might note that mapStateToProps actually takes two arguments: the first is a copy of the store’s state, and the second is the props that were originally passed down to WrapperComponent. react-redux does not pass these down to the wrapped component by default as we do in the example immediately above. Let’s modify our implementation to mirror react-redux.

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    return class WrapperComponent extends Component {
      constructor() {
        super();
        this.state = { storeState: store.getState() };
      }
      componentDidMount() {
        this.unsubscribe = store.subscribe(() => {
          this.setState({ storeState: store.getState() });
        });
      }
      componentWillUnmount() {
        this.unsubscribe();
      }
      render() {
        const newProps = mapStateToProps(this.state.storeState, this.props);
        return <WrappedComponent {...newProps} />;
      }
    };
  };
}

Now the implementer of mapStateToProps can choose which of WrapperComponent‘s props it would like to keep and which it would like to disregard.
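For instance, a hypothetical list-item component might combine an id prop it receives from its parent with data from the store, discarding everything else:

// ownProps lets us look up store state that depends on the props
// passed to the wrapper component.
function mapStateToProps(state, ownProps) {
  return {
    id: ownProps.id,
    todo: state.todos[ownProps.id]
  };
}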

What’s the Point of mapDispatchToProps?

mapDispatchToProps is designed to eliminate React’s dependency upon Redux. If we were to use the above implementation of connect, every component that dispatches an action must import store.dispatch, and the implementation would look like this:

import store from "./store";
import { Component } from "react";
import { updateThing } from "./store/actions";

class ExampleComponent extends Component {
  handleChange(e) {
    store.dispatch(updateThing(e.target));
  }
}

The above component ‘knows’ that it is part of a Redux application because it is explicitly referencing the store to dispatch actions. But we should always try to minimize the interaction of different pieces of architecture, especially when they have no need to interact. Ultimately, React components should not be intertwined with Redux code!

Implementing Support for mapDispatchToProps

connect resolves this problem for us by injecting the store.dispatch dependency into mapDispatchToProps, allowing us to explicitly define functions that dispatch actions without requiring that our presentation components have a dependency on the store. Just as the return value of mapStateToProps is passed down to WrappedComponent, the return value of mapDispatchToProps will be passed down as well.
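A typical mapDispatchToProps, reusing the hypothetical updateThing action creator from the snippet above, might look like this:

// dispatch is injected by connect, so the component never
// has to import the store itself.
function mapDispatchToProps(dispatch) {
  return {
    onThingChange: target => dispatch(updateThing(target))
  };
}

The wrapped component then simply calls this.props.onThingChange(e.target), with no knowledge of Redux.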

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    return class WrapperComponent extends Component {
      constructor() {
        super();
        this.state = { storeState: store.getState() };
      }
      componentDidMount() {
        this.unsubscribe = store.subscribe(() => {
          this.setState({ storeState: store.getState() });
        });
      }
      componentWillUnmount() {
        this.unsubscribe();
      }
      render() {
        // Now we merge the results from mapStateToProps
        // and mapDispatchToProps and pass everything down
        const newProps = Object.assign(
          {},
          mapStateToProps(this.state.storeState, this.props),
          // If you aren't intimately familiar with the this keyword,
          // it's okay if you don't understand why we use bind here
          mapDispatchToProps(store.dispatch.bind(this))
        );
        return <WrappedComponent {...newProps} />;
      }
    };
  };
}

More Efficiency Issues - Hello shouldComponentUpdate

We never actually fixed any of the performance issues noted above. The crux of the problem is that every time the store updates, WrapperComponent re-renders (because of its Redux store subscription that calls setState), and that means WrappedComponent re-renders. This re-rendering happens despite the fact that WrappedComponent‘s props might be unchanged between two invocations of setState. In fact, this scenario is highly probable and will occur whenever a piece of state in the store changes that your component does not depend on (aka, a piece of store state not returned from mapStateToProps).

React has a handy lifecycle method called shouldComponentUpdate that allows us to return a boolean that indicates whether a component should re-render. In essence, if we implement this method on WrapperComponent and it returns false, then React will not re-render WrapperComponent. And it follows that WrappedComponent won’t re-render either.

So, in the above scenario, when WrapperComponent calls setState, React first calls the shouldComponentUpdate method to see if a re-render should actually happen. Let’s implement it below.

// Just a simple shallow equality function
import shallowEqual from "shallow-equal/objects";

function connect(mapStateToProps, mapDispatchToProps) {
  return function(WrappedComponent) {
    return class WrapperComponent extends Component {
      constructor() {
        super();
        this.state = { storeState: store.getState() };
      }
      shouldComponentUpdate(nextProps, nextState) {
        // If the props we would pass down to WrappedComponent haven't
        // changed since the last render, we don't need to re-render.
        // this.oldProps holds the result of mapStateToProps that we
        // cached on the previous render.
        const newProps = mapStateToProps(nextState.storeState, nextProps);
        return !shallowEqual(newProps, this.oldProps);
      }
      componentDidMount() {
        this.unsubscribe = store.subscribe(() => {
          this.setState({ storeState: store.getState() });
        });
      }
      componentWillUnmount() {
        this.unsubscribe();
      }
      render() {
        // We need to hang onto the previous result of
        // mapStateToProps to use the next time
        // shouldComponentUpdate runs
        this.oldProps = mapStateToProps(this.state.storeState, this.props);
        const newProps = Object.assign(
          {},
          this.oldProps,
          mapDispatchToProps(store.dispatch.bind(this))
        );
        return <WrappedComponent {...newProps} />;
      }
    };
  };
}

I’ve created a demo here. Open the console and prove to yourself that shouldComponentUpdate is doing its job.

I should note that this is not exactly what react-redux does because of edge cases, but the concept is still the same.

Now our wrapper and wrapped components will only re-render when the props returned from mapStateToProps change! This is a huge performance gain. This implementation of connect also explains why adherence to immutability is so important in Redux’s reducers. If you fail to respect immutability, the shallow comparison in WrapperComponent‘s shouldComponentUpdate will report the props as unchanged (so shouldComponentUpdate returns false), causing your connected component not to re-render when it should.
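To make that concrete, here is a hedged illustration of the two reducer styles; the counter shape is made up for the example.

// Mutating the existing state object: the old and new props derived from
// it share references, the shallow comparison reports "no change", and
// connected components fail to re-render.
function badCounterReducer(state = { count: 0 }, action) {
  if (action.type === "INCREMENT") {
    state.count++;
    return state;
  }
  return state;
}

// Returning a brand new object: references change, so the shallow
// comparison correctly detects that the derived props changed.
function goodCounterReducer(state = { count: 0 }, action) {
  if (action.type === "INCREMENT") {
    return Object.assign({}, state, { count: state.count + 1 });
  }
  return state;
}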

Wrapping up

React-redux’s connect method is remarkably simple and only performs a handful of operations.

  1. It manages our component’s subscription to the store so that our component can update when the store updates.
  2. It allows us to explicitly define the slice of state our component is dependent upon using mapStateToProps.
  3. It gives our component access to store.dispatch without requiring a direct dependency on the store.
  4. It defines shouldComponentUpdate, ensuring that our components only re-render when the store state they depend on changes.

I hope you found this article helpful. Please feel free to email me and reach out if you have questions. I put a gist online containing the same code as the demo.

Using Reduce

My first introduction to functional programming was a couple of years ago when I read through the famous SICP. As someone who had up to that point worked mostly in object-oriented and imperative languages, I had rarely seen map, filter, and reduce. The purpose of the first two felt obvious; the latter, not so much. This blog post is geared toward someone who knows how reduce works but struggles to use it practically.

Reduce Basics

Feel free to skip this section if you understand the basics of reduce

Reduce is known as a higher order function (HOF). A HOF is defined as a function that either takes in or returns another function.

Reduce takes two parameters: the first is a function, and the second is known as the initial accumulator. We will invoke this function over every element of an array. The ultimate goal is to transform the array into something new.

Suppose we want to take an array of characters and concatenate them into a single string. In this example, the first argument we pass reduce is a HOF that will operate over every character. The HOF takes two arguments: the first is the accumulator and is the value that we ultimately want to assemble, and the second is a particular element of the array.

Let’s simulate what happens when we run a solution to the problem of concatenating strings.

const arrToConcat = ['a', 'b', 'c', 'd'];
arrToConcat.reduce(function(resultantString, nextCharacter) {
  return resultantString + nextCharacter;
}, "") // initial accumulator. This is the second argument to reduce!

You can use the below table to help guide yourself along

invocation # | resultantString | nextCharacter | return value
1            | ""              | 'a'           | 'a'
2            | "a"             | 'b'           | 'ab'
3            | "ab"            | 'c'           | 'abc'
4            | "abc"           | 'd'           | 'abcd'

We will ultimately call our HOF 4 times, passing in each of the 4 elements in the initial array.

The first time we invoke the HOF, the first argument is always the initial accumulator. Notice the "" passed into reduce above. The second argument is 'a' because that’s the first element in our array. We return ‘a’, and this return value is the resultantString that is passed as an argument to the second invocation of our HOF, along with the next element of the array, 'b'. The process continues. 'ab' is returned from the HOF’s second invocation, and 'ab' and 'c' are passed into the third invocation of the HOF. Ultimately, reduce returns 'abcd'.

We built up the result step by step through a series of concatenation steps. Every call to the HOF will return a string with one additional character added to the previous accumulator.

Know Your Return Type

I think one of the reasons map and filter are easier than reduce is because they always return an array. That’s not the case for reduce. reduce is ultimately designed to transform one type into another.

When you begin writing a reduce statement, your first question to yourself should be “What type am I returning?”. Here’s a hint: it’s probably one of the following: String, Number, Object, Array. Once you know what type you will return, you can begin writing your reduce statement. I say this because the initial accumulator (reduce’s second argument) can easily be determined by what type you expect reduce to return. Here is a table to guide your decision.

return type | initial accumulator
string      | ""
number      | 0
object      | {}
array       | []

Notice that the initial accumulator and the return type are always the same!

You probably won’t want to return an array from a reduce function very often, because chances are you could have more simply used a combination of map, filter, and/or some other function instead.
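For example, doubling every number with reduce works, but the map version says the same thing with far less ceremony:

const numbers = [1, 2, 3];

// reduce can build an array, but you have to manage the accumulator yourself
numbers.reduce(function(doubled, nextNumber) {
  doubled.push(nextNumber * 2);
  return doubled;
}, []); // [2, 4, 6]

// map expresses the intent directly
numbers.map(function(nextNumber) {
  return nextNumber * 2;
}); // [2, 4, 6]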

Summing Odd Numbers

So let’s say we want to reduce over a set of numbers and return the sum of all the odd numbers. Since we are returning a number, we can use the above table to determine that the starting accumulator should be 0. That gives us a pretty decent starting point!

const arrToSum = [1, 2, 3, 4, 5];
arrToSum.reduce(function(currentSum, nextNumber) {
}, 0) // This is the second argument to reduce!

Now we just need to complete the guts of the function such that it increments the sum when the nextNumber is odd. If nextNumber is even, it should return currentSum unchanged.

const arrToSum = [1, 2, 3, 4, 5];
arrToSum.reduce(function(currentSum, nextNumber) {
  // nextNumber is odd
  if (nextNumber % 2 === 1) {
    return currentSum + nextNumber
  // nextNumber is even
  } else {
    return currentSum
  }
}, 0)

One interesting thing to note about the above code is that EVERY code path (every if-else block) returns something. This is a very common theme throughout the functional programming paradigm and holds true when writing reduce’s higher order function. In general, it’s a decent litmus test for establishing whether you’ve introduced a bug somewhere. More specifically, if some code path does not return something, or if what it returns is not the same type as your accumulator, you probably have a bug.

Frequency Counter

Let’s do something a little different. Suppose I have an array of words, and I want to count how many times each word appears in the array. We can represent this information using an object that maps words in the array to a number indicating their frequency of occurrence.

// We will input something like this
['luke', 'anakin', 'chewy', 'luke', 'chewy', 'princess', 'leia', 'chewy'] ->
// And output something like this
{
  luke: 2,
  anakin: 1,
  princess: 1,
  chewy: 3,
  leia: 1
}

Consulting the table above, we know that the starting accumulator needs to be an empty object. This means that each iteration over the array returns an object as well.

const frequencyArray = ['luke', 'anakin', 'chewy', 'luke', 'chewy', 'princess', 'leia', 'chewy'];
frequencyArray.reduce(function(resultantObject, nextWord) {
}, {})

Every step of the higher order function needs to examine the next word and determine if it has been seen before. If it has not, then we need to add an entry to the accumulator and return it. If it has, then we need to increment the existing entry inside the accumulator object and return it.

frequencyArray.reduce(function(resultantObject, nextWord) {
  // If the word is not in the object
  if (!resultantObject.hasOwnProperty(nextWord)) {
    resultantObject[nextWord] = 1 // Set it to 1 since this is the first time we've seen the word
    return resultantObject
  } else {
    // Otherwise increment the counter for that word inside the resulting object
    resultantObject[nextWord]++
    return resultantObject
  }
}, {})

Just like the first example, every code path returns the type of the accumulator.

Merging Objects

Let’s end with something a little more complicated.

Suppose we want to implement a merge function that combines several objects into a single object. We’re going to implement something really similar to Object.assign. Here are some examples.

merge([ { a: 4, b: 3 }, { c:10 } ]) --> { a: 4, b: 3, c: 10 }
merge([ { a: 4, b: 3 }, { c:4 }, {f: 7} ]) --> { a: 4, b: 3, c: 4, f: 7}
// Duplicate keys take the value of the latter object
merge([ { a: 4, b: 3 }, { c:4 }, {a: 7} ]) --> { a: 7, b: 3, c: 4}

We know that the return value from this operation is an object, so that will be our starting accumulator. Every iteration of the array will produce a new object that contains the merging of the previous accumulator and the next object.

// This merge function will take in two objects and output a new object
// containing both the keys and values from the two input objects.
function mergeTwo(obj1, obj2) {
  const newObj = {}
  for (let key in obj1) {
    newObj[key] = obj1[key]
  }
  for (let key in obj2) {
    newObj[key] = obj2[key]
  }
  return newObj
}

function merge(arr) {
  return arr.reduce(function(resultantObj, nextObj) {
    return mergeTwo(resultantObj, nextObj)
  }, {})
}

Summary

To summarize

  1. Always start by asking what the type of your output is. The type of your accumulator will follow by using the mapping table.
  2. The logic inside your higher order function should assemble your resulting accumulator piecemeal. Since the return value of the previous iteration is the input to the next iteration, every code path must return a value
  3. Furthermore, unless you want a horrible experience, every code path MUST return a value whose type is the same as that of your initial accumulator.

Conclusion

At the end of the day, just about everything you can do with reduce can be done with some combination of map and filter and perhaps another functional method. And the alternative solution is almost always simpler and more readable. So, in practice, you probably don’t want to use reduce that often. With that said, reduce is a building block on which every other functional method can be built, and we will explore this unique trait in my next blog post.

Leveraging Immutability in React

React has taken the web development community by storm, and with it functional programming concepts have embedded themselves in the mainstream. One common statement you will often read is that all state in React should be immutable, and this practice is justified as necessary for performance reasons. This statement is true, but it only tells half the story. Immutability alone will not yield any performance gains in React (it’ll actually make things slower).

The Quick Answer

You can reap the gains from immutable state in React if you inherit from React.PureComponent instead of React.Component

class MyComponent extends PureComponent {...}

A Broader Perspective

So why does the above code work? That requires a little bit of knowledge about the rendering process.

Virtual DOM

A component’s render function returns a tree of React elements, also known as the virtual DOM. This data structure is a representation of the browser’s DOM that React can manipulate in a performant way. In essence, it’s a 1-1 mapping to the browser DOM.

Whenever state or props change, instead of updating the browser DOM directly, React will call the render function of the updated component, getting its new virtual DOM, and it will compare that virtual DOM to the previous render tree (aka the old virtual DOM). This comparison process (known as reconciliation) allows it to identify the minimum set of changes that need to be made to the browser DOM, allowing for massive performance gains. If the two trees are identical, then no changes need to be made to the UI.

Should a Component Re-render

When setState is called on a component, that component (and all of the components it renders) goes through a two-step process for React to determine if the UI should update.

Before the virtual DOM is rendered, React first runs a component’s shouldComponentUpdate lifecycle method.

shouldComponentUpdate

It is your choice to implement this method. It takes as parameters the new state and props and should return a boolean indicating whether the component should re-render. React will only proceed with re-rendering if you return true from this function. By default, shouldComponentUpdate always returns true.

Let’s look at a simple example

class Counter extends React.Component {
  constructor() {
    super();
    this.state = {
      counter: 0
    }
  }
  shouldComponentUpdate(nextProps, nextState) {
    // We are comparing the state before setState
    // is called to the state after it is called.
    // This component will only re-render if
    // state.counter changes
    return this.state.counter !== nextState.counter
  }
  render() {
    return <span>{this.state.counter}</span>
  }
}

Currently, Counter will only re-render if state.counter changes. Suppose we remove the shouldComponentUpdate above; is there another circumstance where Counter re-renders? It would re-render if a component that renders Counter re-renders! It does not matter that Counter‘s render function does not depend on props!

So, when shouldComponentUpdate returns true, React begins the second part of the re-render process, the part where it invokes the render function, generating its virtual DOM. It will then compare that render tree to the old virtual DOM. In the scenario above, these two trees will always be identical (after all, the component doesn’t even have a way to change state), and the entire operation will be a waste of CPU cycles.

If shouldComponentUpdate returns false, we inform React that a re-render will not be necessary, sparing it an unnecessary call to the component’s render function and the comparison of the render trees.

React.PureComponent

Writing shouldComponentUpdate is sometimes a tedious task, and React supplies a helper to handle a very common shouldComponentUpdate scenario: React.PureComponent, which I mentioned above.

class MyComponent extends PureComponent {...}

React.PureComponent is a handy class that we can inherit from; it implements shouldComponentUpdate as a shallow comparison of the old and new props (and the old and new state). If there are no shallow differences between these objects, then shouldComponentUpdate will return false, and the virtual DOM’s re-rendering process can be skipped entirely, resulting in potentially enormous performance savings.
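Conceptually, PureComponent behaves as if it defined the following shouldComponentUpdate; this is a simplified sketch rather than React’s actual source, and shallowEqual stands in for a generic shallow comparison helper.

class PureishComponent extends React.Component {
  shouldComponentUpdate(nextProps, nextState) {
    // Re-render only if a shallow comparison detects a change
    // in either props or state.
    return (
      !shallowEqual(this.props, nextProps) ||
      !shallowEqual(this.state, nextState)
    );
  }
}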

Why Does Immutability Matter?

Immutability matters because React.PureComponent is going to do shallow comparisons against the prop and state objects. And, as you hopefully know, a shallow comparison’s referential equality checks will only yield the expected results if immutability is respected.

If you are not properly respecting immutability, then you are going to run into scenarios where your components fail to re-render when they actually should. A failure to respect immutability will cause the shallow comparison to report the objects as equal (meaning shouldComponentUpdate returns false) because the old props and new props (or the old state and new state) reference the same object in memory, despite the fact that the props/state changed.
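Here is a hedged illustration of the pitfall with component state; the items array is made up for the example.

// Bad: mutating the existing array keeps the same reference, so a
// PureComponent child receiving `items` as a prop will not re-render.
this.state.items.push(newItem);
this.setState({ items: this.state.items });

// Good: creating a new array changes the reference, so the shallow
// comparison sees the update and the child re-renders.
this.setState({ items: this.state.items.concat(newItem) });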

When used carefully, however, React.PureComponent can yield enormous performance improvements at almost no cost. All you need to do is have your components extend React.PureComponent instead of React.Component and respect immutability when changing state.

I hope you found this article helpful. Please feel free to email me if you have questions.