A client approached me with a puzzling problem:
At Fraight, we have an omnisearch interface backed by an Elasticsearch datastore. The interface allows users to type a freetext query and get a list of database records sorted by relevancy. At its core, this is a simple problem: if the user types in `Joe`, return all people whose name contains the word `Joe`. And indeed, returning all the `Joe`s in the system is trivial; the problem is that we work with hundreds, possibly even thousands, of `Joe`s. How do we identify the particular `Joe` that we care about?
When I worked at Epic, we had a similar problem: a search interface that allowed us to look up patients. Unfortunately, our full names are not as unique as we like to believe, and a simple query for `Luke Smith` would surely bring the system to a halt. Epic solved this problem by providing additional fields. That's why (along with HIPAA reasons), when you call the doctor, they might ask for your birthdate or your address; this additional identifying information is used to pare down the search results. This solution is slow and cumbersome, and it was deemed wholly inadequate for us.
We settled on two attributes that should influence the score of a particular record:
- How often we interact with an entity
- How recently we’ve interacted with an entity
It's important to know that most of the trucking organizations in our system have been worked with only minimally. We needed to remove this noise from the search results. Phrased differently, if we regularly work with one particular `Great America Truckers` more than the other 1,000 `Great America Truckers`, we want our partner to appear higher in the search results.
Similarly, if we have recently interacted with an organization, there’s a good chance we will need to interact with them in the future. For example, if we’ve just initiated a conversation with a new business partner, there’s a good chance we will continue interacting with them regularly in the near-term. It’s important, however, to consider the time since our last interaction. If we interacted with someone yesterday or the day before, it’s very important for them to be ranked higher than an organization we interacted with 10 days ago.
Both of these solutions require populating our database with information about how often and how recently we've interacted with particular entities. Fortunately, all of this can be done during the data ingestion phase, when records are loaded into Elasticsearch. The only requirement is keeping the information in Elasticsearch up to date with our PostgreSQL database.
The one challenge with this solution is that it requires augmenting our documents with special data regarding total interactions and recency of interactions. The huge advantage, however, is that it works seamlessly: no special configuration is required when we begin working with a new partner.
Elasticsearch provides function score queries that allow you to modify Elasticsearch's calculated score. Their documentation provides a really simple example explaining how to boost the relevancy of documents that have many likes. That example scores every document as `sqrt(1.2 * doc['likes'].value)`, which Elasticsearch then combines (by default, by multiplying) with the query's own relevancy score.
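That "likes" example can be expressed as a request body. The following Python dict is a sketch of roughly that shape; the `match_all` inner query is a placeholder:

```python
# A sketch of the docs' "likes" example as a function_score request body.
likes_query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},  # placeholder inner query
            "field_value_factor": {
                "field": "likes",    # numeric field read from each document
                "factor": 1.2,       # multiplied into the field value
                "modifier": "sqrt",  # yields sqrt(1.2 * doc['likes'].value)
            },
        }
    }
}
```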
We used a particular kind of Function Score Query known as a Script Score.
We wanted our ultimate score to be a function of three values: the score returned from Elasticsearch, the days since our last interaction, and our total interactions. Initially, our function looked something like `new_score = _score * doc['interactions'].value`.
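As a sketch (not our exact production query), a Script Score version of a naive `_score * interactions` formula might look like the following request body, assuming each document carries an `interactions` field; the inner `match` query is a placeholder:

```python
# Hypothetical request body: multiply Elasticsearch's relevancy score by the
# raw interaction count via a function_score script_score.
naive_query = {
    "query": {
        "function_score": {
            "query": {"match": {"name": "joe"}},  # placeholder inner query
            "script_score": {
                "script": {
                    # `_score` is the query's relevancy; `doc` exposes fields
                    "source": "_score * doc['interactions'].value"
                }
            },
        }
    }
}
```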
We found that Elasticsearch's relevancy score was being overwhelmed by partners with a high number of interactions, so much so that given a search term of `nick drane`, we might return `nick smith` first, simply because we've worked with Nick Smith more.
We needed to place an upper bound on the potential contribution of the total interactions. After initially looking at logistic functions and other overly complicated strategies, we settled on a simple step function: if the number of interactions is greater than 25, we multiply Elasticsearch's relevancy score by 2; otherwise, we multiply it by 1.
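In plain Python (the threshold of 25 and the 2x multiplier come from the description above; the function names are mine), the step function behaves like this:

```python
INTERACTION_THRESHOLD = 25  # partners above this count get the boost
BOOST = 2.0                 # flat multiplier for heavily-used partners

def interaction_multiplier(interactions: int) -> float:
    """Step function bounding the contribution of total interactions."""
    return BOOST if interactions > INTERACTION_THRESHOLD else 1.0

def boosted_score(es_score: float, interactions: int) -> float:
    return es_score * interaction_multiplier(interactions)
```

Because the boost is capped at 2x, a partner with 10,000 interactions can no longer drown out a much better textual match.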
The recency contribution's function was similarly simple. (It could instead be some decay function that looks at the density of interactions over time.)
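One way such a decay could look, as a hedged sketch: an exponential decay on days since the last interaction, where the multiplier starts at 2x and fades toward 1x. The seven-day half-life and the 1x–2x range here are illustrative assumptions, not our production values:

```python
import math

def recency_multiplier(days_since_last: float, half_life_days: float = 7.0) -> float:
    """Exponential decay: 2x boost for an interaction today, fading toward 1x.

    The half-life and the 1x..2x range are illustrative assumptions.
    """
    return 1.0 + math.pow(0.5, days_since_last / half_life_days)
```

An interaction today yields a 2.0x multiplier, one a week old 1.5x, and one a month old only about 1.05x.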
Keeping the extra attributes up to date is the main operational cost of this approach: whenever interaction counts or last-interaction timestamps change in PostgreSQL, the corresponding Elasticsearch documents must be updated to match.
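A minimal sketch of that sync step, assuming the `elasticsearch` Python client; the `organizations` index and field names are illustrative, not our actual schema:

```python
# Hypothetical sync step: after an interaction is recorded in PostgreSQL,
# push the new counters into the matching Elasticsearch document.
def sync_interaction(es_client, org_id, total_interactions, last_interaction):
    es_client.update(
        index="organizations",  # illustrative index name
        id=org_id,
        body={
            "doc": {
                "interactions": total_interactions,
                "last_interaction": last_interaction,
            }
        },
    )
```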
We did try combining two Gaussian curves for the two contributions, but we couldn't get it working.
The original omnisearch was built using a `query_string` query against the `_all` field. Sometimes the returned results were good, but most of the time they fell short. The problem was simple: how do we ensure that the most relevant records rank highest?
Other possible refinements include:

- select based on a recent claim
- select based on a scheduled appointment
- select based on an upcoming event
- specify the table to search