Fuzzy Matching Made Easy
The ability to match data that is similar but not exactly the same can improve data quality, usability, and accuracy. With the fuzzy matching algorithm, it can happen fast.
Fuzzy Matching is a record matching technique that is used to locate non-exact matches, such as an abbreviation or misspelled record. Fuzzy matching is commonly deployed in search engines to produce better search results. Also commonly used to link disparate data sets together that don’t share common keys for more advanced, comprehensive data analysis and insights.
Data enthusiasts: Keep reading. We’ll take a behind-the-scenes look at how fuzzy matching was used to merge Medicaid data with a Tobacco Quitline data to identify the Medicaid recipients who are not already enrolled in program to increase participation.
Exploratory data analysis was conducted to provide insights around two distinct datasets. These discoveries are confidential, so only statistics will be shared. The workstream is created to match misspelled names and addresses across the two datasets. The research and development of the workstream was conducted in Jupyter Notebook, a web-based interactive computing platform.
There are several fuzzy matching algorithms available that will produce fuzzy matching outputs. The issue is with the speed of the algorithm, not the result.
The two algorithms employed in this workstream are Levenshtein and Jaro-Winkler. While the two are widely known in the tech industry, they are not commonly used solutions due to the speed of the algorithms in Python. Levenshtein, which was used to match the first and last names of the two datasets, focuses on the number of edits (i.e. insertions, deletions, or substitutions) needed to convert one string to the other. Jaro-Winkler, on the other hand, is used to match addresses of the datasets. This is a measure of characters in common. Jaro-Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. In our case, the emphasis on the start of the address would eliminate “bad” matches (e.g., address on the same street with different address numbers).
The algorithms produce great matches, but as previously noted, the algorithms written in Python are extremely slow. In the simplest terms, cleaning and transforming entries can take hours to complete, but the implementation of a fuzzy matching solution accelerates the process and brings the benefit of greater data quality and accuracy, which is useful across industries, helping to save time and money.
Mohamad Quteifan is an analyst in Avaap's data analytics practice. Mohamad is an experienced professional who delivers valuable insights via data analytics and advanced data-driven methods.