Fuzzy Matching - Smart Way of Finding Similar Names Using FuzzyWuzzy

By Cheuk Ting Ho

Elevator Pitch

Ever been in the tricky situation of knowing that some names refer to the same thing, yet exact string matching gets you nowhere? All you need is FuzzyWuzzy, a simple but powerful open-source Python library, and some wit. This talk will demonstrate how to fuzzy match company names efficiently.

Description

Matching strings is probably one of the first natural language processing problems people have encountered since we started using computers to handle data. Unlike numerical values, which can be compared with exact logic, it is very hard for a computer to say how alike two strings are. One may compare them character by character to get an idea of how many characters in the pair of strings are the same. Unfortunately, in most applications we need the computer to perceive strings as we do, and therefore we have to use fuzzy matching. Fuzzy matching on names is never straightforward, though: how “different” two names are depends on the case. For example, with restaurant names, matches on words like “cafe”, “bar” and “restaurant” are considered less valuable than matches on less common words. Also, do we consider company names that match only partly (like “Happy Unicorn company” and “Happy Unicorn co.”) to be the same?
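To make the limits of character-by-character comparison concrete, here is a minimal sketch. The function name positional_match_ratio is hypothetical and not part of FuzzyWuzzy; it simply counts the positions at which two strings agree.

```python
def positional_match_ratio(a: str, b: str) -> float:
    """Fraction of positions (relative to the longer string) where both strings
    have the same character. Illustrative helper, not a FuzzyWuzzy function."""
    if not a and not b:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


# The names share a long prefix, so the naive score is fairly high.
print(positional_match_ratio("Happy Unicorn company", "Happy Unicorn co."))

# One extra word at the front shifts every character, so nothing lines up.
print(positional_match_ratio("Happy Unicorn", "The Happy Unicorn"))
```

The second pair looks almost identical to a human reader, yet no position matches once the leading word shifts the rest of the name. This is the kind of case that motivates fuzzy matching.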

In the first half of the talk, Levenshtein distance, a measure of the difference between two strings (the number of single-character edits needed to turn one into the other), will be explained. Different functions in FuzzyWuzzy, such as “partial_ratio” and “token_sort_ratio”, will also be explored and compared. It is very important to understand our tools and choose the right one for the task. Then, in the second half, we will tackle the example problem: matching company names. We will show that, besides using FuzzyWuzzy, we also have to handle problems such as finding common words and avoiding matches on them, and speeding up the matching process by grouping the names. After combining all the tricks and techniques demonstrated, we will evaluate how efficient this method is and what its advantages are.
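As a rough illustration of the ideas above, the following is a pure-Python sketch of Levenshtein distance, plus a token-sorting step in the spirit of “token_sort_ratio”. It is not FuzzyWuzzy's own implementation, and the helper names levenshtein and sort_tokens are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def sort_tokens(name: str) -> str:
    """Lowercase the name and sort its words, so word order stops mattering.
    Hypothetical helper that mimics the idea behind token sorting."""
    return " ".join(sorted(name.lower().split()))


print(levenshtein("kitten", "sitting"))
print(levenshtein("Unicorn Happy Company", "Happy Unicorn Company"))
print(levenshtein(sort_tokens("Unicorn Happy Company"),
                  sort_tokens("Happy Unicorn Company")))
```

Sorting the tokens first makes the two reordered names identical, so their distance drops to zero. That is why choosing the right comparison function for the data matters.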

This talk is for people at all levels of Python experience who would like to learn a trick or two and be able to solve similar problems in the future. The theory of how the library works will be explained, and it is easy to pick up even for beginners.

Notes

The library shown in the talk is open source and available on GitHub: https://github.com/seatgeek/fuzzywuzzy

Source code is available on GitHub: https://github.com/Cheukting/fuzzy-match-company-name

Slides (not finalized): http://slides.com/cheukting_ho/fuzzy-matching