Pages tagged levenshtein:

Dealing with Duplicate Person Data - Proud to Use Perl
http://proudtouseperl.com/2009/04/dealing-with-duplicate-person-data.html

I've recently been working on a fairly large project that has contact information for almost 2 million people. These records contain details for both online and offline actions. Because the data can come from multiple sources, there are many duplicate records. Duplicate records mean more processing for our code, more storage space, and more hassle for our clients who have to deal with them. All in all, bad things to leave lying around. In this article we'll look at some strategies that I used to identify and remove these duplicates. All code in this article is sample code, and we'll leave the task of assembling it into a final working program up to the reader.

CPAN is your Friend

Like all good Perl projects, we will make heavy use of the CPAN. It makes our lives so much easier, and every day I'm more in awe of the quality and breadth of solutions I find there. For this project we'll be using Text::LevenshteinXS, Lingua::EN::Nickname and Parallel::ForkManager.

What is a Du
Funny to see people still using Perl these days, but great example.