Data transformation is an important part of any data matching or integration project and should not be overlooked, even when you are using fuzzy logic. If you have already read the sections on Phonetic & String Comparison algorithms, you will have seen that fuzzy logic, whilst undoubtedly clever, is not foolproof. The more you can improve the data quality, the better the fuzzy logic will perform.
So what data needs transforming?
It might be better to ask what data doesn’t. Seriously, you can add value with transformation logic to almost every aspect of your data.
Let’s start with something easy: addresses.
Here are some good examples of transformation logic that you should be using with your address data.
- Rd = Road
- St = Street
- Ave = Avenue
- Hwy = Highway
- Sqr = Square
This seems pretty obvious, but be careful: some abbreviations have more than one meaning. For example, “St” could mean “Street” or “Saint”.
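The expansions above, including the “St” ambiguity, can be sketched in Python roughly as follows. This is a minimal illustration rather than a complete address parser: the function name `expand_address` is hypothetical, and the rule used for “St” (treat it as “Street” when it ends the address or is followed by a comma, otherwise as “Saint”) is an assumed heuristic that will not hold for every address.

```python
# Abbreviations taken from the list above ("St" is handled separately).
ABBREVIATIONS = {
    "rd": "Road",
    "ave": "Avenue",
    "hwy": "Highway",
    "sqr": "Square",
}


def expand_address(address: str) -> str:
    """Expand common street abbreviations in a single address line."""
    tokens = address.split()
    expanded = []
    for i, token in enumerate(tokens):
        # Compare without trailing "." or "," and without regard to case.
        key = token.rstrip(".,").lower()
        comma = "," if token.endswith(",") else ""
        if key == "st":
            # Assumed heuristic: "St" closing the address or a comma-separated
            # segment is "Street"; "St" followed by another word is "Saint".
            ends_segment = i == len(tokens) - 1 or bool(comma)
            word = "Street" if ends_segment else "Saint"
        elif key in ABBREVIATIONS:
            word = ABBREVIATIONS[key]
        else:
            expanded.append(token)  # Not an abbreviation: keep unchanged.
            continue
        expanded.append(word + comma)
    return " ".join(expanded)


print(expand_address("12 St John St"))            # 12 Saint John Street
print(expand_address("12 Main St, Springfield"))  # 12 Main Street, Springfield
```

A positional rule like this still misreads addresses such as "5 High St East", so in practice it may be safer to flag ambiguous tokens for review than to expand them blindly.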
Let’s look at some other data, such as postal codes or ZIP Codes. These may be formatted in different ways. For instance, US ZIP Codes can be either 5 or 9 digits long and may include a “-” separator, and sometimes you will find that leading zeros have been dropped.
By standardizing the postal code you can match postcodes more reliably. This matters because fuzzy logic is perhaps less useful here: two codes that differ by a single character usually refer to different areas.
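As a rough sketch of the ZIP Code standardization described above, the Python function below strips everything except digits, restores leading zeros, and emits either `12345` or `12345-6789`. The name `normalize_us_zip` is hypothetical, and the code assumes that any value shorter than expected has lost leading zeros (rather than being truncated or simply wrong), which you would want to verify against your own data.

```python
import re
from typing import Optional


def normalize_us_zip(raw: object) -> Optional[str]:
    """Return a ZIP Code as '12345' or '12345-6789', or None if unusable."""
    # Keep digits only, so "02134-1234", "02134 1234" and 21341234 all align.
    digits = re.sub(r"\D", "", str(raw))
    if not digits or len(digits) > 9:
        return None
    # Assumption: up to 5 digits is a 5-digit ZIP, 6 to 9 digits is a ZIP+4,
    # and any shortfall is caused by dropped leading zeros.
    width = 5 if len(digits) <= 5 else 9
    digits = digits.zfill(width)
    if width == 5:
        return digits
    return f"{digits[:5]}-{digits[5:]}"


print(normalize_us_zip(2134))          # 02134
print(normalize_us_zip("021341234"))   # 02134-1234
```

Returning `None` for unusable values keeps bad codes out of the match key instead of silently padding them into something that looks valid.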
Company names are another area you will want to look at, as common abbreviations exist in large quantities, such as Dept, Fed, HQ, Ltd, PLC, Corp, Inc, &, NHS and Gov.
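One way to apply this to company names is to build a normalized "match key" for comparison while leaving the stored name untouched. The sketch below is illustrative only: `company_match_key` is a hypothetical name, and the expansions in `COMPANY_TERMS` are assumed meanings for some of the abbreviations listed above. Others, such as "Fed", are left out because they can expand in more than one way.

```python
import re

# Assumed expansions; extend or correct these for your own data.
COMPANY_TERMS = {
    "dept": "department",
    "hq": "headquarters",
    "ltd": "limited",
    "plc": "public limited company",
    "corp": "corporation",
    "inc": "incorporated",
    "gov": "government",
    "&": "and",
}


def company_match_key(name: str) -> str:
    """Build a lowercase, punctuation-free key with abbreviations expanded."""
    # Pad "&" with spaces so "Smith&Jones" still yields a separate "&" token.
    text = name.lower().replace("&", " & ")
    # Drop punctuation other than "&" (e.g. the dot in "Ltd.").
    text = re.sub(r"[^\w&\s]", "", text)
    tokens = [COMPANY_TERMS.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(company_match_key("Smith & Jones Ltd."))       # smith and jones limited
print(company_match_key("Smith and Jones Limited"))  # smith and jones limited
```

Both spellings produce the same key, so an exact comparison now succeeds where it would previously have needed fuzzy matching.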
Product information can also benefit from transformation logic. For instance, a retail product list may contain weights and measures that need converting to a common unit (e.g. 1KG = 1000g, 1Ltr = 1000ml) or unit abbreviations that need expanding (e.g. ltr = Litre).
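The weights and measures example can be sketched as a conversion to a base unit, so that "1KG" and "1000g" compare as equal. The function name `normalize_quantity`, the choice of grams and millilitres as base units, and the spellings in `UNIT_FACTORS` are all assumptions made for illustration.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Unit spelling -> (multiplier, base unit). Assumed list; extend as needed.
UNIT_FACTORS = {
    "kg": (1000, "g"),
    "g": (1, "g"),
    "ltr": (1000, "ml"),
    "litre": (1000, "ml"),
    "l": (1000, "ml"),
    "ml": (1, "ml"),
}

# A number (optionally with a decimal part) followed by a unit, e.g. "1.5 Ltr".
_QUANTITY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


def normalize_quantity(text: str) -> Optional[Tuple[Decimal, str]]:
    """Convert '1KG' or '1000 ml' to (amount, base_unit), or None if unknown."""
    match = _QUANTITY.match(text)
    if match is None:
        return None
    unit = match.group(2).lower()
    if unit not in UNIT_FACTORS:
        return None
    factor, base_unit = UNIT_FACTORS[unit]
    # Decimal avoids binary floating-point rounding on values like "0.1".
    return Decimal(match.group(1)) * factor, base_unit


print(normalize_quantity("1KG") == normalize_quantity("1000g"))    # True
print(normalize_quantity("1Ltr") == normalize_quantity("1000 ml")) # True
```

Unrecognized units return `None` rather than a guess, which makes it easy to list the values that still need a transformation rule.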