Wednesday, August 5, 2009

DQ Alert: Easy Savings by Removing Dups


Poor data quality costs over $600 billion annually as per TDWI surveys and studies.

Whether you are the VP looking over your latest sales, the secretary compiling a mailing list, the data quality analyst rummaging through data sets, the business analyst working on a data integration project, or an accountant going over the projected budget.

Whether you are in a large Fortune 500 company, in the government, or a small community association you will benefit.

Duplicate records

Duplicate records of customers cause discontented customers and multiple mail-outs. How is it possible to have duplicate records?


-- Records are manually entered twice;
-- Processes create record twice;
-- Participants registered under multiple names;
-- Participants registered at multiple locals.

So before you send out a mailing list to your community members, or potential program participants for marketing campaigns or program sign-ups for this year’s sports, arts, sales, and/or membership drive seasons:

-- 1. Sort that list by name. Why because you may have the child on the list twice, or three times; after sorting by name;

-- 2. Sort by address, you may have the same family household multiple times because one the parents registered under their name, and/or siblings are registered with your organization as well.

Don’t know how to sort…here’s a good way to start if you’re using Microsoft Excel…highlight your records and click on this little icon , it’s easy as breaking eggs.


Or

New SQL type programming here’s a simple query that will identify duplicate records.


SELECT attribute1, attribute2, attribute3, attributen… count(*)
FROM
dbo.tablex
GROUP BY attribute1, attribute2, attribute3, attributen…
HAVING (COUNT(*) > 1)



I hope this helps anyone working with customer lists..









1 comment:

  1. Excellent post Daniel!

    A simple technique for implementing an alert for one of the most common data quality problems.

    Obviously there are more insidious variations of potential duplicates that this alert would not catch, but at least it would help identify some of the most obvious (and embarrassing) ones.

    Best Regards...

    Jim

    ReplyDelete