
Nov 27, 2008

Don't forget to do data cleansing before modeling

Commit to memory:
1) Values are within the domain range – illegal or out-of-range values need to be eliminated.
a. Example 1: A variable like ‘gender’ is expected to have only two values: either ‘0’ and ‘1’, or ‘Male’ and ‘Female’. Check through frequency tables whether there are more distinct values than expected.
b. Example 2: Variables like ‘Date-of-Birth’ or ‘Height in Inches’ should be within reasonable limits.
c. Example 3: Variables like ‘Level of Education’ or ‘Customer Category’ should not have more levels or categories than are defined.
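
The checks in point 1 can be sketched in a few lines of Python. This is only an illustration: the sample records, the field names (`gender`, `height_in`) and the height limits are assumptions made up for the example, not values from any real data set.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"gender": "Male", "height_in": 68},
    {"gender": "Female", "height_in": 64},
    {"gender": "M", "height_in": 70},
    {"gender": "Female", "height_in": 640},
]

ALLOWED_GENDER = {"Male", "Female"}
HEIGHT_RANGE = (20, 100)  # assumed plausible limits, in inches

# Frequency table: any value outside the allowed set is an illegal value.
gender_freq = Counter(r["gender"] for r in records)
illegal_gender = {v: n for v, n in gender_freq.items() if v not in ALLOWED_GENDER}

# Range check: collect the positions of rows whose height is outside the limits.
low, high = HEIGHT_RANGE
out_of_range = [i for i, r in enumerate(records) if not low <= r["height_in"] <= high]

print(gender_freq)
print(illegal_gender)
print(out_of_range)
```

The same frequency-table idea applies to Example 3: count the distinct levels and compare them with the defined list.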

2) Uniqueness of the data – check for duplicate records across the data.
a. The following inconsistencies, which might be due to programming, typing or phonetic errors, need to be corrected to maintain uniqueness. One customer ID should map to one customer name, a city name and its STD code should correspond, and misspellings of Chennai city should be corrected.
‘Customer ID = 1000089’ ‘Customer Name = John Smith’
‘Customer ID = 1000089’ ‘Customer Name = Peter Miller’.
‘City=Chennai, STDCODE=044’ ‘City=Chennai, STDCODE=055’.

‘City=Chennai, City=chenai, City=CHHENNAI, CITY=Madras’.

‘Customer Name= VIVEKANAND’ ‘Customer Name=VIVEK ANAND’.

b. The following examples must be treated properly as either “wrong or misfiled” or “missing” values, so that uniqueness of the field is maintained:
‘phone=000-00000000’ ‘phone=999-99999999’.

‘phone=000-23#45*56’ ‘phone=###-********’.
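
A rough sketch of the checks in point 2, reusing the sample values above. The alias table for city names, the assumed phone format (three digits, a hyphen, eight digits) and the helper names `canonical_city` and `clean_phone` are hypothetical choices for this illustration; a real project would need its own rules.

```python
import re
from collections import defaultdict

# Sample rows built from the examples above; the layout is illustrative only.
customers = [
    {"id": "1000089", "name": "John Smith", "city": "Chennai", "phone": "044-23456789"},
    {"id": "1000089", "name": "Peter Miller", "city": "chenai", "phone": "000-00000000"},
    {"id": "1000090", "name": "VIVEK ANAND", "city": "Madras", "phone": "000-23#45*56"},
]

# Assumed lookup from known variants to one canonical city name.
CITY_ALIASES = {"chennai": "Chennai", "chenai": "Chennai",
                "chhennai": "Chennai", "madras": "Chennai"}

def canonical_city(raw):
    """Map a known spelling variant to its canonical name, else return it trimmed."""
    return CITY_ALIASES.get(raw.strip().lower(), raw.strip())

def clean_phone(raw):
    """Return the phone if it looks valid, else None (treated as missing)."""
    if not re.fullmatch(r"\d{3}-\d{8}", raw):
        return None  # stray symbols or wrong shape
    if len(set(raw.replace("-", ""))) == 1:
        return None  # placeholder such as 000-00000000 or 999-99999999
    return raw

# A customer ID that maps to more than one name signals a uniqueness problem.
names_by_id = defaultdict(set)
for c in customers:
    names_by_id[c["id"]].add(c["name"])
conflicting_ids = sorted(i for i, names in names_by_id.items() if len(names) > 1)

cities = [canonical_city(c["city"]) for c in customers]
phones = [clean_phone(c["phone"]) for c in customers]

print(conflicting_ids)
print(cities)
print(phones)
```

Cases like ‘VIVEKANAND’ versus ‘VIVEK ANAND’ are harder, since removing spaces could merge two genuinely different names, so they are usually flagged for manual review rather than fixed automatically.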

3) Wrong references – a reference may be defined but a wrong entry or record exists; such entries need to be corrected or cross-checked.
a. Example: A ZIP value may be filled in but not belong to Chennai city.
‘City=Chennai, STDCODE=044, ZIP=600053’
‘City=Chennai, STDCODE=044, ZIP=600653’
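
A reference check like the one in point 3 could be sketched as a lookup against a table of valid values. The table `VALID_ZIPS` below is a hypothetical stand-in holding a single entry for the example; a real one would be loaded from an authoritative postal source.

```python
# Hypothetical reference table: (city, STD code) -> set of valid ZIP codes.
# It holds one entry only, purely for illustration.
VALID_ZIPS = {("Chennai", "044"): {"600053"}}

rows = [
    {"city": "Chennai", "std": "044", "zip": "600053"},
    {"city": "Chennai", "std": "044", "zip": "600653"},
]

def has_valid_reference(row):
    """True if the row's ZIP is listed for its city and STD code."""
    return row["zip"] in VALID_ZIPS.get((row["city"], row["std"]), set())

# Rows that fail the lookup need to be corrected or cross-checked.
bad_rows = [r for r in rows if not has_valid_reference(r)]
print(bad_rows)
```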
4) Corresponding values – values like age should correspond to the given date-of-birth (DOB).
a. Example: In the example below, the given DOB and age do not agree.

‘DOB: 10-10-1981, Age of customer = 37 years’.
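
The consistency check in point 4 can be sketched by recomputing the age from the DOB and comparing it with the recorded value. The reference date (the date of this post) and the day-month-year reading of the DOB are assumptions; `age_on` is a helper name made up for the example.

```python
from datetime import date, datetime

def age_on(dob, as_of):
    """Completed years between dob and as_of."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

dob = datetime.strptime("10-10-1981", "%d-%m-%Y").date()  # assumed DD-MM-YYYY
as_of = date(2008, 11, 27)  # assumed reference date
recorded_age = 37

computed_age = age_on(dob, as_of)
is_consistent = computed_age == recorded_age
print(computed_age, is_consistent)  # 27 False
```

When the two disagree, the DOB is usually the better field to trust, since a stored age goes stale every year.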
