Nov 27, 2008

Do not forget data cleansing before modeling

Commit to memory:
1) Values are within the domain range – illegal or out-of-range values need to be eliminated.
a. Example 1: A variable like ‘gender’ would be expected to have only two values: either ‘0’, ‘1’ or ‘Male’, ‘Female’. Check the frequency tables to see whether there are more distinct values than expected.
b. Example 2: Variables like ‘Date-of-Birth’ or ‘Height in Inches’ should be within reasonable limits.
c. Example 3: Variables like ‘Level of Education’ or ‘Customer Category’ should not have more levels or categories than are defined.
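
A minimal sketch of the domain checks above, using only the Python standard library. The sample records, the field names (gender, height_in) and the height limits are illustrative assumptions, not values from any real dataset.

```python
from collections import Counter

# Hypothetical sample records; field names and limits are assumptions.
records = [
    {"gender": "Male", "height_in": 70},
    {"gender": "Female", "height_in": 64},
    {"gender": "M", "height_in": 68},        # unexpected gender code
    {"gender": "Female", "height_in": 640},  # implausible height
]

ALLOWED_GENDER = {"Male", "Female"}
HEIGHT_LIMITS = (20, 100)  # assumed plausible range, in inches


def frequency_table(rows, field):
    """Count how often each distinct value of `field` occurs."""
    return Counter(row[field] for row in rows)


def out_of_domain(rows):
    """Return (row index, [offending fields]) for rows that break a rule."""
    bad = []
    low, high = HEIGHT_LIMITS
    for i, row in enumerate(rows):
        problems = []
        if row["gender"] not in ALLOWED_GENDER:
            problems.append("gender")
        if not low <= row["height_in"] <= high:
            problems.append("height_in")
        if problems:
            bad.append((i, problems))
    return bad


gender_counts = frequency_table(records, "gender")
violations = out_of_domain(records)
```

Here gender_counts has three distinct values where two were expected, which is the signal to look for in a frequency table, and violations points at the rows to fix or drop.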

2) Uniqueness of the data – check for duplicate records across the data.
a. The following examples might be due to programming, typing or phonetic errors and need to be corrected for uniqueness. For instance, city name and STD code should correspond, and misspellings of Chennai city should be corrected.
‘Customer ID = 1000089’ ‘Customer Name = John Smith’
‘Customer ID = 1000089’ ‘Customer Name = Peter Miller’.
‘City=Chennai, STDCODE=044’ ‘City=Chennai, STDCODE=055’.

‘City=Chennai, City=chenai, City=CHHENNAI, CITY=Madras’.

‘Customer Name= VIVEKANAND’ ‘Customer Name=VIVEK ANAND’.

b. The following examples must be treated consistently as either “wrong or misfiled” or “missing” values, so that uniqueness of the field is maintained:
‘phone=000-00000000’ ‘phone=999-99999999’.

‘phone=000-23#45*56’ ‘phone=###-********’.
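
The uniqueness checks above could be sketched roughly as follows. The alias table, the phone layout (three digits, a hyphen, eight digits) and the rule that an all-same-digit number is a placeholder are assumptions inferred from the examples, not an established standard.

```python
import re
from collections import defaultdict

# Hypothetical records built from the examples above.
customers = [
    (1000089, "John Smith"),
    (1000089, "Peter Miller"),
    (1000090, "VIVEKANAND"),
]

# Assumed alias table mapping known variants to one canonical city name.
CITY_ALIASES = {
    "chennai": "Chennai",
    "chenai": "Chennai",
    "chhennai": "Chennai",
    "madras": "Chennai",
}

# Assumed phone layout, taken from the examples: 3 digits, hyphen, 8 digits.
PHONE_RE = re.compile(r"[0-9]{3}-[0-9]{8}")


def conflicting_ids(rows):
    """Return customer IDs that appear with more than one name."""
    names = defaultdict(set)
    for customer_id, name in rows:
        names[customer_id].add(name)
    return {cid: sorted(n) for cid, n in names.items() if len(n) > 1}


def normalize_city(name):
    """Map known spelling variants to the canonical name."""
    cleaned = name.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


def phone_or_none(phone):
    """Return the phone if it looks real, else None (treated as missing)."""
    if not PHONE_RE.fullmatch(phone):
        return None
    digits = phone.replace("-", "")
    if len(set(digits)) == 1:  # e.g. 000-00000000 or 999-99999999
        return None
    return phone
```

Mapping placeholder phones to a single missing marker (None here) keeps them from looking like many customers sharing one number. Cases like ‘VIVEKANAND’ versus ‘VIVEK ANAND’ are not handled by this sketch and usually need fuzzy matching or manual review.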

3) Wrong references – a reference may be defined but a wrong entry or record exists; it needs to be corrected or cross-checked.
a. Example: A ZIP may be defined in the reference data but not belong to Chennai city.
‘City=Chennai, STDCODE=044, ZIP=600053’
‘City=Chennai, STDCODE=044, ZIP=600653’
4) Corresponding values – values like age should correspond to the given date-of-birth (DOB).
a. Example: In the example below, the given DOB and age do not agree.

‘DOB: 10-10-1981, Age of customer = 37 years’.
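
Points 3 and 4 are both cross-field checks, and a rough sketch might look like this. The VALID_ZIPS table is a placeholder (which of the two ZIPs above is the valid one is assumed here, and a real check would use an authoritative postal list), and the reference date for the age calculation is assumed to be the date of this post.

```python
from datetime import date

# Placeholder reference table; the contents are an assumption for illustration.
VALID_ZIPS = {"Chennai": {"600053"}}


def zip_matches_city(city, zip_code):
    """True if zip_code is listed for city in the reference table."""
    return zip_code in VALID_ZIPS.get(city, set())


def age_on(dob, as_of):
    """Completed years between dob and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def age_is_consistent(dob, stated_age, as_of):
    """True if the stated age matches the age derived from the DOB."""
    return age_on(dob, as_of) == stated_age


# Assumed reference date: the date of this post.
AS_OF = date(2008, 11, 27)
derived_age = age_on(date(1981, 10, 10), AS_OF)
```

With that reference date the DOB above gives an age of 27, so a stated age of 37 fails age_is_consistent and the record should be flagged.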

Nov 25, 2008

First step in model building - Data Reading.

Commit to memory:
1) Variables have to be read in the appropriate format, namely:
a. Numeric
b. Character
c. Date
d. Currency (Dollar) or Custom (Comma)
e. Length:
i. Appropriate width and decimals for numerics
ii. Appropriate width for characters
2) Appropriate order & labelling for ‘Ordered’ categorical variables, since order is important and value driven.
a. Example 1: Strongly agree, somewhat agree, neither agree nor disagree.
b. Example 2: Ratings such as 0, 1, 2, 3 etc., with 0 as worst and 3 as very-good.
c. Example 3: Variables on which comparison operations, such as greater than or less than, are meaningful.
3) Appropriate labelling (description) for ‘Nominal’ categorical variables when given in numerics.
a. Example 1: If Gender is given as 1, label whether it means ‘Male’ or ‘Female’.
b. Example 2: Similarly for Brand, Ethnicity etc.
4) Appropriate labelling (scale description) for continuous variables.
a. Example 1: Age of a product/service – whether in weeks, months, quarters, half-years etc.
b. Example 2: Quantity of a product – whether in units or volumes (Pounds, Kilograms, Litres etc.).
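
As a rough illustration of the reading rules above, here is a small standard-library sketch that reads each column in an explicit format and attaches labels. The sample extract, the column names, the gender coding, the DD-MM-YYYY date layout and the labels for ratings 1 and 2 are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical raw extract; column names and codes are assumptions.
RAW = """customer_id,gender,joined,spend,rating
1000089,1,27-11-2008,"$1,250.50",3
1000090,2,25-11-2008,"$75.00",0
"""

GENDER_LABELS = {"1": "Male", "2": "Female"}  # assumed nominal coding
# Ordered scale: 0 is worst, 3 is very-good; labels for 1 and 2 are assumed.
RATING_LABELS = {0: "worst", 1: "fair", 2: "good", 3: "very-good"}


def parse_currency(text):
    """Strip the dollar sign and thousands commas, keep exact decimals."""
    return Decimal(text.replace("$", "").replace(",", ""))


def read_rows(text):
    """Read each column in its intended format instead of as raw text."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "customer_id": raw["customer_id"],       # character: IDs stay text
            "gender": GENDER_LABELS[raw["gender"]],  # nominal: labelled code
            "joined": datetime.strptime(raw["joined"], "%d-%m-%Y").date(),
            "spend": parse_currency(raw["spend"]),   # currency with commas
            "rating": int(raw["rating"]),            # ordered: keep numeric
        })
    return rows


rows = read_rows(RAW)
# Because ratings stay numeric, greater-than and less-than are meaningful.
best = max(rows, key=lambda r: r["rating"])
best_label = RATING_LABELS[best["rating"]]
```

Keeping the rating as an integer with a separate label table preserves the order for comparisons, while the gender code is replaced by its label because its numeric value carries no order.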