Coastal Econometrician Views: data cleansing

Showing posts with label data cleansing. Show all posts

Aug 29, 2017

Clean or shorten Column names while importing the data itself

When it comes to clumsy column headers namely., wide ones with spaces and special characters, I see many get panic and change the headers in the source file, which is an awkward option given variety of alternatives that exist in R for handling them.

One easy handling of such scenarios is using library(janitor), as name suggested can be employed for cleaning and maintaining. Janitor has function by name clean_names() which can be useful while directly importing the data itself as show in the below example:
" library(janitor); newdataobject <- read.csv("yourcsvfilewithpath.csv", header=T) %>% clean_names() "

Author undertook several projects, courses and programs in data sciences for more than a decade, views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri

May 22, 2009

Data Preparation - An important step people always forget.

Welcome back to practice of good econometrics, hope my post are helping you a lot. I believe in less and quality, so I post seldom.

Today's topic is "Data Preparation".

Data Preparation is different from data cleansing, often people use these words interchangeably. We have already learn what do in data cleansing in my earlier posts, now let’s look at what is data preparation and what we do here.

Data Preparation can be said as an understanding of the data that allows us to build the right model, right first time. It helps us in understanding the information enfolded in the data, can be between two independent variables and dependent and independent variables. Once, relationship is identified and traceable, then the predictor variable is re-expressed to reflect the uncovered relationship, and consequently tested for inclusion into the model.

First and prior methods of data preparation are “Correlation analysis” and “Scatter Plots”.

Correlation Analysis:

Correlation analysis provides “correlation coefficient” which is a measure of the strength of the linear-relationship between two variables.

Guidelines for Correlation Coefficient.

Zero (0) indicates no linear relationship.
+1 indicates a perfect positive linear relationship: as one variable increases in its values, the other variable also increases in its values via an exact linear rule.
-1 indicates a perfect negative linear relationship: as one variable increases in its values, the other variable also decreases in its values via an exact linear rule.
Values between 0 and 0.3 (0 and -0.3) indicate a weak positive (negative) linear relationship.
Values between 0.3 and 0.7 (-0.3 and -0.7) indicate a moderate positive (negative) linear relationship.
Values between 0.7 and 1.0 (-0.7 and -1) indicate a strong positive (negative) linear relationship.

Caution with Correlation Coefficient.
a) Correlation coefficient is a reliable measure only if the underlying variables exhibit linear relationship. If the underlying relationship is known to non-linear then Correlation coefficient misleads or questionable.
b) Hence, one needs to test the linearity assumption of the correlation coefficient, which can be done a Scatter plots.

Scatter Plot Analysis:

a. Scatter plot is a graph which represents mapping of the paired points (Xi, Yi).

b. If the scatter of points appears to be a straight-line, then the linear assumption is satisfied and correlation coefficient provides a meaningful measure.

c. If not then linear assumption is not satisfied and correlation coefficient is questionable.

d. Hence, scatter plots are desirable.

Nov 27, 2008

Not to forget to do data cleansing before modeling

Commit to memory:
1) Values are within the domain range – need to eliminate illegal or out-of-range values.
a. Example 1: A variable like ‘gender’ would expected to have only two value; either ‘0’, ‘1’ or ‘Male’, ‘Female’. Check through frequency tables whether values are more than expected.
b. Example 2: Variables like ‘Date-of-Birth’ or ‘Height in Inches’ should be within reasonable limits.
c. Example 3: Levels of Education, Customer Category should not have more than defined levels or categories.

2) Uniqueness of the data – check for duplicate records across the data.
a. Following examples might be due to programming, typo and phonetic errors need to be corrected for uniqueness. City name and STD code should correspond, correcting misspelling of Chennai city.
‘Customer ID = 1000089’ ‘Customer Name = John Smith’
‘Customer ID = 1000089’ ‘Customer Name = “Peter Miller’.
‘City=Chennai, STDCODE=044’ ‘City=Chennai, STDCODE=055’.

‘City=Chennai, City=chenai, City=CHHENNAI, CITY=Madras’.

‘Customer Name= VIVEKANAND’ ‘Customer Name=VIVEK ANAND’.

b. Following examples must be treated properly as either “wrong or misfiled” or “missing” values, so that uniqueness of the field is maintained;
‘phone=000-00000000’ ‘phone=999-99999999’.

‘phone=000-23#45*56’ ‘phone=###-********’.

3) Wrong References – Reference may be defined but wrong entry or record exits, need to be corrected or cross-checked.
a. Examples: Reference ZIP may be defined but does not belong to Chennai city.
‘City=Chennai, STDCODE=044, ZIP=600053’
‘City=Chennai, STDCODE=044, ZIP=600653’
4) Correspondent values – values like age should correspond to given date-of-birth (DOB).
a. Example: In the below example given DOB and age are not correct.

‘DOB: 10-10-1981, Age of customer = 37 years’.