May 22, 2009

Data Preparation - An important step people always forget.

Welcome back to the practice of good econometrics. I hope my posts are helping you. I believe in posting less but with quality, so I post seldom.

Today's topic is "Data Preparation".


Data preparation is different from data cleansing, although people often use the two terms interchangeably. We have already learned what to do in data cleansing in my earlier posts; now let’s look at what data preparation is and what we do in it.

Data preparation can be described as building an understanding of the data that allows us to build the right model, right the first time. It helps us understand the information enfolded in the data, which can be a relationship between two independent variables or between the dependent and independent variables. Once a relationship is identified and traceable, the predictor variable is re-expressed to reflect the uncovered relationship and is then tested for inclusion in the model.

The first and foremost methods of data preparation are “Correlation Analysis” and “Scatter Plots”.


  1. Correlation Analysis:
  • Correlation analysis provides the “correlation coefficient”, which is a measure of the strength of the linear relationship between two variables.
  • Guidelines for the correlation coefficient:
  1. Zero (0) indicates no linear relationship.
  2. +1 indicates a perfect positive linear relationship: as one variable increases in its values, the other variable also increases in its values via an exact linear rule.
  3. -1 indicates a perfect negative linear relationship: as one variable increases in its values, the other variable decreases in its values via an exact linear rule.
  4. Values between 0 and 0.3 (0 and -0.3) indicate a weak positive (negative) linear relationship.
  5. Values between 0.3 and 0.7 (-0.3 and -0.7) indicate a moderate positive (negative) linear relationship.
  6. Values between 0.7 and 1.0 (-0.7 and -1) indicate a strong positive (negative) linear relationship.
  • Cautions with the correlation coefficient:
  • a) The correlation coefficient is a reliable measure only if the underlying variables exhibit a linear relationship. If the underlying relationship is known to be non-linear, the correlation coefficient is misleading or questionable.
  • b) Hence, one needs to test the linearity assumption of the correlation coefficient, which can be done with a scatter plot.
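
To make the guidelines and the caution concrete, here is a minimal Python sketch. It assumes the coefficient in question is the usual Pearson correlation coefficient. The function names pearson_r and strength, the sample data, and the handling of values that fall exactly on 0.3 or 0.7 are my own illustrative choices, not a standard.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Label r with the guideline bands (boundary handling is an assumption)."""
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    if size < 0.3:
        band = "weak"
    elif size < 0.7:
        band = "moderate"
    else:
        band = "strong"
    return f"{band} {direction} linear relationship"

# Hypothetical data: one exactly linear relationship, one exactly quadratic.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # y = 2x + 1
y_curved = [v ** 2 for v in x]      # y = x^2, fully determined by x but not linear

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(r_linear, strength(r_linear))
print(r_curved, strength(r_curved))
```

In the second case y is completely determined by x, yet the coefficient comes out at zero because the relationship is not linear. This is exactly the situation in which the coefficient misleads.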



  2. Scatter Plot Analysis:


a. A scatter plot is a graph that plots the paired points (Xi, Yi).

b. If the scatter of points appears to follow a straight line, then the linearity assumption is satisfied and the correlation coefficient provides a meaningful measure.

c. If not, the linearity assumption is not satisfied and the correlation coefficient is questionable.

d. Hence, it is desirable to look at a scatter plot before relying on the correlation coefficient.
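
As a rough illustration of the visual check, the sketch below draws a crude text scatter plot of the paired points using only the Python standard library. The function name text_scatter, the grid size, and the sample data are hypothetical choices of mine; in practice you would normally use a proper plotting package for this.

```python
def text_scatter(xs, ys, width=21, height=9):
    """Return a rough text scatter plot of the paired points (xs[i], ys[i])."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Avoid division by zero when all values of a variable are identical.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point onto the character grid.
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 of the grid is the top line
    return "\n".join("".join(line).rstrip() for line in grid)

# Hypothetical data: a linear relationship and a curved one.
x = list(range(-5, 6))
print(text_scatter(x, [2 * v + 1 for v in x]))   # points fall along a straight line
print()
print(text_scatter(x, [v * v for v in x]))       # points trace a U shape
```

The first plot shows the points lining up along a straight line, so the correlation coefficient is meaningful there. The second shows a clear U shape, so the coefficient should not be trusted for that pair of variables.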