Follow by Email

Aug 29, 2017

Clean or shorten Column names while importing the data itself

When it comes to clumsy column headers namely., wide ones with spaces and special characters, I see many get panic and change the headers in the source file, which is an awkward option given variety of alternatives that exist in R for handling them.






One easy handling of such scenarios is using library(janitor), as name suggested can be employed for cleaning and maintaining. Janitor has function by name clean_names() which can be useful while directly importing the data itself as show in the below example:
" library(janitor); newdataobject <- read.csv("yourcsvfilewithpath.csv", header=T) %>% clean_names() " 

Author undertook several projects, courses and programs in data sciences for more than a decade, views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri

2 comments:

Sanjay Mehrotra said...

Hi Pradeep, read this and will be trying out soon. Thanks.

Dulani said...

I just grabbed a set of Excel data for a project. As always the column names are a nightmare that need to be fixed before proceeding. And then, I think to myself… what was the name of that package I heard about a while ago that addresses this problem (turns to email)? Oh yeah, it was "janitor."

Here's what it looks like in action on my dataset (Australian Bureau of Statistics: 3401.0 - Overseas Arrivals and Departures, Australia, Jun 2017):

install.packages("janitor")
library(janitor)
library(openxlsx)
library(tidyverse)

before <- read.xlsx("../data/australia file.xlsx", sheet = "Data1")
names(before)

Names before:
[1] "X1"
[2] "Number.of.movements.;.Permanent.Settler.Arrivals.;"
[3] "Number.of.movements.;.Long-term.(more.than.one.year).Residents.returning.;"
[4] "Number.of.movements.;.Long-term.(more.than.one.year).Visitors.arriving.;"
[5] "Number.of.movements.;.Permanent.and.Long-term.Arrivals.;"
[6] "Number.of.movements.;.Short-term.(less.than.one.year).Residents.returning.;"
[7] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[8] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[9] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[10] "Number.of.movements.;.Total.Arrivals.;"


after <- before %>%
clean_names()

names(after)

Names after:
[1] "x1"
[2] "number_of_movements_permanent_settler_arrivals"
[3] "number_of_movements_long_term_more_than_one_year_residents_returning"
[4] "number_of_movements_long_term_more_than_one_year_visitors_arriving"
[5] "number_of_movements_permanent_and_long_term_arrivals"
[6] "number_of_movements_short_term_less_than_one_year_residents_returning"
[7] "number_of_movements_short_term_less_than_one_year_visitors_arriving"
[8] "number_of_movements_short_term_less_than_one_year_visitors_arriving_2"
[9] "number_of_movements_short_term_less_than_one_year_visitors_arriving_3"
[10] "number_of_movements_total_arrivals"

Nice… so, I save 5-10 minutes futzing with goofy names, breathe a sigh of relief and get back to work!!!