
Sep 8, 2017

Hard-nosed Indian Data Scientist Gospel Series - Part 2: Certificate (or Degree) Mania

This is the second post in the series; the first is here.

Again, the whole past decade, both before and after the recession, has been revolving around a mania for certificates or degrees around some topic or tool. It may be a subject or concept, namely Analytics, Machine Learning, Data Science, etc., or a tool or technology, namely SAS, SPSS, R, Python, etc. (where prices for such unequal offerings ranged anywhere from the 0,000s to the 000,000s).



This has always reminded me that most marketers duped aspirants around data science by hiding its most important characteristic, namely that it is a multi-disciplinary field. That omission led many to end up with partial, incomplete, or incompetent learning which could not cater to industry needs.


The author has undertaken several projects, courses, and programs in data science over more than a decade; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.

Find more about author at http://in.linkedin.com/in/pradeepmavuluri

Aug 29, 2017

Clean or shorten Column names while importing the data itself

When it comes to clumsy column headers, namely wide ones with spaces and special characters, I see many people panic and change the headers in the source file itself, which is an awkward option given the variety of alternatives that exist in R for handling them.






One easy way of handling such scenarios is the janitor package which, as the name suggests, can be employed for cleaning and maintaining data. Janitor has a function named clean_names() which can be applied while importing the data itself, as shown in the example below:

library(janitor); library(magrittr)  # janitor for clean_names(), magrittr for the %>% pipe
newdataobject <- read.csv("yourcsvfilewithpath.csv", header = TRUE) %>% clean_names()

The author has undertaken several projects, courses, and programs in data science over more than a decade; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri

Aug 24, 2017

Hard-nosed Indian Data Scientist Gospel Series - Part 1: Incertitude around Tools and Technologies


Before the recession, a single commercial tool was popular in the country, and hence uncertainty around tools and technology was limited; after the recession, however, incertitude (i.e. uncertainty) around tools and technology has preoccupied, and continues to preoccupy, data science learning, delivery, and deployment.

While Python continued as a general-purpose programming language, R remained the best choice for many (it became more popular with the advent of an IDE, namely RStudio), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet, in local meetups, panel discussions, and webinars, the author still notices aspirants seeking clarity on which is better, a question of everyday interest around data science, as shown in the image below.

The author has undertaken several projects, courses, and programs in data science over more than a decade; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri

Jul 11, 2016

Big Data Insights: Tale of IT Investments and Returns

Once again, this post brings the audience a predictive analytical insight from huge volumes of information technology security data belonging to two Fortune 500 companies (with more or less similar characteristics). As a quick background to the study, the analytical interest here was to know how both organizations understood and invested in their IT security over a period of time, and what their ROI (Return on Investment) was.

Following my earlier Big Data Insights post, I received many queries about the data; hence, I am publishing the data used for plotting here, for quick experimentation in R. As mentioned above, the volumes were huge, and all initial volumes were processed on an Apache Spark stack in a cloud environment. As usual, the analysis below was carried out using R Programming Language components, viz. R-3.3.1, RStudio (favorite IDE), and the ggplot2 package for plotting.

Now, let's understand the plot below: the x-axis shows the year, ranging from 1999 to 2015, and the y-axis shows the numbers observed for major threats and IT security employees at both organizations (Org). Starting at the year 2000, it is evident that Org A had more threats than Org B, yet both organizations had around 10 IT security employees each (Org A had only a few more employees than Org B; in fact, Org B had one more employee than Org A in the earlier year, 1999). Over the next 2-3 years, however, Org A increased its IT security employees to 20, whereas Org B more or less maintained the same headcount for the next ten years. As a result, Org B reached a stage where its number of major threats exploded beyond the existing team's control, whereas Org A's initial investment in employees worked out better for them: their number of major threats remained more or less stable or even decreased over the period (bear in mind that achieving zero is impossible, given the new technologies and applications arriving every year).

Data employed for the plot:
# output of dput(IT_threats_returns), assigned back so the plot code below runs
IT_threats_returns <- structure(list(Year = c(1999, 1999, 1999, 1999, 2000, 2000, 2000,
2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 2003, 2003, 
2003, 2003, 2004, 2004, 2004, 2004, 2005, 2005, 2005, 2005, 2006, 
2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008, 
2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 
2011, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 
2014, 2014, 2015, 2015, 2015, 2015), Numeric_Value = c(28, 11, 
9, 10, 36, 26, 13, 7, 28, 26, 17, 9, 26, 29, 21, 10, 32, 21, 
19, 9, 25, 34, 19, 10, 30, 35, 20, 10, 22, 27, 19, 10, 31, 42, 
19, 11, 29, 47, 19, 11, 28, 45, 22, 11, 25, 55, 23, 13, 30, 51, 
21, 14, 25, 49, 22, 13, 32, 60, 22, 19, 25, 53, 25, 24, 19, 49, 
25, 29), Desc = c("Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps"
)), .Names = c("Year", "Numeric_Value", "Desc"), row.names = c(NA, 
68L), class = "data.frame")

# code used for plotting
library(ggplot2)

p <- ggplot(IT_threats_returns, aes(x = Year, y = Numeric_Value, col = Desc)) +
  geom_line(linetype = 5, size = 1) +
  theme_light() +
  theme(legend.position = "none") +
  ylab("") + xlab("")

p + annotate("text",
             x = c(2012, 2012, 2004.5, 2012.5),
             y = c(47, 34, 18, 10.5),
             label = c("   `Org_B` : No_of_Major_Threats",
                       "   `Org_A` : No_of_Major_Threats",
                       "   `Org_A` : No_of_IT_Security_Emps",
                       "   `Org_B` : No_of_IT_Security_Emps"),
             col = c("#C77CFF", "#7CAE00", "#F8766D", "#00BFC4"))

Jul 1, 2016

Indian IT Companies' HR Analytics Hurdle

A question has come to me time and again (because of my earlier experience developing an HR platform for a few big Fortune clients): "Why are Indian IT companies not moving towards advanced HR analytics?"


The following is true for more than 90% of Indian IT companies: the animal representing management cannot bypass an important layer, the big animal representing employees, which is only pseudo-big (*); hence, jumping over it is almost impossible when implementing all those insights brought out for, or meant for, employees. Herein, one might guess the missing component, one which most employees feel is of not much use in the Indian IT companies' context…?



Dec 16, 2015

Big Data Insights - IT Support Log Analysis

This post brings the audience a few glimpses (strictly glimpses) of insights obtained from a case in which predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of their IT support system. As a quick background, a large organization was interested in value-added, actionable insights from thousands of records logged in the past, as they saw expenses increase with no gain in productivity.

As most of us know, in these business scenarios end-users are most interested in strange and unusual findings, outside existing knowledge, that may not be captured by regular reports. Hence, the data scientist's job does not end at finding such non-routine insights; it also requires a deeper dig into their root causes and suggesting the best possible actions for immediate remedy (knowledge of the domain, or of other best practices in the industry, helps a lot). Further, as mentioned earlier, only a few of these are shown and discussed here, and all the analysis was carried out using R Programming Language components, viz. R-3.2.2, RStudio (favorite IDE), and the ggplot2 package for plotting.

The first graph (below) is a time-series calendar heat map adapted from Paul Bleicher; it shows the number of tickets raised day-wise over every week of each month for the last year (green and its light shades represent low numbers, whereas red and its shades represent high numbers).
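For readers who want to reproduce the idea, here is a minimal ggplot2 sketch of a calendar-style heat map on synthetic ticket counts (illustrative only, not the original code; assumes an English locale for weekdays() and months()):

library(ggplot2)

# Synthetic daily ticket counts for one year, with a weekend bump (illustrative only)
set.seed(42)
dates <- seq(as.Date("2014-12-01"), as.Date("2015-11-30"), by = "day")
tickets <- data.frame(
  date  = dates,
  count = rpois(length(dates), lambda = 20) +
          ifelse(weekdays(dates) %in% c("Saturday", "Sunday"), 15, 0)
)

# Derive calendar coordinates: month panel, weekday row, approximate week-of-month column
tickets$month   <- factor(months(tickets$date), levels = month.name)
tickets$weekday <- factor(weekdays(tickets$date),
                          levels = c("Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday", "Saturday", "Sunday"))
tickets$week <- (as.integer(format(tickets$date, "%d")) - 1) %/% 7 + 1

# Green shades = low counts, red shades = high counts, as in the original plot
ggplot(tickets, aes(x = week, y = weekday, fill = count)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "green", high = "red") +
  facet_wrap(~ month) +
  labs(x = "Week of month", y = "", fill = "Tickets")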



If one observes the above graph carefully, it is evident that, except for April and December, every month shows a sudden increase in the number of tickets raised on the last Saturdays and Sundays; this was most clearly visible at the quarter ends of March, June, and September (and also in November, which is not a quarter end). One can consider this unusual behaviour, as the numbers rise on non-working days. Before going into further detail, let us also look at one more graph (below), which depicts solved duration in minutes on the x-axis and the respective time taken for each category as a horizontal timeline plot.

The above solved-duration plot shows that, of all records analyzed, 71.87% belong to the "Request for Information" category, and these were solved within a few minutes of the tickets being raised (which is why no line is visible for this category compared to the others). What actually happened here was a kind of spoofing, enabled by a lack of automation in their systems. In simple words, it was found that proper documentation or guidance did not exist for many of the applications in use; this situation was exploited to inflate the number of tickets (i.e. pushing more tickets, even for basic information, at month ends and quarter ends, which resulted in month-end openings that in turn had to be closed immediately). The case discussed here is one among many that were presented, together with possible immediate remedies that are easily actionable.
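A horizontal timeline of this kind can be sketched with geom_segment(); the data below is synthetic and the category names other than "Request for Information" are hypothetical:

library(ggplot2)

# Synthetic solved-duration ranges by ticket category (illustrative only)
solved <- data.frame(
  category = c("Request for Information", "Incident", "Change Request"),
  from_min = c(0, 0, 0),
  to_min   = c(5, 480, 2900)
)

# "Request for Information" resolves within minutes, so its line is barely visible
ggplot(solved, aes(y = category)) +
  geom_segment(aes(x = from_min, xend = to_min, yend = category),
               size = 2, colour = "steelblue") +
  labs(x = "Solved duration (minutes)", y = "")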

Visual Summarization:





Dec 1, 2015

Average Expenses for TV across states of USA

This post attempts to depict the average amounts spent on TV channel expenses across the states of a large country (the USA). Though it was developed using sample data belonging to a particular service provider, the interest here is in regional differences in the average spend on the said service across the country. I would like to note that some US states are economically more important and notably better connected with multiple service providers, and differ in geographical location and population density; hence, the results and insights may be specific to this sample data. All the analysis was carried out using R Programming Language components (R-3.2.2, RStudio (favorite IDE), and ggplot2 for mapping).

Average Amount Spent ($) on TV by States in 2015 (till Nov):

The figure below depicts a map of the 48 US states for which data was available, showing the average TV expense by state for 2015 (data available up to the end of November), with five different colors (i.e. five different intervals of average spend).
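As an illustration of how such a map can be drawn (not the author's original code), here is a minimal choropleth sketch using ggplot2's map_data(), with hypothetical state averages binned into five intervals:

library(ggplot2)
library(maps)  # supplies the state polygons behind map_data("state")

# Hypothetical state-level averages (illustrative only)
set.seed(1)
avg_tv_spend <- data.frame(
  region    = tolower(state.name[!state.name %in% c("Alaska", "Hawaii")]),
  avg_spent = runif(48, min = 30, max = 120)
)
# Five intervals of average spend, matching the five map colors
avg_tv_spend$spend_band <- cut(avg_tv_spend$avg_spent, breaks = 5, dig.lab = 3)

# Join state polygons with the averages and draw the choropleth
us_states <- map_data("state")
choro <- merge(us_states, avg_tv_spend, by = "region")
choro <- choro[order(choro$order), ]  # restore polygon point order after merge

ggplot(choro, aes(x = long, y = lat, group = group, fill = spend_band)) +
  geom_polygon(colour = "white") +
  coord_quickmap() +
  labs(fill = "Avg spent ($)", x = "", y = "")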


As is evident from the above map, for the given sample the North-East region states have the highest average spend on TV. The next best averages (orange color) are noticed in the Pacific region and a few Central and Eastern states. As mentioned earlier, this may be due to economic importance or to the service provider's geographical spread, of which the employed sample data fails to take in-depth note.

The author has undertaken several projects and programs in data science; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail for more details.

Sep 15, 2015

"R" in Top 20 of TIOBE Index

Dear R programmers,


This May (2015), our favorite "R" almost reached 12th position in the popular TIOBE Programming Community Index (TIOBE Index); however, it has experienced some volatility since then and could not move further into the top 10. Currently it holds 19th rank (for this month); here's wishing it retains its position in the top 20 through the rest of the year, and hoping it moves quickly into the top 10. Also, find below my compilation of what R is for analytical solutions (a system for statistical computation and graphics); this does not ignore the latest "machine learning" buzzword, which I treat as part of our statistical computations.



The author of this post (mavuluri.pradeep@gmail.com or pmavuluri@analyticaltis.com) has been using R for complete analytical solutions and educational purposes for a long time. Some third-party copyrighted materials have been reused here under a fair-usage approach for educational purposes. Hence, usage of the current blog post is restricted to educational and informational purposes and falls under the usual copyright terms.

Enjoy "R" Programming!