
Jul 11, 2016

Big Data Insights: Tale of IT Investments and Returns

Once again, this post presents a predictive analytical insight drawn from large volumes of information technology security data belonging to two Fortune 500 companies with broadly similar characteristics. As quick background, the analytical interest was in how each organization understood and invested in its IT security over time, and what return on investment (ROI) each obtained.

After my earlier Big Data Insight post I received many queries about the data, so here I am publishing the data used for plotting, for quick play in R. As mentioned above, the volumes were huge, and all initial processing was done on the Apache Spark stack in a cloud environment. As usual, the analysis below was carried out using R programming language components, viz. R-3.3.1, RStudio (favorite IDE), and the ggplot2 package for plotting.

Now let's understand the plot below. The x-axis shows the year, ranging from 1999 to 2015; the y-axis shows the number of major threats and the number of IT security employees observed at each organization (Org). Looking at the year 2000, Org A clearly has more threats than Org B (36 versus 26), while both organizations had around 10 IT security employees (Org A had a few more than Org B; a year earlier, in 1999, Org B had one more employee than Org A). Over the next 2-3 years, however, Org A increased its IT security staff to about 20, whereas Org B kept roughly the same number of employees for the next 10 years. As a result, Org B reached a stage where its number of major threats exploded and went beyond the existing team's control, whereas Org A's initial investment in employees paid off: its number of major threats stayed more or less stable or decreased over time (don't forget that achieving zero is impossible, given the new technologies and applications arriving every year).

Data employed for the plot:
# dput(IT_threats_returns) output; assigned here so that pasting it recreates the data frame
IT_threats_returns <- structure(list(Year = c(1999, 1999, 1999, 1999, 2000, 2000, 2000, 
2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 2003, 2003, 
2003, 2003, 2004, 2004, 2004, 2004, 2005, 2005, 2005, 2005, 2006, 
2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008, 
2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 
2011, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 
2014, 2014, 2015, 2015, 2015, 2015), Numeric_Value = c(28, 11, 
9, 10, 36, 26, 13, 7, 28, 26, 17, 9, 26, 29, 21, 10, 32, 21, 
19, 9, 25, 34, 19, 10, 30, 35, 20, 10, 22, 27, 19, 10, 31, 42, 
19, 11, 29, 47, 19, 11, 28, 45, 22, 11, 25, 55, 23, 13, 30, 51, 
21, 14, 25, 49, 22, 13, 32, 60, 22, 19, 25, 53, 25, 24, 19, 49, 
25, 29), Desc = c("Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps", 
"Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats", 
"Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps"
)), .Names = c("Year", "Numeric_Value", "Desc"), row.names = c(NA, 
68L), class = "data.frame")

# code used for plotting
library(ggplot2)

# one dashed line per series (Desc); the legend is hidden because the lines are labeled directly
p <- ggplot(IT_threats_returns, aes(x=Year, y=Numeric_Value, col=Desc)) +
  geom_line(linetype=5, size=1) +
  theme_light() +
  theme(legend.position="none") +
  ylab("") + xlab("")

# text labels next to each line, in the default ggplot2 colour of that series
p + annotate("text", x=c(2012, 2012, 2004.5, 2012.5), y=c(47,34,18,10.5),
             label=c("   `Org_B` : No_of_Major_Threats", "   `Org_A` : No_of_Major_Threats",
                     "   `Org_A` : No_of_IT_Security_Emps", "   `Org_B` : No_of_IT_Security_Emps"),
             col=c("#C77CFF", "#7CAE00", "#F8766D", "#00BFC4"))
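To put a number on the comparison described above, one option is to divide each organization's major threats by its IT security headcount for every year. The sketch below is only an illustration: the threats-per-employee ratio and the helper names (`vals`, `series`, `wide`, `ratio`) are assumptions made for this example rather than part of the original analysis, and the data frame is rebuilt compactly from the same 68 values listed in the dput() output.

```r
# Same 68 values as in the dput() output, in the same order
# (per year: Org A threats, Org B threats, Org A employees, Org B employees)
vals <- c(28, 11,  9, 10,  36, 26, 13,  7,  28, 26, 17,  9,  26, 29, 21, 10,
          32, 21, 19,  9,  25, 34, 19, 10,  30, 35, 20, 10,  22, 27, 19, 10,
          31, 42, 19, 11,  29, 47, 19, 11,  28, 45, 22, 11,  25, 55, 23, 13,
          30, 51, 21, 14,  25, 49, 22, 13,  32, 60, 22, 19,  25, 53, 25, 24,
          19, 49, 25, 29)
series <- c("Org_A _ No_of_Major_Threats", "Org_B _ No_of_Major_Threats",
            "Org_A _ No_of_IT_Security_Emps", "Org_B _ No_of_IT_Security_Emps")
IT_threats_returns <- data.frame(Year = rep(1999:2015, each = 4),
                                 Numeric_Value = vals,
                                 Desc = rep(series, times = 17),
                                 stringsAsFactors = FALSE)

# Long to wide: one row per year, one column per series
wide <- tapply(IT_threats_returns$Numeric_Value,
               list(IT_threats_returns$Year, IT_threats_returns$Desc), sum)

# Illustrative metric (an assumption, not the author's): major threats per security employee
ratio <- data.frame(
  Year  = as.integer(rownames(wide)),
  Org_A = round(as.numeric(wide[, series[1]] / wide[, series[3]]), 2),
  Org_B = round(as.numeric(wide[, series[2]] / wide[, series[4]]), 2)
)
print(ratio[ratio$Year %in% c(1999, 2010, 2015), ], row.names = FALSE)
```

On these figures the ratio for Org B climbs from about 1.1 in 1999 to above 4 in 2010, while Org A's falls from about 3.1 to below 1 by 2015, which is consistent with the reading of the plot.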

Jul 1, 2016

Indian_IT_Cos_HR_Analytics_Hurdle

From time to time I am asked (because of my earlier experience developing an HR platform for a few big Fortune clients): “Why have Indian IT companies not moved toward advanced HR analytics?”


The picture below holds for more than 90% of Indian IT companies: the animal representing management cannot bypass an important layer, the big animal representing employees, which is only pseudo-big (*); hence jumping over it to implement the insights brought out for, or meant for, employees is almost impossible. Here one might guess the missing component, which most employees feel is of little use in the Indian IT company context …………….?



Dec 16, 2015

Big Data Insights - IT Support Log Analysis

This post presents a few glimpses (strictly a few) of the insights obtained in a case where predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of its IT support system. As quick background, a large organization wanted value-added, actionable insights from the thousands of records logged in the past, because it was seeing expenses rise with no gain in productivity.

As most of us know, end users in such business scenarios are most interested in the unknown, strange, and unusual things that regular reports do not capture. Hence the data scientist's job does not end at finding non-routine insights; it also requires digging deeper for the root cause and suggesting the best possible actions for immediate remedy (knowledge of the domain or of industry best practices helps a lot). As mentioned earlier, only a few of those insights are shown and discussed here. All the analysis was carried out using R programming language components, viz. R-3.2.2, RStudio (favorite IDE), and the ggplot2 package for plotting.

The first graph (below) is a time series calendar heat map, adapted from Paul Bleicher. It shows the number of tickets raised per day, by week of each month, over the last year (green and its lighter shades represent lower numbers, whereas red and its shades represent higher numbers).
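A calendar heat map of this kind can be approximated in ggplot2 with one tile per day. The sketch below is not Paul Bleicher's function and does not use the client's data: the daily counts are simulated, and the column names (`n`, `weekday`, `week_of_month`) are assumptions made for this example. The plotting step runs only if ggplot2 is installed.

```r
# Simulated daily ticket counts for one year (made-up numbers, NOT the client's logs)
set.seed(1)
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
tickets <- data.frame(date = days, n = rpois(length(days), lambda = 20))

# Calendar coordinates for each day
tickets$month   <- factor(format(tickets$date, "%m"), labels = month.abb)
tickets$weekday <- as.integer(format(tickets$date, "%u"))   # 1 = Monday ... 7 = Sunday
# weekday of the first day of each date's month
first_wd <- as.integer(format(as.Date(format(tickets$date, "%Y-%m-01")), "%u"))
# Monday-based week index within the month (1 to 6)
tickets$week_of_month <- (as.integer(format(tickets$date, "%d")) + first_wd - 2) %/% 7 + 1

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(tickets, aes(x = week_of_month, y = weekday, fill = n)) +
    geom_tile(colour = "white") +
    facet_wrap(~ month, nrow = 2) +
    scale_y_reverse(breaks = 1:7,
                    labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
    scale_fill_gradient(low = "green", high = "red") +   # green = few tickets, red = many
    labs(x = "Week of month", y = "", fill = "Tickets") +
    theme_light()
  # run print(p) to draw the heat map
}
```

With real log data, only the `tickets` data frame needs to be replaced by daily counts aggregated from the ticket timestamps.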



If one observes the above graph carefully, it is evident that, except in April and December, every month shows a sudden increase in the number of tickets raised on its last Saturdays and Sundays; this is most clearly visible at the quarter ends of March, June, and September (and also in November, which is not a quarter end). This is unusual behavior, since the numbers rise on non-working days. Before going into further details, let's also look at one more graph (below), a horizontal timeline plot that shows the solved duration in minutes on the x-axis for each ticket category.

The above solved-duration plot shows that, of all the records analyzed, 71.87% belong to the "Request for Information" category and were solved within a few minutes of being raised (which is why no line is visible for this category, unlike the others). What actually happened here was a kind of spoof, made possible by the lack of automation in their systems. In simple words, there was no proper documentation or guidance for many of the applications in use, and this situation was exploited to inflate the number of tickets: more tickets were pushed, even for basic information, at month ends and quarter ends, which left tickets open at month end and in turn forced their immediate closure. The finding discussed here is one of many that were presented together with immediate remedies that can easily be acted on.
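The two numbers behind this finding, the share of tickets per category and how quickly each category is solved, take only a few lines of base R. The sketch below uses a tiny hand-typed toy log: apart from "Request for Information", the category names, the column names (`category`, `solved_minutes`), and all values are hypothetical and are not taken from the client's records.

```r
# Toy ticket log (hypothetical values, for illustration only)
log_toy <- data.frame(
  category = c(rep("Request for Information", 5), "Incident", "Incident", "Change Request"),
  solved_minutes = c(2, 3, 1, 4, 2, 180, 240, 1440),
  stringsAsFactors = FALSE
)

# Share of tickets per category, in percent
share <- round(100 * prop.table(table(log_toy$category)), 1)

# Median time to solve per category, in minutes
median_minutes <- tapply(log_toy$solved_minutes, log_toy$category, median)

print(share)
print(median_minutes)
```

A category that dominates the ticket count while being solved within minutes, as "Request for Information" does in this toy log, is the pattern worth investigating further.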

Visual Summarization:





Dec 1, 2015

Average Expenses for TV across states of USA

This post attempts to depict the average amount spent on TV channel services across the states of a large country (USA). Although it was developed from sample data belonging to one particular service provider, the interest here is in regional differences in average spending on that service across the country. Please note that some states are economically more important and notably better connected, with multiple service providers, and that geographical location and population density also play a role; the results and insights may therefore be specific to this sample. All the analysis was carried out using R programming language components (R-3.2.2, RStudio (favorite IDE), ggplot2 for mapping).

Average Amount Spent ($) on TV by States in 2015 (till Nov):

The figure below maps the 48 US states for which data was available and shows the average TV expense by state for 2015, through the end of November, using five colors (i.e. five intervals of average spending).


As is evident from the above map, for the given sample the North-East region has the highest average spending on TV. The next-highest averages (orange) appear in the Pacific region and in a few Central and Eastern states. As mentioned earlier, this may be due to economic importance or to the service provider's geographical spread, which the sample data does not capture in any depth.
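A map of this kind rests on two steps: averaging the amounts by state and cutting the averages into five intervals. The sketch below shows both in base R on made-up amounts; the column names (`state`, `amount`, `band`) and the choice of equal-width intervals are assumptions, since the post does not say how its five intervals were chosen. The optional mapping step assumes the ggplot2 and maps packages are installed and relies on `map_data("state")`, whose `region` column holds lowercase state names.

```r
# Made-up customer-level amounts in dollars (NOT the service provider's data)
tv <- data.frame(
  state  = c("maine", "maine", "new york", "new york", "ohio", "ohio",
             "oregon", "texas", "texas", "utah"),
  amount = c(130, 126, 120, 110, 80, 90, 100, 70, 60, 58),
  stringsAsFactors = FALSE
)

# Average spend per state
avg <- aggregate(amount ~ state, data = tv, FUN = mean)

# Five equal-width intervals of the average (assumed binning rule)
avg$band <- cut(avg$amount, breaks = 5)
print(avg)

# Optional: draw the choropleth when the mapping packages are available
if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("maps", quietly = TRUE)) {
  library(ggplot2)
  states  <- map_data("state")                       # polygons for the contiguous states
  plot_df <- merge(states, avg, by.x = "region", by.y = "state", all.x = TRUE)
  plot_df <- plot_df[order(plot_df$order), ]         # restore polygon drawing order after merge
  p <- ggplot(plot_df, aes(x = long, y = lat, group = group, fill = band)) +
    geom_polygon(colour = "white") +
    coord_quickmap() +
    theme_void()
  # run print(p) to draw the map; states without data are shown in grey
}
```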

The author has undertaken several data science projects and programs; the views expressed here come from his industry experience. He can be reached at mavuluri.pradeep@gmail for more details.

Sep 15, 2015

"R" in Top 20 of TIOBE Index

Dear R programmers,


In May 2015, our favorite "R" almost reached 12th position in the popular TIOBE Programming Community Index (TIOBE Index); however, it experienced some volatility after that and could not move further toward the top 10. It currently holds 19th rank (for this month). I hope it retains its position in the top 20 through the rest of the year and moves quickly into the top 10. Also find below my compilation of what R is for analytical solutions (a System for Statistical Computation & Graphics). This does not ignore the latest term, "machine learning"; I treat it as part of our statistical computations too.



The author of this post (mavuluri.pradeep@gmail.com or pmavuluri@analyticaltis.com) has been using R for complete analytical solutions and for educational purposes for a long time. Some third-party copyrighted materials have been reused here under fair use for educational purposes. Hence, use of this blog post is restricted to educational and informational purposes and falls under the usual copyright terms.

Enjoy "R" Programming!