Dec 16, 2019

Tip (2) for R to Python and Vice Versa, Seamlessly

Continuing my earlier R-to-Python tips for dealing with both Python and R simultaneously for client requests, this time the topic is plots, where the two schools still differ considerably in style. Plotnine, a new Python package for the grammar of graphics, makes life easy for enthusiasts of both languages.



Plotnine, which is almost the same in style (> 90%) as ggplot2, claims to be an "implementation of a grammar of graphics in Python, it is based on ggplot2 and allows users to compose plots by explicitly mapping data to the visual objects that make up the plot". Below is a quick example from the new API page:


## Installation
Conda - "conda install -c conda-forge plotnine"
(or)
PIP - "pip install plotnine"

## Quick example from the API page

from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap
from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + stat_smooth(method='lm')
 + facet_wrap('~gear'))
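
For comparison, here is the near-identical ggplot2 version in R (a minimal sketch; mtcars ships with base R, so only ggplot2 is needed):

library(ggplot2)

ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
  geom_point() +
  stat_smooth(method = "lm") +   # linear-model smoother, as in the plotnine call
  facet_wrap(~gear)

Note how only minor syntax differences (quoted aesthetics in plotnine, the placement of the +) separate the two versions.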

Those who have not yet experienced this can try it out now and let me know your experiences at mavuluri.pradeep@gmail.com.

Happy R and Python Programming!

Nov 22, 2019

Machine Learning (ML) helps avoid wasting time on non-responsive telemarketing calls

My post last year was about how ML helped in identifying the best time for telemarketing calls. This one is about not wasting time on non-responsive telemarketing calls.

Views expressed here are from the author’s industry experience. The author trains on and blogs about Machine (Deep) Learning applications; he can be reached at mavuluri.pradeep@gmail.com for further details. Find more about the author at http://in.linkedin.com/in/pradeepmavuluri

Nov 3, 2019

SKU Level Forecasting: Understanding Sales Segments

Understanding your sales segments, especially when data is available at SKU level, will help achieve better forecasts. Look at the following first, instead of reverse-engineering with ML/statistical techniques and wasting time up front.



1) Sales Variation - The foremost thing to do when we have SKU-level data is to look at each SKU's variability relative to its average sales, i.e. the coefficient of variation.

  1a) In my practical forecasting journey, I also tend to look at the zero ratio, which is nothing but the ratio of zero observations to non-zero observations. It tells us a lot about the noise we are going to face (a short R sketch of both measures follows this list).

  1b) Also, I look for active, new and discontinued products, as most of the time clients never provide product life cycles, and we combine or treat them all as the same, forgetting the important impact of the life cycle.

  1c) Further, in my practical forecasting journey, I have seen a lot of people fail to understand seasonal traits. A specific example is when seasonality moves from one time point to another (e.g. from one week to another, as with holidays like Good Friday and Easter). We need to identify it and label it correctly for modelling.

2) Client Provided Info - Exploit demographic or geographic information that already exists in the data, for instance city information, or product line/item information that makes an SKU unique to its category.

3) Generate Segments - In a few cases, I have observed that only bare-minimum data (date/time plus sales) is provided for forecasting purposes. In such scenarios we need to generate segments using unsupervised machine learning techniques; however, as said above, start from the sales variation measures first and then proceed, for better results.
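
To make 1) and 1a) concrete, here is a minimal R sketch of both measures on a hypothetical toy table (sku_sales and its two SKUs are made up for illustration):

library(data.table)

sku_sales <- data.table(sku   = rep(c("A", "B"), each = 6),
                        sales = c(10, 12, 9, 11, 10, 13,   # steady seller
                                  0, 25, 0,  0, 40,  0))   # intermittent seller

sku_stats <- sku_sales[, .(
  cv         = sd(sales) / mean(sales) * 100,      # 1) coefficient of variation (%)
  zero_ratio = sum(sales == 0) / sum(sales != 0)   # 1a) zeros vs non-zeros
), by = sku]
sku_stats

A high cv or zero_ratio flags the intermittent SKUs that deserve separate treatment.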

Once all of the above sales segment analysis is done, the job of identifying a better model is much easier. For instance, a low-volume SKU with many zeros can even be predicted using a moving-average method, which can forecast better than a high-end boosting technique.
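
As an illustration of that last point, a moving-average forecast is a one-liner in R; x below is a hypothetical low-volume series, and the window length k is an assumed choice:

x <- c(0, 2, 0, 0, 1, 3, 0, 0, 2, 0)   # hypothetical low-volume SKU with many zeros
k <- 4                                 # assumed window length
mean(tail(x, k))                       # next-period moving-average forecast: 0.5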

Views expressed here are from author’s industry experience. Author trains on and blogs Machine (Deep) Learning applications; for further details, he will be available at mavuluri.pradeep@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri


My two cents as a pragmatic forecaster,
Happy Forecasting!

Oct 17, 2019

Repetitive Q: Reading Multiple Files in a Zip Folder

Dear Readers,

I keep seeing a repetitive question, addressed to me and across various forums, on how to read multiple files in a zip folder, whether they share one separator or use several. Again, let's not compromise on speed.


The solution is to use the easycsv package in R, which in turn uses the data.table function fread().


Find below a quick example:

library(easycsv)
## Loading required package: data.table
# read all comma-separated CSV files inside the zip via data.table::fread
easycsv::fread_zip("xxxx\\alldata.zip", extension="CSV", sep=",")
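
If the files inside the zip use different separators, data.table's fread can usually auto-detect the delimiter. Assuming fread_zip forwards its sep argument to fread (as the call above suggests), the following should handle mixed separators:

easycsv::fread_zip("xxxx\\alldata.zip", extension="CSV", sep="auto")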
 
Additionally, if you want to read and load large files efficiently, you can refer to the following page:
https://www.kaggle.com/pradeep13/for-r-users-read-load-efficiently-save-time

Happy R Programming!


Views expressed here are from the author’s industry experience. The author trains on and blogs about Machine (Deep) Learning applications; he can be reached at mavuluri.pradeep@gmail.com for further details.
Find more about the author at http://in.linkedin.com/in/pradeepmavuluri

Oct 3, 2019

Forecast Stability Guidance for Model Selection

In real-world forecasting tasks, we do not have the luxury of actuals in hand for model selection; in such situations, forecast stability can guide us to some extent. Forecast stability, in simple terms, is about how forecasts behave versus earlier forecasts, and we can measure it with a simple coefficient of variation. This measure also helps us understand non-randomness across the data. When we have data at SKU (Stock Keeping Unit) level, looking at it regularly provides extra information that can be used for correcting non-randomness, especially for low-volume SKUs. In this notebook exercise, I use three consecutive weeks of actuals and forecasts for 50 SKUs to demonstrate what we can deduce from regular observation of forecast stability with this simple measure.


PS: The demonstration is based on weekly model forecasts, which may come from different or the same models across weeks depending on a selection procedure. The current demonstration selects the model with minimal error among the time series models ARIMA, SARIMA and ETS, using the R package “forecast”.

The link below has the data and R notebook:

https://costaleconomist.blogspot.com/2019/10/forecast-stability-guidance-for-model_1.html


Happy R Programming!

Oct 1, 2019

Forecast Stability Guidance for Model Selection

In real-world forecasting tasks, we do not have the luxury of actuals in hand for model selection; in such situations, forecast stability can guide us to some extent. Forecast stability, in simple terms, is about how forecasts behave versus earlier forecasts, and we can measure it with a simple coefficient of variation. This measure also helps us understand non-randomness across the data. When we have data at SKU (Stock Keeping Unit) level, looking at it regularly provides extra information that can be used for correcting non-randomness, especially for low-volume SKUs. In this notebook exercise, I use three consecutive weeks of actuals and forecasts for 50 SKUs to demonstrate what we can deduce from regular observation of forecast stability with this simple measure.

PS: The demonstration is based on weekly model forecasts, which may come from different or the same models across weeks depending on a selection procedure. The current demonstration selects the model with minimal error among the time series models ARIMA, SARIMA and ETS, using the R package “forecast”.
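
A minimal sketch of such a selection procedure (select_model is a hypothetical helper, not the exact notebook code; y is assumed to be a ts object with weekly frequency):

library(forecast)

select_model <- function(y, h = 1) {
  n     <- length(y)
  train <- window(y, end = time(y)[n - h])        # fit on all but the last h points
  test  <- window(y, start = time(y)[n - h + 1])  # hold out the last h points
  fits  <- list(arima  = auto.arima(train, seasonal = FALSE),
                sarima = auto.arima(train, seasonal = TRUE),
                ets    = ets(train))
  errs  <- sapply(fits, function(f) mean(abs(test - forecast(f, h = h)$mean)))
  fits[[which.min(errs)]]                         # keep the model with minimal error
}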

Below is the data used in this notebook:

skudata <- structure(list(sku_no = 1:50, t_actuals = c(103L, 62L, 52L, 36L, 
37L, 26L, 22L, 12L, 20L, 52L, 52L, 7L, 3L, 5L, 19L, 55L, 74L, 
8L, 1L, 51L, 17L, 18L, 21L, 27L, 114L, 15L, 5L, 12L, 21L, 19L, 
17L, 35L, 73L, 83L, 27L, 25L, 21L, 0L, 0L, 32L, 10L, 13L, 19L, 
5L, 8L, 0L, 3L, 84L, 5L, 73L), t_forecasts = c(79L, 51L, 39L, 
28L, 28L, 43L, 34L, 9L, 27L, 29L, 15L, 4L, 2L, 2L, 11L, 34L, 
31L, 18L, 11L, 39L, 22L, 13L, 24L, 15L, 126L, 5L, 8L, 9L, 18L, 
1L, 1L, 25L, 68L, 73L, 26L, 32L, 30L, 0L, 0L, 16L, 13L, 8L, 20L, 
9L, 34L, 18L, 6L, 96L, 1L, 67L), tplus1_actuals = c(78L, 43L, 
40L, 21L, 16L, 26L, 15L, 28L, 36L, 26L, 11L, 3L, 3L, 12L, 39L, 
47L, 3L, 4L, 31L, 19L, 7L, 20L, 4L, 201L, 9L, 17L, 8L, 1L, 7L, 
19L, 35L, 20L, 20L, 0L, 0L, 30L, 21L, 16L, 46L, 8L, 15L, 11L, 
0L, 14L, 170L, 42L, 22L, 70L, 52L, 10L), tplus1_forecasts = c(60L, 
43L, 27L, 26L, 21L, 22L, 12L, 21L, 43L, 38L, 8L, 6L, 7L, 10L, 
33L, 29L, 25L, 13L, 28L, 22L, 11L, 18L, 13L, 158L, 11L, 8L, 13L, 
1L, 1L, 16L, 46L, 21L, 22L, 1L, 1L, 29L, 13L, 9L, 56L, 11L, 12L, 
11L, 18L, 11L, 149L, 42L, 27L, 66L, 52L, 16L), tplus2_actuals = c(65L, 
32L, 27L, 13L, 14L, 11L, 10L, 16L, 44L, 27L, 10L, 6L, 4L, 11L, 
20L, 43L, 8L, 8L, 23L, 17L, 16L, 12L, 11L, 186L, 10L, 12L, 16L, 
0L, 0L, 21L, 72L, 15L, 21L, 0L, 0L, 16L, 6L, 3L, 36L, 14L, 6L, 
16L, 0L, 9L, 139L, 74L, 34L, 79L, 27L, 8L), tplus2_forecasts = c(70L, 
44L, 32L, 26L, 22L, 22L, 10L, 17L, 45L, 39L, 8L, 7L, 8L, 11L, 
30L, 32L, 25L, 13L, 28L, 22L, 10L, 18L, 15L, 157L, 7L, 8L, 13L, 
1L, 1L, 32L, 95L, 21L, 22L, 1L, 1L, 29L, 9L, 7L, 81L, 11L, 11L, 
41L, 18L, 10L, 78L, 84L, 54L, 134L, 43L, 16L)), class = "data.frame", row.names = c(NA, 
-50L))

Since it is all about forecasts vs actuals, let's start with the actuals plot. In the plot below, the dark red line represents week one actuals, the steel blue two-dash line week two, and the orange line week three. It is evident that weeks two and three follow a more similar pattern to each other than to week one.

library(ggplot2)
ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = t_actuals), color = "darkred", size=1) + 
  geom_line(aes(y = tplus1_actuals), color="steelblue", linetype="twodash", size=1) +
  geom_line(aes(y = tplus2_actuals), color="orange", linetype=5, size=1) + 
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="3 Week Actuals Plot")

Now let's look at the forecasts. The forecast plot below shows more or less the same thing: week one forecasts differ, whereas weeks two and three share the same pattern.

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = t_forecasts), color = "darkred", size=1) + 
  geom_line(aes(y = tplus1_forecasts), color="steelblue", linetype="twodash", size=1) +
  geom_line(aes(y = tplus2_forecasts), color="orange", linetype=5, size=1) + 
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="3 Week Forecasts Plot")

PS: The model selected for weeks one and two was SARIMA; for week three it was ETS.

Now, let's look at each week's actuals vs forecasts:

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = t_actuals), color = "darkred", size=1) + 
  geom_line(aes(y = t_forecasts), color="darkgreen", linetype="twodash", size=1) +
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="Week One Actuals vs Forecasts Plot")

In week one, the model was able to capture the peaks (let's call this the non-randomness) at SKUs 25, 33, 34, 48 and 50, but failed to cope with the other SKUs' actual numbers. This week's forecast accuracy, calculated as 100% minus the weighted absolute percent error (the sum of absolute errors divided by the sum of actuals), turned out to be 66.31%.
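
For reference, this figure can be reproduced from the skudata above (a minimal check, using accuracy = 100% minus the weighted absolute percent error):

wape1 <- sum(abs(skudata$t_actuals - skudata$t_forecasts)) / sum(skudata$t_actuals) * 100
100 - wape1   # ~66.31%, the week one accuracy quoted above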

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = tplus1_actuals), color = "darkred", size=1) + 
  geom_line(aes(y = tplus1_forecasts), color="darkgreen", linetype="twodash", size=1) +
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="Week Two Actuals vs Forecasts Plot")

In week two, the model was able to capture both the peaks and the actuals, and the week two forecast accuracy turned out to be 76.01%.

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = tplus2_actuals), color = "darkred", size=1) + 
  geom_line(aes(y = tplus2_forecasts), color="darkgreen", linetype="twodash", size=1) +
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="Week Three Actuals vs Forecasts Plot")

Coming to week three, the model captured the peaks and actuals only partially, for a few SKUs and not for all, and the week three forecast accuracy turned out to be 59.38%. Surprisingly, week three accuracy is even lower than week one.

Now let's look at an extra measure, forecast stability, which here is nothing but the coefficient of variation of the absolute differences between two consecutive weeks' forecasts. It helps us understand what is going on with the forecasts across the three weeks and compare them with the available actuals.

Week One vs Week Two Forecast Stability:

ferror12 <- abs(skudata$t_forecasts-skudata$tplus1_forecasts)
cv12 <- (sd(ferror12)/mean(ferror12))*100;
cv12
## [1] 134.5675

Week Two vs Week Three Forecast Stability:

ferror23 <- abs(skudata$tplus1_forecasts-skudata$tplus2_forecasts)
cv23 <- (sd(ferror23)/mean(ferror23))*100;
cv23
## [1] 214.9029

From the above two measures, it is clear that the forecasts between weeks two and three are more unstable by the coefficient of variation measure; that is, something went wrong in the model selection. Now, to understand how these two stability measures can guide us, let us plot the forecasts we actually used to calculate stability.

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = t_forecasts), color = "darkred", size=1) + 
  geom_line(aes(y = tplus1_forecasts), color="darkgreen", linetype="twodash", size=1) +
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="Week One vs Two Forecasts Plot")

ggplot(skudata, aes(x=sku_no)) + 
  geom_line(aes(y = tplus1_forecasts), color = "darkred", size=1) + 
  geom_line(aes(y = tplus2_forecasts), color="darkgreen", linetype="twodash", size=1) +
  theme_light() + theme(legend.position="none") + ylab("") + xlab("SKU Number") + labs(title="Week Two vs Three Forecasts Plot")

From the first plot (week one vs two forecasts), it is evident that there is a lag effect which made the week two forecasts so good: the auto-regressive and moving-average effects were captured appropriately across all SKUs in week two with the inputs received from week one, hence the better accuracy in week two (the SARIMA model selection was appropriate). However, coming to weeks two and three, week three could not perform well across all SKUs, as the procedure went for a smoothing model based on last week's input; without the simple stability measure, one would have concluded this only after receiving the actuals.

Thus, by regularly employing this measure of forecast stability, one can get up-front guidance on whether the selected model is a good one or whether another should be selected.

My two cents for the forecasting world,

Happy R Programming!