Analytics Log - Adil Khan

View Original

Data Transformation: Preparing E-comm Product Data For Univariate Time Series Forecast In R

This blog post will look at converting product sales data to make it ready for time series forecasting in R programming. Let’s use the below example dataset where we’d like to use time series data for John Wick OR John Wick 2 only between 1/2/2021 and 1/7/2021 as they are on sale right now [we don’t care about other items in the dataset]

Here, a few things stand out:

  1. There is no row for 1/4/2021

  2. Dates are duplicated as multiple items were sold on a particular day

In essence, we want to do the following:

  1. Mark all rows that contain John Wick OR John Wick 2 as rows that need to be selected , say : “In offer”

  2. Extract all such rows where condition “In offer” is met

  3. Group data by date

  4. Decide how to tackle missing row(s) [Remember : 1/4/2021 is missing from dataset]

  5. Run forecast [separate post: See blog post on running time series forecasts via modeltime package in R]

Let’s start with loading the data in R.

When you use Import the dataset, the date comes in double format. Click on the drop-down and change it to Date type.

Next, when you run str(data) in R, Items field shows as Character [not Factor i.e. Categorical variable] and date shows as POSIXct.

Therefore, we can now convert these two variables. Date to as.Date and Item as as.Factor.

data$Item <- as.factor(data$Item)

`data$Date<- as.Date(data$Date)`

See this content in the original post

In above block of code, there are 3 main things that are happening:

  1. We’re marking rows that contain John Wick OR John Wick 2 via Regex, mutate function and str_detect function as “In offer” while other rows get marked as “Not in offer”

  2. Two subset conditions are being set. A. The date has to be greater than 1/1/2021 and inOffer column val == “In offer”

  3. We’re grouping the data by Date and doing a sum for each day [exactly how a pivot table in Excel would]

Your dataset should now look like below with no data for 1/3/2021 and 1/4/2021 [as expected]

We now need to add the missing dates. By doing so, it will first lead to NA values for such dates. As a result, we then need to add zeroes for missing row values.

There is a package called padr - which adds missing rows.

You can now use dplyr to replace zeroes via replace function

See this content in the original post

You can add the missing datea via padr or via the complete funciton.

Adding in the replace function, it checks for NA values and replaces them with zero.

Once you’re done, you can now run time series forecasts using the modeltime package in R.