BigQuery ML + Google Analytics: Forecast sales For Multiple SKU's Using A Single Query + Model

BigQuery ML + Google Analytics: Forecast sales For Multiple SKU's Using A Single Query + Model

While BigQuery ML documentation does have a demo on multiple time series forecasting using the NYC Citi bike trips dataset, I thought it would be helpful to apply this concept in the context of Google Analytics data in BigQuery, especially e-commerce. This post does look at forecasting quantities expected to be sold over X days but the query can be changed to any hit data that’s available in BigQuery.

The dataset here is the Google merchandise store data, available between Aug 2016 and Jul 2018. I’ll be using a shorter period for the demo though.

Note: This post is not from a stats perspective so it doesn’t dive into fine tuning ARIMA model or working with coefficients etc, those are topics on their own. The focus is solely on GA > BigQuery > Model > DataStudio for forecast to work with a wider audience.

Let’s start with the basics.

You need to have a Google Cloud Platform project, enable the BigQuery API and create a dataset with US location. This is required otherwise the query will fail [because the public datasets are stored in US multi-region location].

The query:

In my example, I want to build forecasts for quantities expected to be sold for two particular SKU’s on the merchandise store. There are two portions here: A. The query to retrieve the data. B. The model building part.

Let’s focus on the query. I’ll first paste the query part and then talk about it line-by-line. Shout out to Mr. Mikhail Berlyant on StackOverflow for helping tweak the query.

Here’s the query to fetch the data:

SELECT
PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date, 
prod.productSKU AS SKU,
IFNULL(SUM(prod.productQuantity),0) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, 
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170228'
AND
REGEXP_CONTAINS (prod.productSKU,"GGOEGGOA017399|GGOEGOXQ016399")
GROUP BY parsed_date, SKU
ORDER BY parsed_date ASC

Break it down, pls!

What columns to SELECT?

SELECT
PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date, 
prod.productSKU AS SKU,
IFNULL(SUM(prod.productQuantity),0) AS qty_purchased

So, we’re telling BigQuery to SELECT three columns from the dataset. First one is the date converted from YYYYMMDD to a timestamp. Second is the SKU name and third is the quantity sold. IFNULL adds a 0 to qty_purchased IF the condition is met.

FROM which table?

FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, 

This line goes through the Google Analytics dataset and the wildcard at the end means that it will crawl through all the partitioned tables [unless you enter date operators, which we will below]

Flatten the table via UNNEST

UNNEST (hits) hit, UNNEST(product) prod

This is where I got stuck and headed over to SO for help. UNNEST helps in flattening the table. In my original query, I had only flattened the hits [as they form a session] but the solution was to flatten the products as well. Why? Goes back to the basic Google Analytics scope which could be user, session, hit or product. In a single transaction, you could have multiple products. Hence, the need to flatten products as well.

Any conditions to check WHERE should BigQuery fetch the data?



WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170228'
AND
REGEXP_CONTAINS (prod.productSKU,"GGOEGGOA017399|GGOEGOXQ016399")

I’ve put two conditions here. First, between Jan 1, 2017 and Feb 28, 2017 AND a REGEXP_CONTAINS parameter where it will search through the prod.productSKU field for SKUs “GGOEGGOA017399” OR “GGOEGOXQ016399”, separated by OR condition via the pipe symbol.

Now that we have the query, we then build a model.

CREATE OR REPLACE MODEL bqml_tutorial.ga_ts_multiple_prod_model_with_top_two_skus
OPTIONS
(model_type='ARIMA',
time_series_timestamp_col = 'parsed_date',
time_series_data_col = 'qty_purchased',
time_series_id_col = 'SKU')

Straightforward enough. Model uses (auto ARIMA) with the time-series being the parsed_date column, the field to forecast is the qty_purchased while the “pivot” would be on SKU. This would, in effect, create multiple time series forecast models.

Putting the query and model together. Run the below.

CREATE OR REPLACE MODEL bqml_tutorial.ga_ts_multiple_prod_model_with_top_two_skus
OPTIONS
(model_type='ARIMA',
time_series_timestamp_col = 'parsed_date',
time_series_data_col = 'qty_purchased',
time_series_id_col = 'SKU')
AS 
SELECT

PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date, 
prod.productSKU AS SKU,
IFNULL(SUM(prod.productQuantity),0) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, 
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170228'
AND
REGEXP_CONTAINS (prod.productSKU,"GGOEGGOA017399|GGOEGOXQ016399")
GROUP BY parsed_date, SKU
ORDER BY parsed_date ASC

You need to evaluate the model and inspect co-efficients, which are available via below two queries.

#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL bqml_tutorial.ga_ts_multiple_prod_model_with_top_two_skus)
#standardSQL
SELECT
  *
FROM
  ML.ARIMA_COEFFICIENTS(MODEL bqml_tutorial.ga_ts_multiple_prod_model_with_top_two_skus)

Building forecasts in BigQuery

 
  SELECT
  *
FROM
  ML.FORECAST(MODEL bqml_tutorial.ga_ts_multiple_prod_model_with_top_two_skus,
              STRUCT(5 AS horizon, 0.9 AS confidence_level))

In this case, we’re saying build a forecast for the next 5 days [using Jan 1 - Feb 28, 2017 data in the model] using 90% confidence interval. Run the query > Explore in DataStudio > You now have the freedom to show it however you want.

yes, i did notice the negative values.

yes, i did notice the negative values.

At it’s core, the query works. Depending on your actual data, you might not see such values. Again, this was just a demo with demo values. I would also need to address this via a separate blog post on how to tweak the model to only show non-zero values.

That’s it. If you want to have a fixed start date and a rolling end date, you can use the below solution that I found here: https://datarunsdeep.com.au/blog/use-our-bigquery-statement-builder-better-understand-behaviours-your-customers

AND _TABLE_SUFFIX BETWEEN '20200101' AND FORMAT_DATE( '%Y%m%d', DATE_SUB( CURRENT_DATE ('Australia/Melbourne'), INTERVAL 1 DAY))
 

Hope you found this post helpful.

Temporarily Exclude Users Option vs Permanent - Custom Audiences In Google Analytics 4

Temporarily Exclude Users Option vs Permanent - Custom Audiences In Google Analytics 4

BigQuery for Google Analytics App + Web: Get User Counts By Date Between Two Fixed Dates

BigQuery for Google Analytics App + Web: Get User Counts By Date Between Two Fixed Dates