Markov Chain Attribution Modelling With Google Analytics Data In R Programming
Background: How do you assign credit to channels? Google Analytics > Conversion > MCF > Model comparison has a few models [not including the ‘Data Driven’ model] that assign credit based on First/Last/Linear/Weight/Time decay rules. The ‘Data Driven’ one doesn’t tell you the details though.
In trying to understand the process and steps, I went through three main sources:
https://www.bounteous.com/insights/2016/06/30/marketing-channel-attribution-markov-models-r/ [which has the script]
https://www.youtube.com/watch?v=mZconLBGeqM [This webinar video explained the transition effect in the easiest way, IMO]
https://www.channelattribution.net/ [Creators of the ChannelAttribution R package]
Markov Chain model in a few sentences: Instead of being purely historical and simplistic like the MCF models, it uses a probabilistic approach with three main terms: Current State [the channel the user is currently at in the conversion journey], Transition Probability [the probability of users moving from one channel to the next] and Removal Effect [what happens to the conversion probability when one particular channel is removed from the journey? The drop in conversion probability between the regular journey and the one with that channel removed is the contribution of that channel to the conversion].
Sounds like a lot. Yep! I had to watch this video by Decision Analyst to try and understand it better.
Assume that in the above example A/B/C/D are channels instead of websites. Over here, you can see that 60% of users start at A while 40% start at B. Also, 50% of users move from A to D while another 50% move from A to B. These are the transition probabilities.
In calculating the removal effect, channel A was removed from the journey. Therefore, the only journey left was Channel B > C > Conversion. The difference between the two conversion probabilities gives 0.15 credit [%] to Channel A.
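To make the removal effect concrete, here is a minimal base R sketch on a toy chain. Only the start > A/B and A > D/B probabilities come from the example above; every other probability is assumed, so the result will not reproduce the 0.15 from the video. The function names `conversion_prob` and `remove_channel` are made up for this illustration, and this is not the package's internal code.

```r
# Toy absorbing Markov chain. Only start -> A/B and A -> D/B come from the
# example above; every other probability is assumed for illustration.
states <- c("start", "A", "B", "C", "D", "conv", "null")
P <- matrix(0, 7, 7, dimnames = list(states, states))
P["start", c("A", "B")]    <- c(0.6, 0.4)
P["A", c("D", "B")]        <- c(0.5, 0.5)
P["B", c("C", "null")]     <- c(0.5, 0.5)  # assumed
P["C", c("conv", "null")]  <- c(0.6, 0.4)  # assumed
P["D", c("conv", "null")]  <- c(0.5, 0.5)  # assumed
P["conv", "conv"] <- 1  # absorbing state: converted
P["null", "null"] <- 1  # absorbing state: left without converting

# Probability of eventually reaching "conv" when starting from "start".
conversion_prob <- function(P) {
  transient <- setdiff(rownames(P), c("conv", "null"))
  Q <- P[transient, transient, drop = FALSE]  # transient -> transient
  r <- P[transient, "conv"]                   # transient -> conv in one step
  x <- solve(diag(length(transient)) - Q, r)  # solves (I - Q) x = r
  names(x) <- transient
  unname(x["start"])
}

# Removing a channel: any step into it ends the journey in "null" instead.
remove_channel <- function(P, channel) {
  P[, "null"] <- P[, "null"] + P[, channel]
  P[, channel] <- 0
  P[channel, ] <- 0
  P[channel, "null"] <- 1
  P
}

channels <- c("A", "B", "C", "D")
base <- conversion_prob(P)
removal_effect <- sapply(channels, function(ch) {
  1 - conversion_prob(remove_channel(P, ch)) / base
})

# Normalise the removal effects so the credit shares add up to 1.
credit_share <- removal_effect / sum(removal_effect)
print(round(credit_share, 3))
```

The bigger the drop in conversion probability when a channel is taken out, the bigger its share of the credit.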
This is what the package does at scale in R programming.
The Problem with Google Analytics data?
If you looked carefully at the above screenshots, for the Markov Chain model to work, it needs to know the null conversion paths, i.e. the paths that did not result in any conversion. This is a problem, as Google Analytics > Conversion > MCF > Top conversion paths only gives the paths for completed conversions, NOT null conversions.
To tackle this, one solution I read about is to create a goal that counts every visit as a conversion. Blog by Jules Stuifbergen > https://stuifbergen.com/2016/11/conversion-attribution-markov-model-r/
You can do this in GA by setting up a goal where the destination matches regex .* . This will, in effect, record every session as a goal. It’d be better if you created a separate view for this and copied the previous goals over. So in your new view, you would have existing goals + 1 new goal “All visits”
By having this goal, you can now get the “conversion paths” for the All Visits goal. With some manual work, your All Visits goal minus the actual conversions goal = paths that did NOT result in a goal. Since you have the conversion paths for both these goals, you can now reframe the inputs to show which conversion paths resulted in an actual goal and which ones didn’t.
I know it sounds a bit messy but it’s coming along [sorta]. We’ll work with an example from my website.
If you check the sample data on channelattribution.net and in the below Shiny app, your columns are path, total_conversions, total_conversion_value and total_null [where conversions didn’t take place].
Preparing The Data
In order to prepare this in Excel, you need to do two downloads from the MCF report: 1. With the All Visits goal and 2. With the Actual goal. In this example, I’m defining a conversion as ‘people who visit the About Me page on my site’.
With both sheets side by side, you can run a good ol’ INDEX/MATCH to combine the data and have All Visits count and About Me conversion count for each path. Once you know this, the total_null column is just All Visits minus About Me conversions.
You can now save columns A, C, D and E as a separate CSV with comma as a separator.
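If you'd rather skip Excel, the same INDEX/MATCH step can be sketched in base R. The two data frames below stand in for the two MCF downloads; their column names, paths and counts are made up for illustration, and the output file name is hypothetical.

```r
# Hypothetical MCF exports: column names and counts are made up.
all_visits <- data.frame(
  path = c("Organic Search", "Direct", "Organic Search > Direct", "Social"),
  all_visits = c(100, 80, 20, 10),
  stringsAsFactors = FALSE
)
about_me <- data.frame(
  path = c("Organic Search", "Direct", "Organic Search > Direct"),
  total_conversions = c(10, 12, 5),
  total_conversion_value = c(10, 12, 5),
  stringsAsFactors = FALSE
)

# Left join on path: the equivalent of INDEX/MATCH from the All Visits sheet.
combined <- merge(all_visits, about_me, by = "path", all.x = TRUE)

# Paths that never converted are missing from the goal export, so fill with 0.
for (col in c("total_conversions", "total_conversion_value")) {
  combined[[col]][is.na(combined[[col]])] <- 0
}

# Null paths = all visits on that path minus the ones that converted.
combined$total_null <- combined$all_visits - combined$total_conversions

model_input <- combined[, c("path", "total_conversions",
                            "total_conversion_value", "total_null")]

# Written to a temp folder here; point this wherever you keep your data.
out_file <- file.path(tempdir(), "attribution_input.csv")
write.csv(model_input, out_file, row.names = FALSE)
```

Note the `all.x = TRUE`: without it, paths that never converted [like Social here] would drop out of the join, and those are exactly the null paths the Markov model needs.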
Ok, once we’re in R, there are a few steps.
Load the libraries and data
Run the regular models [first / last / linear model]
Run the Markov model
Merge the two models into one df
Select which columns to use
Create ggplot
When I load the data in R, for some reason, the ‘path’ variable shows up as ‘i..path’. No idea why this is the case. If anybody can help via comments, that’d be great.
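One likely explanation, offered as an assumption since the original file isn't shown: Excel can save CSVs with a UTF-8 byte order mark [BOM] at the start of the file, and `read.csv` then treats those invisible bytes as part of the first column name. A small sketch that simulates such a file and reads it with the encoding declared:

```r
# Simulate a CSV saved with a UTF-8 byte order mark (bytes EF BB BF).
# The file contents are made up for illustration.
csv_file <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("path,total_conversions\nDirect,5\n")), csv_file)

# Declaring the encoding lets R drop the BOM before parsing the header.
df <- read.csv(csv_file, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
print(names(df))
```

If that isn't the cause in your case, simply renaming the column with `names(df)[1] <- "path"` works too.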
Next up, we’re going to load the regular models and Markov.
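A rough sketch of this step, assuming the ChannelAttribution package is installed and your data frame has the four columns described earlier. `run_attribution` is a hypothetical helper name; `heuristic_models()` and `markov_model()` are the package's functions.

```r
# Sketch only: run_attribution() is a hypothetical helper name. It assumes the
# ChannelAttribution package is installed and that df has the columns
# path, total_conversions, total_conversion_value and total_null.
run_attribution <- function(df) {
  # Heuristic models: first touch, last touch and linear.
  H <- ChannelAttribution::heuristic_models(
    df,
    var_path = "path",
    var_conv = "total_conversions",
    var_value = "total_conversion_value"
  )
  # First-order Markov model; var_null passes in the non-converting paths.
  M <- ChannelAttribution::markov_model(
    df,
    var_path = "path",
    var_conv = "total_conversions",
    var_value = "total_conversion_value",
    var_null = "total_null",
    order = 1
  )
  list(heuristic = H, markov = M)
}

# Usage, with a hypothetical file name for the CSV you saved:
# models <- run_attribution(read.csv("attribution_input.csv"))
```

Both results come back as data frames with one row per channel, keyed by `channel_name`.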
We then merge the two models and choose which columns to pick. So, in R1 below, we’re picking up the total conversions [counts] as the factor on which to judge the models. You could replace it with total_conversion_value.
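The merge and column selection can be sketched as below. The two data frames are stand-ins for the model outputs with made-up numbers; the column names follow what recent versions of the package return, but treat them as an assumption and check `names()` on your own output [older versions may call the Markov column `total_conversion`].

```r
# Hypothetical model outputs; the numbers are made up for illustration.
H <- data.frame(
  channel_name = c("Direct", "Organic Search", "Social"),
  first_touch_conversions = c(8, 15, 4),
  last_touch_conversions = c(12, 11, 4),
  linear_touch_conversions = c(10, 13, 4),
  stringsAsFactors = FALSE
)
M <- data.frame(
  channel_name = c("Direct", "Organic Search", "Social"),
  total_conversions = c(12.5, 10.5, 4),
  stringsAsFactors = FALSE
)

# One row per channel with all four models side by side.
R <- merge(H, M, by = "channel_name")

# Keep only the conversion counts and give the columns shorter labels.
R1 <- R[, c("channel_name", "first_touch_conversions",
            "last_touch_conversions", "linear_touch_conversions",
            "total_conversions")]
names(R1) <- c("channel_name", "first_touch", "last_touch",
               "linear_touch", "markov_model")

# Long format: one row per channel/model pair, which suits a grouped bar chart.
long <- data.frame(
  channel_name = rep(R1$channel_name, times = 4),
  model = rep(names(R1)[-1], each = nrow(R1)),
  conversions = unlist(R1[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

To compare on value instead of counts, you would pick the matching `*_value` columns in the selection step.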
Lastly, you load ggplot2 for visualization and compare the models side by side.
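A minimal plotting sketch, assuming ggplot2 is installed and your results are in long format with the columns `channel_name`, `model` and `conversions`. Those column names and the helper name `plot_models` are choices made for this example, not something the package dictates.

```r
# Sketch only: plot_models() is a hypothetical helper. It expects a long-format
# data frame with columns channel_name, model and conversions.
plot_models <- function(long_df) {
  ggplot2::ggplot(
    long_df,
    ggplot2::aes(x = channel_name, y = conversions, fill = model)
  ) +
    ggplot2::geom_col(position = "dodge") +  # bars side by side per channel
    ggplot2::labs(x = "Channel", y = "Attributed conversions", fill = "Model") +
    ggplot2::theme_minimal()
}

# Usage, with your own long-format data frame:
# print(plot_models(long_df))
```

Dodged bars put the models next to each other for every channel, which makes the shifts in credit easy to spot.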
If you read the above chart, judging by the First Touch model, Organic Search is the most important channel. However, moving to the Last Touch model, the contribution of Direct traffic increases, and it is highest in the Markov model. Note that the Last Touch model above is not the same as Last Non-Direct Click, which is the default method of channel attribution in Acquisition reports. The default expiry period for campaign params is six months. For any conversion within that period, the campaign channel gets the credit, not Direct [even if you typed out the site].
If you want to read more about how Direct traffic is calculated [with examples], head over to Avinash Kaushik’s blog post on it. https://www.kaushik.net/avinash/excellent-web-analytics-tip-analyze-direct-traffic/