Tidyquery: Query DataFrames With SQL Syntax Inside R Programming
I recently had some e-commerce product sales data and wanted to check on how to quickly run SQL type queries on a single table dataset. Instead of switching platforms, decided to stay within R and search for solutions. Came across the Tidyquery package and it’s supposed to only query data as opposed to SQLdf package that load df data into a database format > https://github.com/ianmcook/tidyquery
Here’s a sample dataset where each row represents the SKU sales within a particular transaction.
OrderId values are repeated when a transaction contains multiple products. Example: Rows 1 and 2 where US12398 is in both rows as the transaction contained SKU A and SKU B
Instead of going inside Excel and running Distinct Counts on unique transaction counts, filtering by date and category, let’s do this all inside a single SQL query in R.
Once you load the library(tidyquery), you can use the Query function to…query.
Our sample dataset is really small so let’s go ahead with SELECT * FROM
Ok, so let’s try a realistic query within the dataframe.
Between 1/1/2021 and 1/7/2021, count the # of unique transactions, sum total quantities sold where Category ==Category1 and organize the data by Store [OrderId has the stores [US/CA/UK]
Our results should give 3 transactions and 28 quantities sold in total, grouped by the store.
Let’s query the data in progression, building up the complexity and then show how the Store can be extracted and chained via dplyr %>% to run the SQL query immediately [full query towards the end].
As the rows can contain the same OrderId, we ran a COUNT_DISTINCT on the dataset.
We now added a column to sum the total quantity sold.
We now added a condition WHERE Category equals ‘Category1’
By adding the date condition as a nested parameter, we effectively have three conditions for WHERE. 1. Category == Category1, Date should be BETWEEN 1/1/2021 and 1/7/2021
A note on the Date condition. In order for Tidyquery to read the dates, we need to use TO_DATE in the query. This is slightly different from regular SQL syntax. I got stuck with this aspect and sought the answer on StackOverflow.
Ok, so we saw how the progressed bit by bit. Now, we need to show the data at the store level. At the moment, there is no column for Store, so let’s create one!
str_detect is being used to run IFELSE on the OrderId column and create a new column called Store with the appropriate values.
You can also chain the dplyr mutate function in the same Tidyquery to execute it all in one query.
The GROUP_BY aspect in the Tidyquery groups data by Store.
That’s it. This was a rather simple example of how to use SQL syntax in R Programming. Did you find it helpful?
The github page by the creator has more details on the package. https://github.com/ianmcook/tidyquery