Query Language
The best way to get a sense of how the Dictum query is structured is learning the query language. It looks similar to SQL and implements all the same structural elements that are used in Python API and Altair visualizations.
A Dictum query consists of several parts:
- Request for one or more metrics
- Request for one or more dimensions that we want to group the metrics by (with optional transforms)
- A list of dimension filters to apply
- A list of "limits" (metric filters) to apply
Setting up Jupyter¶
Let's load an example project first.
from dictum import Project
project = Project.example("chinook")
You can use the query language by passing the query to the ql
method of Project
:
project.ql("select revenue")
Revenue | |
---|---|
0 | $2,328.60 |
Writing the query as a string is cumbersome, so we can do better: set up Jupyter to understand the queries written directly in the cell.
You can use %ql
magic to write a query in a single line and %%ql
to mark the whole
cell as a query for your project.
project.magic()
%ql select revenue
The magic is registered, now you can use %ql and %%ql to query Chinook Example project
Revenue | |
---|---|
0 | $2,328.60 |
Storing query results in a variable¶
project.ql(...)
returns a special Ql
object, not a Pandas DataFrame
. To get the
DataFrame
, use the df
method.
df = project.ql("select revenue").df()
df
revenue | |
---|---|
0 | 2328.6 |
By default the returned dataframe is not formatted like the query output, because it is
assumed that you want to do something with the values. If you want to keep the formatting,
use format=True
. Note that in this case, all values in the dataframe are strings.
df = project.ql("select revenue").df(format=True)
df
Revenue | |
---|---|
0 | $2,328.60 |
When you're using the %%ql
magic to query Dictum, you can still output the dataframe
into a local variable. Just use the -r
(--result
) CLI-like argument. If you want to
keep the formatting, use -f
(--format-result
).
%%ql -f -r magic_df
select revenue
Returned 1 rows to magic_df
magic_df
Revenue | |
---|---|
0 | $2,328.60 |
Selecting metrics¶
You can query metrics by listing their identifiers after the select
keyword, just like
you select columns in SQL. You can select any number of metrics, regardless of which
tables their source data lives in.
%%ql
select revenue
Revenue | |
---|---|
0 | $2,328.60 |
%%ql
select revenue, n_customers, unique_paying_customers
Revenue | Number of Customers | Unique Paying Customers | |
---|---|---|---|
0 | $2,328.60 | 59 | 59 |
Dictum names the columns after the human-readable metrics names defined in the metric
config. If you want to override them, set an alias with as
keyword.
%%ql
select revenue as "Total Revenue (all time)"
Total Revenue (all time) | |
---|---|
0 | $2,328.60 |
Selecting dimensions¶
To add dimensions, list the IDs after the group by
keyword. The group
part is optional.
Aliases work the same way as with metrics.
%%ql
select revenue
by media_type as "Media File's Type"
Media File's Type | Revenue | |
---|---|---|
0 | AAC audio file | $2.97 |
1 | MPEG audio file | $1,956.24 |
2 | Protected AAC audio file | $144.54 |
3 | Protected MPEG-4 video file | $220.89 |
4 | Purchased AAC audio file | $3.96 |
Filtering¶
Filtering can be done by adding a where
clause to the query. Each item in the clause
must be a dimension reference (possibly transformed) with a boolean value.
For example, there's a dimension called music
that's already boolean:
%%ql
select revenue
where music
by media_type
Media Type | Revenue | |
---|---|---|
0 | AAC audio file | $2.97 |
1 | MPEG audio file | $1,956.24 |
2 | Protected AAC audio file | $144.54 |
3 | Purchased AAC audio file | $3.96 |
Dimension transforms¶
What if you want to see revenue for files that are not music? You can special functions
called transforms to modify how a dimension behaves. Normally transforms look like object
methods (separated by .
symbol).
not
transform inverts a boolean value, True
becomes False
and vice versa. Parentheses
are optional if there are no arguments. music.not()
and music.not
are the same thing.
For the full list of transforms, see Query Language Reference
%%ql
select revenue
where music.not
by genre
Genre | Revenue | |
---|---|---|
0 | Comedy | $17.91 |
1 | Drama | $57.71 |
2 | Sci Fi & Fantasy | $39.80 |
3 | Science Fiction | $11.94 |
4 | TV Shows | $93.53 |
Transforms can be used in the group by
clause the same way. year
transform extracts
the integer year part from a date. Similar transforms exist for quarter
, month
,
week
(number in year), day
(of month), hour
, minute
and second
.
%%ql
select revenue
by invoice_date.year as "Invoice Date (Year)"
Invoice Date (Year) | Revenue | |
---|---|---|
0 | 2009 | $449.46 |
1 | 2010 | $481.45 |
2 | 2011 | $469.58 |
3 | 2012 | $477.53 |
4 | 2013 | $450.58 |
If you want to filter by year, more transforms can be chained.
eq
filter checks if the dimension value is equal to its' argument. For example,
invoice_date.year.eq(2012)
will aggregate only the data for the year 2012. This looks
quite ugly, so the last transform, if it's a common comparison operator, can be written
like in SQL: invoice_date.year = 2012
.
Similar transforms exist for other comparison operators:
ne
(not equals) for<>
gt
for>
ge
for>=
lt
for<
le
for<=
%%ql
select revenue
where invoice_date.year >= 2012
by invoice_date.year as "Year",
invoice_date.quarter as "Quarter"
Year | Quarter | Revenue | |
---|---|---|---|
0 | 2012 | 1 | $112.86 |
1 | 2012 | 2 | $112.86 |
2 | 2012 | 3 | $133.95 |
3 | 2012 | 4 | $117.86 |
4 | 2013 | 1 | $102.96 |
5 | 2013 | 2 | $108.90 |
6 | 2013 | 3 | $112.86 |
7 | 2013 | 4 | $125.86 |
Metric transforms¶
Sometimes you might want to change the behaviour of metrics too. For example, calculate percentages instead of absolute values. This can be done with metric transforms.
To see a list of all metric transforms, see Query Language Reference.
Let's see what we can do with percent
transform.
%%ql -r percent
select revenue, revenue.percent as "Percent of Revenue"
by media_type
Returned 5 rows to percent
percent
media_type | revenue | Percent of Revenue | |
---|---|---|---|
0 | AAC audio file | 2.97 | 0.001275 |
1 | MPEG audio file | 1956.24 | 0.840093 |
2 | Protected AAC audio file | 144.54 | 0.062072 |
3 | Protected MPEG-4 video file | 220.89 | 0.094860 |
4 | Purchased AAC audio file | 3.96 | 0.001701 |
percent.iloc[:,-1].sum() # the data in the last column adds up to 1
1.0
What if we want to calculate percentage of Revenue
by Media Type
within each year?
If we just use revenue.percent
and group by media_type, invoice_date.year
, all rows
will add up to 100%, not all rows within the year. To control what should add up to 100%,
there's a special syntax.
%%ql -fr percent_within_year
select revenue.percent within (invoice_date.year) as "% of Revenue within Year"
by invoice_date.year as "Year", media_type
Returned 19 rows to percent_within_year
percent_within_year.pivot(
index="Media Type",
columns="Year",
).fillna("0%")
% of Revenue within Year | |||||
---|---|---|---|---|---|
Year | 2009 | 2010 | 2011 | 2012 | 2013 |
Media Type | |||||
AAC audio file | 0% | 0% | 0% | 0% | 0% |
MPEG audio file | 96% | 78% | 77% | 78% | 93% |
Protected AAC audio file | 4% | 9% | 9% | 7% | 2% |
Protected MPEG-4 video file | 0% | 13% | 14% | 15% | 6% |
Purchased AAC audio file | 0% | 1% | 0% | 0% | 0% |
Another keyword that controls how percentages are calculated is of
. Using it together
with within
, you can calculate percentages for different groupings of your query
independently.
The only limitation is that all dimension expressions used in of
and within
must also
be present in group by
.
%%ql -fr quarters
select revenue.percent of (invoice_date.quarter) within (invoice_date.year) as "% of Quarter within Year"
by invoice_date.year as Year,
invoice_date.quarter as Quarter,
invoice_date.month as Month
Returned 60 rows to quarters
quarters.pivot(
columns=("Quarter", "Month"),
index="Year",
)
% of Quarter within Year | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Quarter | 1 | 2 | 3 | 4 | ||||||||
Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Year | ||||||||||||
2009 | 25% | 25% | 25% | 25% | 25% | 25% | 25% | 25% | 25% | 25% | 25% | 25% |
2010 | 30% | 30% | 30% | 23% | 23% | 23% | 23% | 23% | 23% | 23% | 23% | 23% |
2011 | 24% | 24% | 24% | 31% | 31% | 31% | 24% | 24% | 24% | 21% | 21% | 21% |
2012 | 24% | 24% | 24% | 24% | 24% | 24% | 28% | 28% | 28% | 25% | 25% | 25% |
2013 | 23% | 23% | 23% | 24% | 24% | 24% | 25% | 25% | 25% | 28% | 28% | 28% |
As you can see, the value is the same for each month in the quarter. The query returned exactly what we asked for: what is the percentage of a particular quarter within a particular year, regardless of other dimensions we might have requested.
total
transform works in a similar way, but outputs the absolute value on the specified
level of detail:
%%ql
select revenue,
revenue.total as "Grand Total",
revenue.total within (Year) as "Total within Year",
revenue.total of (Year) as "Total of Year"
where Time.year >= 2012
by Year, Quarter
Year | Quarter | Revenue | Grand Total | Total within Year | Total of Year | |
---|---|---|---|---|---|---|
0 | 2012 | Q1 2012 | $112.86 | $928.11 | $477.53 | $477.53 |
1 | 2012 | Q2 2012 | $112.86 | $928.11 | $477.53 | $477.53 |
2 | 2012 | Q3 2012 | $133.95 | $928.11 | $477.53 | $477.53 |
3 | 2012 | Q4 2012 | $117.86 | $928.11 | $477.53 | $477.53 |
4 | 2013 | Q1 2013 | $102.96 | $928.11 | $450.58 | $450.58 |
5 | 2013 | Q2 2013 | $108.90 | $928.11 | $450.58 | $450.58 |
6 | 2013 | Q3 2013 | $112.86 | $928.11 | $450.58 | $450.58 |
7 | 2013 | Q4 2013 | $125.86 | $928.11 | $450.58 | $450.58 |
of
and within
for the total
mean the same thing. Total within a year is the total
value of a metric calculated as if there are no other dimensions involved but the year.
The same thing can be said about total of year.
Top-K queries¶
Just like dimension transforms can be used both in groupings and filters, metric transforms can be used in filters too. The concept is the same: there are metric transforms that output a boolean value, that can be the used to filter the result set.
One of the most useful boolean metric transforms is top
.
%%ql
select revenue
by genre
limit revenue.top(5)
Genre | Revenue | |
---|---|---|
0 | Alternative & Punk | $241.56 |
1 | Latin | $382.14 |
2 | Metal | $261.36 |
3 | Rock | $826.65 |
4 | TV Shows | $93.53 |
Just like any other metric transform, top
can be used with of
and within
. Let's
calculate which genre was the most popular within each year.
%%ql
select revenue, items_sold
by Year, album
limit revenue.top(1) within (Year)
Year | Album | Revenue | Number of Items Sold | |
---|---|---|---|---|
0 | 2009 | Minha Historia | $10.89 | 11 |
1 | 2010 | Acústico | $8.91 | 9 |
2 | 2011 | Battlestar Galactica (Classic), Season 1 | $17.91 | 9 |
3 | 2012 | The Office, Season 3 | $21.89 | 11 |
4 | 2013 | Up An' Atom | $9.90 | 10 |
of
and within
can be combined to construct more complex "tops".
We can calculate top-5 countries that gave us the most revenue and then see which city was the leader per country.
%%ql
select revenue,
revenue.percent of (customer_city) as "% Revenue: City in Country",
revenue.total within (customer_country) as "Total Revenue: Country"
by customer_country, customer_city
limit revenue.top(5) of (customer_country),
revenue.top(1) of (customer_city) within (customer_country)
Customer Country | Customer City | Revenue | % Revenue: City in Country | Total Revenue: Country | |
---|---|---|---|---|---|
0 | Brazil | São Paulo | $75.24 | 40% | $190.10 |
1 | Canada | Montréal | $39.62 | 13% | $303.96 |
2 | France | Paris | $77.24 | 40% | $195.10 |
3 | Germany | Berlin | $75.24 | 48% | $156.48 |
4 | USA | Mountain View | $77.24 | 15% | $523.06 |
It's important to notice that total revenue within country is not affected by the limit.
The metric transform is computed before top
is applied, so you can actually see the
value that contributed to the country being in the top-5.