Recently, I have been working with analytics in PostgreSQL. Often I hit the database through Ruby on Rails’ ActiveRecord ORM. But with the tens of millions of rows I needed to crunch for analytics, I achieved significantly better performance and clearer code by writing raw SQL.
Here’s a compilation of things I learned or used frequently.
\? |
list available commands |
\l |
list databases |
\c database |
connect to a database |
\d |
list tables in the current database |
\d table |
list columns in the provided table |
\x |
toggle extended format, which gives each value more room: its own line instead of the typical tabular display |
\pset pager (off|on) |
turn off paging, to print all results without the need for manual scrolling |
floor(num) |
round down to an integer |
round(num) |
round up or down to the nearest integer |
round(num, decimal_places) |
round to the specified number of decimal places |
count(expression) |
the number of rows where expression is not null |
count(expression OR NULL) |
the number of rows where expression is true |
bool_or(expression) |
true if expression is true for any row, false otherwise |
Note: This implementation does not provide the true median since, for simplicity, it does not average the two middle values when the row count is even.
SELECT MAX(column) as "Median of Column" FROM ( SELECT column, ntile(2) OVER (ORDER BY column) AS bucket FROM table ) as t WHERE bucket = 1 GROUP BY bucket;
Official Documentation about window functions
Official Documentation of built-in window functions
An aggregate function yields a single value from a bunch of rows. For example, I can find out how many rows are in the events
table:
SELECT COUNT(*) FROM events; count --------- 2732857
When used with a GROUP BY
, aggregate functions apply to each group separately. For example, I can find out how many rows there are for each type
of event in the events
table:
SELECT type, COUNT(*) FROM events GROUP BY type; type | count ----------+-------- sign-up | 683214 visit | 2049643
The COUNT(expression)
aggregate function counts only those rows for which expression is not null. With COUNT(boolean-expression OR NULL)
, we can count only those rows for which a boolean expression is true, since boolean-expression OR NULL
evaluates to null when the boolean expression is false:
SELECT user_id, COUNT(type = 'purchase-see-offer' OR NULL), COUNT(type = 'purchase-begin' OR NULL), COUNT(type = 'purchase-complete' OR NULL) FROM events GROUP BY user_id
user_id | count | count | count ---------+-------+-------+-------- 134294 | 14 | 0 | 0 443291 | 12 | 2 | 1
The bool_or(expression)
aggregate returns true if expression is true for any row, false otherwise.
SELECT user_id, BOOL_OR(type = 'purchase-see-offer'), BOOL_OR(type = 'purchase-begin'), BOOL_OR(type = 'purchase-complete') FROM events GROUP BY user_id
user_id | count | count | count ---------+-------+-------+-------- 134294 | t | f | f 443291 | t | t | t
Suppose we want the median of column
, and that we happen to have a table ordered by column
with a bucket
column that is 1 for rows in the first half of the table and 2 for rows in the second half.
SELECT column, bucket FROM t ORDER BY column; column | bucket --------+-------- 5 | 1 14 | 1 27 | 1 39 | 2 106 | 2
From such a table we can get the median by grouping by bucket, and then selecting the maximum value of bucket 1:
SELECT max(column) FROM t WHERE bucket = 1 GROUP BY bucket; max ----- 27
Of course, we don't have tables with a magical bucket column. But Postgres' built-in ntile(number_of_buckets)
window function will make it for us.
Like an aggregate function, a window function computes values by looking at a bunch of rows. It can look at the whole table, in fact, and will unless some kind of partitioning is specified. However, unlike an aggregate function, it produces a value for each row. It doesn't group.
For example, Postgres provides a built-in window function to number rows:
SELECT column, row_number() OVER () FROM t; column | row_number --------+------------ 5 | 1 106 | 2 14 | 3 39 | 4 27 | 5
And I can use any aggregate function as a window function by providing an over clause. For example, to include a column's maximum value in each row:
SELECT column, max(column) OVER () FROM t; column | max --------+----- 5 | 106 106 | 106 14 | 106 39 | 106 27 | 106
And finally, there's the window function we're interested in: ntile
. It takes a number of buckets as arguments. And it numbers rows in a way that breaks them up as evenly as possible into the specified number of buckets:
SELECT column, ntile(2) OVER () FROM t; column | ntile --------+------- 5 | 1 106 | 1 14 | 1 39 | 2 27 | 2
The one missing piece for our median is making sure the rows are ordered by the column we're trying to find the median of. It's easy to overlook, and would ruin the median calculation highlighted above. The remedy: provide an ORDER BY
within the OVER
clause:
SELECT column, ntile(2) OVER (ORDER BY column) FROM t; column | ntile --------+------- 5 | 1 14 | 1 27 | 1 39 | 2 106 | 2
SELECT max(column) AS median FROM ( SELECT column, ntile(2) OVER (ORDER BY column) AS bucket FROM table ) as t WHERE bucket = 1 GROUP BY bucket;
Official Documentation about window functions
Official Documentation of built-in window functions