Recently, I have been working with analytics in PostgreSQL. Often I hit the database through Ruby on Rails’ ActiveRecord ORM. But with the tens of millions of rows I needed to crunch for analytics, I achieved significantly better performance and clearer code by writing raw SQL.
Here’s a compilation of things I learned or used frequently.
Reference / tl;dr
||list available commands|
||connect to a database|
||list tables in the current database|
||list columns in the provided table|
||toggle extended format, which gives each value more room: its own line instead of the typical tabular display|
||turn off paging, to print all results without the need for manual scrolling|
||round down to an integer|
||round up or down to the nearest integer|
||round to the specified number of decimal places|
||the number of rows where expression is not null|
||the number of rows where expression is true|
||true if expression is true for any row, false otherwise|
Note: This implementation does not provide the true median since, for simplicity, it does not average the two middle values when the row count is even.
SELECT MAX(column) as "Median of Column" FROM ( SELECT column, ntile(2) OVER (ORDER BY column) AS bucket FROM table ) as t WHERE bucket = 1 GROUP BY bucket;
Aggregating Over A Table
An aggregate function yields a single value from a bunch of rows. For example, I can find out how many rows are in the
SELECT COUNT(*) FROM events; count --------- 2732857
Aggregating Over Groups
When used with a
GROUP BY, aggregate functions apply to each group separately. For example, I can find out how many rows there are for each
type of event in the
SELECT type, COUNT(*) FROM events GROUP BY type; type | count ----------+-------- sign-up | 683214 visit | 2049643
Counting Rows That Satisfy A Condition
COUNT(expression) aggregate function counts only those rows for which expression is not null. With
COUNT(boolean-expression OR NULL), we can count only those rows for which a boolean expression is true, since
boolean-expression OR NULL evaluates to null when the boolean expression is false:
SELECT user_id, COUNT(type = 'purchase-see-offer' OR NULL), COUNT(type = 'purchase-begin' OR NULL), COUNT(type = 'purchase-complete' OR NULL) FROM events GROUP BY user_id
user_id | count | count | count ---------+-------+-------+-------- 134294 | 14 | 0 | 0 443291 | 12 | 2 | 1
Determining Whether Any Row Satisfies A Condition
bool_or(expression) aggregate returns true if expression is true for any row, false otherwise.
SELECT user_id, BOOL_OR(type = 'purchase-see-offer'), BOOL_OR(type = 'purchase-begin'), BOOL_OR(type = 'purchase-complete') FROM events GROUP BY user_id
user_id | count | count | count ---------+-------+-------+-------- 134294 | t | f | f 443291 | t | t | t
Median and Window Functions
Suppose we want the median of
column, and that we happen to have a table ordered by
column with a
bucket column that is 1 for rows in the first half of the table and 2 for rows in the second half.
SELECT column, bucket FROM t ORDER BY column; column | bucket --------+-------- 5 | 1 14 | 1 27 | 1 39 | 2 106 | 2
From such a table we can get the median by grouping by bucket, and then selecting the maximum value of bucket 1:
SELECT max(column) FROM t WHERE bucket = 1 GROUP BY bucket; max ----- 27
Of course, we don't have tables with a magical bucket column. But Postgres' built-in
ntile(number_of_buckets) window function will make it for us.
Like an aggregate function, a window function computes values by looking at a bunch of rows. It can look at the whole table, in fact, and will unless some kind of partitioning is specified. However, unlike an aggregate function, it produces a value for each row. It doesn't group.
For example, Postgres provides a built-in window function to number rows:
SELECT column, row_number() OVER () FROM t; column | row_number --------+------------ 5 | 1 106 | 2 14 | 3 39 | 4 27 | 5
And I can use any aggregate function as a window function by providing an over clause. For example, to include a column's maximum value in each row:
SELECT column, max(column) OVER () FROM t; column | max --------+----- 5 | 106 106 | 106 14 | 106 39 | 106 27 | 106
And finally, there's the window function we're interested in:
ntile. It takes a number of buckets as arguments. And it numbers rows in a way that breaks them up as evenly as possible into the specified number of buckets:
SELECT column, ntile(2) OVER () FROM t; column | ntile --------+------- 5 | 1 106 | 1 14 | 1 39 | 2 27 | 2
The one missing piece for our median is making sure the rows are ordered by the column we're trying to find the median of. It's easy to overlook, and would ruin the median calculation highlighted above. The remedy: provide an
ORDER BY within the
SELECT column, ntile(2) OVER (ORDER BY column) FROM t; column | ntile --------+------- 5 | 1 14 | 1 27 | 1 39 | 2 106 | 2
Putting It All Together
SELECT max(column) AS median FROM ( SELECT column, ntile(2) OVER (ORDER BY column) AS bucket FROM table ) as t WHERE bucket = 1 GROUP BY bucket;