PostgreSQL: Aggregates, Medians, and a Brief Command Reference

Recently, I have been working with analytics in PostgreSQL. Often I hit the database through Ruby on Rails’ ActiveRecord ORM. But with the tens of millions of rows I needed to crunch for analytics, I achieved significantly better performance and clearer code by writing raw SQL.

Here’s a compilation of things I learned or used frequently.

Reference / tl;dr

Basic Commands

\? list available commands
\l list databases
\c database connect to a database
\d list tables in the current database
\d table list columns in the provided table

Handy Commands

\x toggle extended format, which gives each value more room: its own line instead of the typical tabular display
\pset pager (off|on) turn off paging, to print all results without the need for manual scrolling

Math

floor(num) round down to an integer
round(num) round up or down to the nearest integer
round(num, decimal_places) round to the specified number of decimal places

Official Documentation

Aggregates

count(expression) the number of rows where expression is not null
count(expression OR NULL) the number of rows where expression is true
bool_or(expression) true if expression is true for any row, false otherwise

Official Documentation

(Approximate) Median

Note: This implementation does not provide the true median since, for simplicity, it does not average the two middle values when the row count is even.

SELECT
  MAX(column) as "Median of Column"
FROM
  (
    SELECT
      column,
      ntile(2) OVER (ORDER BY column) AS bucket
    FROM
      table
  ) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

Details

Aggregates

Aggregating Over A Table

An aggregate function yields a single value from a bunch of rows. For example, I can find out how many rows are in the events table:

SELECT COUNT(*) FROM events;
  count 
---------
 2732857 

Aggregating Over Groups

When used with a GROUP BY, aggregate functions apply to each group separately. For example, I can find out how many rows there are for each type of event in the events table:

SELECT type, COUNT(*) FROM events GROUP BY type;
   type   | count
----------+--------
 sign-up  | 683214
 visit    | 2049643

Counting Rows That Satisfy A Condition

The COUNT(expression) aggregate function counts only those rows for which expression is not null. With COUNT(boolean-expression OR NULL), we can count only those rows for which a boolean expression is true, since boolean-expression OR NULL evaluates to null when the boolean expression is false:

SELECT
  user_id,
  COUNT(type = 'purchase-see-offer' OR NULL),
  COUNT(type = 'purchase-begin' OR NULL),
  COUNT(type = 'purchase-complete' OR NULL)
FROM events
GROUP BY user_id


 user_id | count | count |  count 
---------+-------+-------+--------
 134294  |    14 |     0 |      0 
 443291  |    12 |     2 |      1 

Determining Whether Any Row Satisfies A Condition

The bool_or(expression) aggregate returns true if expression is true for any row, false otherwise.

SELECT
  user_id,
  BOOL_OR(type = 'purchase-see-offer'),
  BOOL_OR(type = 'purchase-begin'),
  BOOL_OR(type = 'purchase-complete')
FROM events
GROUP BY user_id


 user_id | count | count |  count 
---------+-------+-------+--------
 134294  |     t |     f |      f 
 443291  |     t |     t |      t 

Official Documentation

Median and Window Functions

Suppose we want the median of column, and that we happen to have a table ordered by column with a bucket column that is 1 for rows in the first half of the table and 2 for rows in the second half.

SELECT column, bucket FROM t ORDER BY column;
 column | bucket
--------+--------
      5 |      1
     14 |      1
     27 |      1
     39 |      2
    106 |      2

From such a table we can get the median by grouping by bucket, and then selecting the maximum value of bucket 1:

SELECT max(column) FROM t WHERE bucket = 1 GROUP BY bucket;
 max
-----
  27

Of course, we don't have tables with a magical bucket column. But Postgres' built-in ntile(number_of_buckets) window function will make it for us.

Window Functions

Like an aggregate function, a window function computes values by looking at a bunch of rows. It can look at the whole table, in fact, and will unless some kind of partitioning is specified. However, unlike an aggregate function, it produces a value for each row. It doesn't group.

For example, Postgres provides a built-in window function to number rows:

SELECT column, row_number() OVER () FROM t;
 column | row_number 
--------+------------
      5 |          1 
    106 |          2 
     14 |          3 
     39 |          4 
     27 |          5 

And I can use any aggregate function as a window function by providing an over clause. For example, to include a column's maximum value in each row:

SELECT column, max(column) OVER () FROM t;
 column | max
--------+-----
      5 | 106 
    106 | 106 
     14 | 106 
     39 | 106 
     27 | 106 

And finally, there's the window function we're interested in: ntile. It takes a number of buckets as arguments. And it numbers rows in a way that breaks them up as evenly as possible into the specified number of buckets:

SELECT column, ntile(2) OVER () FROM t;
 column | ntile
--------+-------
      5 |     1
    106 |     1
     14 |     1
     39 |     2
     27 |     2

Order By

The one missing piece for our median is making sure the rows are ordered by the column we're trying to find the median of. It's easy to overlook, and would ruin the median calculation highlighted above. The remedy: provide an ORDER BY within the OVER clause:

SELECT column, ntile(2) OVER (ORDER BY column) FROM t;
 column | ntile
--------+-------
      5 |     1
     14 |     1
     27 |     1
     39 |     2
    106 |     2

Putting It All Together

SELECT
  max(column) AS median
FROM
  (
    SELECT
      column,
      ntile(2) OVER (ORDER BY column) AS bucket
    FROM
      table
  ) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

This entry was posted in Database and tagged . Bookmark the permalink.

2 Responses to PostgreSQL: Aggregates, Medians, and a Brief Command Reference

  1. Pingback: ClearCove Software Inc. – Great summary for doing data analysis in PostgreSQL

  2. Victor Ivanov says:

    Nice post!

    Can you explain why you used “group by bucket”, looks like it is not necessary here.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>