By Clark Cutler

Recently, I have been working with analytics in PostgreSQL. Often I hit the database through Ruby on Rails’ ActiveRecord ORM. But with the tens of millions of rows I needed to crunch for analytics, I achieved significantly better performance and clearer code by writing raw SQL.

Here’s a compilation of things I learned or used frequently.

`\?` |
list available commands |

`\l` |
list databases |

`\c ` |
connect to a database |

`\d` |
list tables in the current database |

`\d ` |
list columns in the provided table |

`\x` |
toggle extended format, which gives each value more room: its own line instead of the typical tabular display |

`\pset pager (off|on)` |
turn off paging, to print all results without the need for manual scrolling |

`floor(` |
round down to an integer |

`round(` |
round up or down to the nearest integer |

`round(` |
round to the specified number of decimal places |

`count(` |
the number of rows where expression is not null |

`count(` |
the number of rows where expression is true |

`bool_or(` |
true if expression is true for any row, false otherwise |

Note: This implementation does not provide the true median since, for simplicity, it does not average the two middle values when the row count is even.

```
```SELECT
MAX(column) as "Median of Column"
FROM
(
SELECT
column,
ntile(2) OVER (ORDER BY column) AS bucket
FROM
table
) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

An aggregate function yields a single value from a bunch of rows. For example, I can find out how many rows are in the `events`

table:

```
```SELECT COUNT(*) FROM events;
count
---------
2732857

When used with a `GROUP BY`

, aggregate functions apply to each group separately. For example, I can find out how many rows there are for each `type`

of event in the `events`

table:

```
```SELECT type, COUNT(*) FROM events GROUP BY type;
type | count
----------+--------
sign-up | 683214
visit | 2049643

The `COUNT(`

aggregate function counts only those rows for which expression is not null. With *expression*)`COUNT(`

, we can count only those rows for which a boolean expression is true, since *boolean-expression* OR NULL)

evaluates to null when the boolean expression is false:*boolean-expression* OR NULL

```
```SELECT
user_id,
COUNT(type = 'purchase-see-offer' OR NULL),
COUNT(type = 'purchase-begin' OR NULL),
COUNT(type = 'purchase-complete' OR NULL)
FROM events
GROUP BY user_id

```
``` user_id | count | count | count
---------+-------+-------+--------
134294 | 14 | 0 | 0
443291 | 12 | 2 | 1

The `bool_or(`

aggregate returns true if expression is true for any row, false otherwise.*expression*)

```
```SELECT
user_id,
BOOL_OR(type = 'purchase-see-offer'),
BOOL_OR(type = 'purchase-begin'),
BOOL_OR(type = 'purchase-complete')
FROM events
GROUP BY user_id

```
``` user_id | count | count | count
---------+-------+-------+--------
134294 | t | f | f
443291 | t | t | t

Suppose we want the median of

, and that we happen to have a table ordered by *column*

with a *column*`bucket`

column that is 1 for rows in the first half of the table and 2 for rows in the second half.

```
```SELECT column, bucket FROM t ORDER BY column;
column | bucket
--------+--------
5 | 1
14 | 1
27 | 1
39 | 2
106 | 2

From such a table we can get the median by grouping by bucket, and then selecting the maximum value of bucket 1:

```
```SELECT max(column) FROM t WHERE bucket = 1 GROUP BY bucket;
max
-----
27

Of course, we don't have tables with a magical bucket column. But Postgres' built-in `ntile(`

window function will make it for us.*number_of_buckets*)

Like an aggregate function, a window function computes values by looking at a bunch of rows. It can look at the whole table, in fact, and will unless some kind of partitioning is specified. However, unlike an aggregate function, it produces a value for each row. It doesn't group.

For example, Postgres provides a built-in window function to number rows:

```
```SELECT column, row_number() OVER () FROM t;
column | row_number
--------+------------
5 | 1
106 | 2
14 | 3
39 | 4
27 | 5

And I can use any aggregate function as a window function by providing an over clause. For example, to include a column's maximum value in each row:

```
```SELECT column, max(column) OVER () FROM t;
column | max
--------+-----
5 | 106
106 | 106
14 | 106
39 | 106
27 | 106

And finally, there's the window function we're interested in: `ntile`

. It takes a number of buckets as arguments. And it numbers rows in a way that breaks them up as evenly as possible into the specified number of buckets:

```
```SELECT column, ntile(2) OVER () FROM t;
column | ntile
--------+-------
5 | 1
106 | 1
14 | 1
39 | 2
27 | 2

The one missing piece for our median is making sure the rows are ordered by the column we're trying to find the median of. It's easy to overlook, and would ruin the median calculation highlighted above. The remedy: provide an `ORDER BY`

within the `OVER`

clause:

```
```SELECT column, ntile(2) OVER (ORDER BY column) FROM t;
column | ntile
--------+-------
5 | 1
14 | 1
27 | 1
39 | 2
106 | 2

```
```SELECT
max(column) AS median
FROM
(
SELECT
column,
ntile(2) OVER (ORDER BY column) AS bucket
FROM
table
) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

Interested in a Career at Carbon Five? Check out our job openings.