Recently, I have been working with analytics in PostgreSQL. Often I hit the database through Ruby on Rails’ ActiveRecord ORM. But with the tens of millions of rows I needed to crunch for analytics, I achieved significantly better performance and clearer code by writing raw SQL.

Here’s a compilation of things I learned or used frequently.

## Reference / tl;dr

### Basic Commands

`\?` |
list available commands |

`\l` |
list databases |

`\c ` |
connect to a database |

`\d` |
list tables in the current database |

`\d ` |
list columns in the provided table |

### Handy Commands

`\x` |
toggle extended format, which gives each value more room: its own line instead of the typical tabular display |

`\pset pager (off|on)` |
turn off paging, to print all results without the need for manual scrolling |

### Math

`floor(` |
round down to an integer |

`round(` |
round up or down to the nearest integer |

`round(` |
round to the specified number of decimal places |

### Aggregates

`count(` |
the number of rows where expression is not null |

`count(` |
the number of rows where expression is true |

`bool_or(` |
true if expression is true for any row, false otherwise |

### (Approximate) Median

Note: This implementation does not provide the true median since, for simplicity, it does not average the two middle values when the row count is even.

```
```SELECT
MAX(column) as "Median of Column"
FROM
(
SELECT
column,
ntile(2) OVER (ORDER BY column) AS bucket
FROM
table
) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

## Details

### Aggregates

#### Aggregating Over A Table

An aggregate function yields a single value from a bunch of rows. For example, I can find out how many rows are in the `events`

table:

```
```SELECT COUNT(*) FROM events;
count
---------
2732857

#### Aggregating Over Groups

When used with a `GROUP BY`

, aggregate functions apply to each group separately. For example, I can find out how many rows there are for each `type`

of event in the `events`

table:

```
```SELECT type, COUNT(*) FROM events GROUP BY type;
type | count
----------+--------
sign-up | 683214
visit | 2049643

#### Counting Rows That Satisfy A Condition

The `COUNT(`

aggregate function counts only those rows for which expression is not null. With *expression*)`COUNT(`

, we can count only those rows for which a boolean expression is true, since *boolean-expression* OR NULL)

evaluates to null when the boolean expression is false:*boolean-expression* OR NULL

```
```SELECT
user_id,
COUNT(type = 'purchase-see-offer' OR NULL),
COUNT(type = 'purchase-begin' OR NULL),
COUNT(type = 'purchase-complete' OR NULL)
FROM events
GROUP BY user_id

```
``` user_id | count | count | count
---------+-------+-------+--------
134294 | 14 | 0 | 0
443291 | 12 | 2 | 1

#### Determining Whether Any Row Satisfies A Condition

The `bool_or(`

aggregate returns true if expression is true for any row, false otherwise.*expression*)

```
```SELECT
user_id,
BOOL_OR(type = 'purchase-see-offer'),
BOOL_OR(type = 'purchase-begin'),
BOOL_OR(type = 'purchase-complete')
FROM events
GROUP BY user_id

```
``` user_id | count | count | count
---------+-------+-------+--------
134294 | t | f | f
443291 | t | t | t

### Median and Window Functions

Suppose we want the median of

, and that we happen to have a table ordered by *column*

with a *column*`bucket`

column that is 1 for rows in the first half of the table and 2 for rows in the second half.

```
```SELECT column, bucket FROM t ORDER BY column;
column | bucket
--------+--------
5 | 1
14 | 1
27 | 1
39 | 2
106 | 2

From such a table we can get the median by grouping by bucket, and then selecting the maximum value of bucket 1:

```
```SELECT max(column) FROM t WHERE bucket = 1 GROUP BY bucket;
max
-----
27

Of course, we don't have tables with a magical bucket column. But Postgres' built-in `ntile(`

window function will make it for us.*number_of_buckets*)

#### Window Functions

Like an aggregate function, a window function computes values by looking at a bunch of rows. It can look at the whole table, in fact, and will unless some kind of partitioning is specified. However, unlike an aggregate function, it produces a value for each row. It doesn't group.

For example, Postgres provides a built-in window function to number rows:

```
```SELECT column, row_number() OVER () FROM t;
column | row_number
--------+------------
5 | 1
106 | 2
14 | 3
39 | 4
27 | 5

And I can use any aggregate function as a window function by providing an over clause. For example, to include a column's maximum value in each row:

```
```SELECT column, max(column) OVER () FROM t;
column | max
--------+-----
5 | 106
106 | 106
14 | 106
39 | 106
27 | 106

And finally, there's the window function we're interested in: `ntile`

. It takes a number of buckets as arguments. And it numbers rows in a way that breaks them up as evenly as possible into the specified number of buckets:

```
```SELECT column, ntile(2) OVER () FROM t;
column | ntile
--------+-------
5 | 1
106 | 1
14 | 1
39 | 2
27 | 2

##### Order By

The one missing piece for our median is making sure the rows are ordered by the column we're trying to find the median of. It's easy to overlook, and would ruin the median calculation highlighted above. The remedy: provide an `ORDER BY`

within the `OVER`

clause:

```
```SELECT column, ntile(2) OVER (ORDER BY column) FROM t;
column | ntile
--------+-------
5 | 1
14 | 1
27 | 1
39 | 2
106 | 2

#### Putting It All Together

```
```SELECT
max(column) AS median
FROM
(
SELECT
column,
ntile(2) OVER (ORDER BY column) AS bucket
FROM
table
) as t
WHERE bucket = 1
GROUP BY bucket;

Official Documentation about window functions

Official Documentation of built-in window functions

Pingback: ClearCove Software Inc. – Great summary for doing data analysis in PostgreSQL

Nice post!

Can you explain why you used “group by bucket”, looks like it is not necessary here.