Window Functions Explained Without the Headache

Developer & creator of pgref.dev · 13 min read

window-functionsrow_numberrank

Window functions clicked for me the day I stopped reading tutorials and started with a problem: I needed the second-highest salary in each department. Not the highest — the second. A subquery approach worked but required two passes through the data. A window function did it in one. That's where we'll start.

The dataset

Every example in this post uses the same table. No switching datasets mid-article.

CREATE TABLE employees (
  id INT PRIMARY KEY,
  name TEXT,
  department TEXT,
  salary NUMERIC(10,2),
  hire_date DATE
);

INSERT INTO employees VALUES
  (1,  'Alice Chen',    'Engineering', 95000, '2019-03-15'),
  (2,  'Bob Kowalski',  'Engineering', 88000, '2020-07-22'),
  (3,  'Carmen Diaz',   'Engineering', 95000, '2018-11-01'),
  (4,  'David Park',    'Engineering', 72000, '2022-01-10'),
  (5,  'Emma Fischer',  'Marketing',   68000, '2020-04-30'),
  (6,  'Frank Osei',    'Marketing',   71000, '2019-09-12'),
  (7,  'Grace Liu',     'Marketing',   68000, '2021-06-15'),
  (8,  'Henry Moore',   'Sales',       82000, '2018-02-28'),
  (9,  'Isla MacLeod',  'Sales',       79000, '2019-10-05'),
  (10, 'James Okonkwo', 'Sales',       82000, '2020-12-01');

ROW_NUMBER — the gateway

The problem: get the highest-paid employee in each department.

SELECT * FROM (
  SELECT
    *,
    row_number() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
  FROM employees
) ranked
WHERE rn = 1;

How to read row_number() OVER (PARTITION BY department ORDER BY salary DESC):

OVER tells PostgreSQL this is a window function, not a regular aggregate
PARTITION BY department means "restart the counter for each department" — like sorting papers into piles, one pile per department
ORDER BY salary DESC means "within each pile, sort by salary descending, then number the rows"

Notice Alice gets rank 1 in Engineering even though Carmen has the same salary. With row_number(), ties are broken arbitrarily unless you add a tiebreaker: ORDER BY salary DESC, id ASC.

ROW_NUMBER vs RANK vs DENSE_RANK

SELECT
  name, department, salary,
  row_number() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
  rank()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
  dense_rank() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees
WHERE department = 'Engineering'
ORDER BY salary DESC;

Result:

    name      | salary  | row_num | rnk | dense_rnk
--------------+---------+---------+-----+-----------
 Alice Chen   | 95000   |    1    |  1  |     1
 Carmen Diaz  | 95000   |    2    |  1  |     1
 Bob Kowalski | 88000   |    3    |  3  |     2
 David Park   | 72000   |    4    |  4  |     3

row_number(): always unique — Alice is 1, Carmen is 2, even though they're tied
rank(): tied rows get the same rank and the next rank skips — there's no rank 2
dense_rank(): tied rows get the same rank, but the next rank is consecutive

For deduplication, always use row_number(). For "second highest salary" use dense_rank() — two people tying for first should still show a second place.

The frame clause — where most tutorials lose people

Every window function has a frame: the set of rows it operates over within its partition. The default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Running total of salary by hire date:

SELECT
  name, hire_date, salary,
  SUM(salary) OVER (ORDER BY hire_date) AS running_total
FROM employees
ORDER BY hire_date;

For a 3-row moving average using ROWS (physical rows, not value ranges):

SELECT
  name, hire_date, salary,
  AVG(salary) OVER (
    ORDER BY hire_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg_3
FROM employees
ORDER BY hire_date;

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: for each row, look at this row and the two rows before it by position. Use ROWS when you care about physical row position. Use RANGE when you care about value equality. For most practical uses (rolling averages, running totals), ROWS is what you want.

LAG and LEAD — looking sideways

LAG(col, n) looks backwards n rows. LEAD(col, n) looks forwards n rows. Month-over-month salary changes:

SELECT
  name, hire_date, salary,
  LAG(salary) OVER (ORDER BY hire_date) AS prev_salary,
  salary - LAG(salary) OVER (ORDER BY hire_date) AS change
FROM employees
ORDER BY hire_date;

The first row has NULL for prev_salary. Provide a default with LAG(salary, 1, 0) OVER (...) — the third argument is what to return when there's no preceding row.

The equivalent self-join is ugly: two table scans, a correlated subquery. Window functions exist partly to avoid this.

The pattern I use most

ROW_NUMBER with a CTE for deduplication. When a table has duplicate rows and I want to keep the most recent one:

WITH ranked AS (
  SELECT
    *,
    row_number() OVER (
      PARTITION BY customer_id
      ORDER BY created_at DESC
    ) AS rn
  FROM orders
)
DELETE FROM orders
WHERE id IN (
  SELECT id FROM ranked WHERE rn > 1
);

This keeps one row per customer_id (the most recent), deletes the rest. Edge case: if two rows have the exact same created_at, which one survives is arbitrary. Add a tiebreaker (ORDER BY created_at DESC, id ASC) to make it deterministic.

Bookmark the pgref.dev window functions page. You'll need it once a week, and you will forget the frame syntax every time. I still look it up.