SQL Window Functions

We will be using the following table for the examples in this section:

Employees Table
CREATE TABLE IF NOT EXISTS employees (
emp_id INTEGER PRIMARY KEY,
emp_name VARCHAR(100) NOT NULL,
emp_role VARCHAR(50) NOT NULL,
years_employed SMALLINT,
dept VARCHAR(50) NOT NULL,
salary INTEGER
);

INSERT INTO employees_full VALUES (1,'Becky A.', 'Software Engineer', 7, 'R&D',150000);
INSERT INTO employees_full VALUES (2,'Dan B.', 'Software Engineer', 7, 'R&D',150000);
INSERT INTO employees_full VALUES (3,'Sharon F.', 'Software Engineer', 7, 'R&D',160000);
INSERT INTO employees_full VALUES (4,'Dan M.', 'Software Engineer', 7, 'R&D',165000);
INSERT INTO employees_full VALUES (5,'Malcom S.', 'Software Engineer', 7, 'R&D',170000);
INSERT INTO employees_full VALUES (6,'Tylar S.', 'Data Scientist', 7, 'Ad Tech',150000);
INSERT INTO employees_full VALUES (7,'Sherman D.', '3D Artist', 6, 'Ads Visuals',120000);
INSERT INTO employees_full VALUES (8,'Jakob J.', '3D Artist', 6, 'Ads Visuals',120000);
INSERT INTO employees_full VALUES (9,'Lillia A.', 'Data Scientist', 5, 'Ad Tech',170000);
INSERT INTO employees_full VALUES (10,'Brandon J.', 'Data Scientist', 5, 'Ad Tech',170000);
INSERT INTO employees_full VALUES (11,'Scott K.', 'Manager', 5, 'Ad Tech',175000);
INSERT INTO employees_full VALUES (12,'Shirlee M.', 'Manager', 3, 'R&D',185000);
INSERT INTO employees_full VALUES (13,'Yancy I.', '3D Artist', 3, 'Ads Visuals',85000);
INSERT INTO employees_full VALUES (14,'Oliver P.', '3D Artist', 3, 'Ads Visuals',80000);
INSERT INTO employees_full VALUES (15,'Sedar S.', 'Data Scientist', 3, 'Ad Tech',165000);

Output:

emp_idemp_nameemp_roleyears_employeddeptsalary

1

Becky A.

Software Engineer

7

R&D

150000

2

Dan B.

Software Engineer

7

R&D

150000

3

Sharon F.

Software Engineer

7

R&D

160000

4

Dan M.

Software Engineer

7

R&D

165000

5

Malcom S.

Software Engineer

7

R&D

170000

6

Tylar S.

Data Scientist

7

Ad Tech

150000

7

Sherman D.

3D Artist

6

Ads Visuals

120000

8

Jakob J.

3D Artist

6

Ads Visuals

120000

9

Lillia A.

Data Scientist

5

Ad Tech

170000

10

Brandon J.

Data Scientist

5

Ad Tech

170000

11

Scott K.

Manager

5

Ads Visuals

175000

12

Shirlee M.

Manager

3

R&D

185000

13

Yancy I.

3D Artist

3

Ads Visuals

85000

14

Oliver P.

3D Artist

3

Ads Visuals

80000

15

Sedar S.

Data Scientist

3

Ad Tech

165000

1. What is a Window Function?

Definition by PostgreSQL

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

2. WINDOWs with REGULAR/AGGREGATE Functions


-- syntax: RegFunc(columnX) OVER (PARTITION BY columnY ORDER BY columnZ)
-- example:
COUNT(*) OVER (PARTITION BY column1 ORDER BY column2)

The following query creates several new columns and display metrics, including total number of employees, total salary paid to employees, average salary, and so on, next to each row of employee salary column with respect to their departments.

SELECT
    dept,
    emp_id,
    salary,
    COUNT(emp_id) OVER (PARTITION BY dept) AS num_employees_by_dept,
    SUM(salary) OVER (PARTITION BY dept) AS total_salary_by_dept,
    ROUND(AVG(salary) OVER (PARTITION BY dept),2) AS avg_salary_by_dept,
    MIN(salary) OVER (PARTITION BY dept) AS min_salary_by_dept,
    MAX(salary) OVER (PARTITION BY dept) AS max_salary_by_dept
FROM employees
ORDER BY dept;

Alternatively, we can also create the window and use it in the SELECT clause. In this case, we need to create the window after FROM or WHERE (if exists) clauses:

SELECT
    dept,
    emp_id,
    salary,
    COUNT(emp_id) OVER W AS num_employees_by_dept,
    SUM(salary) OVER W AS total_salary_by_dept,
    ROUND(AVG(salary) OVER W,2) AS avg_salary_by_dept,
    MIN(salary) OVER W AS min_salary_by_dept,
    MAX(salary) OVER W AS max_salary_by_dept
FROM employees
WINDOW W AS (PARTITION BY dept)
ORDER BY dept;

Output:

2.1. Running Total - Cumulative Sum

You can find the running total or cumulative sum easily with the Window Functions. The following query yields three different running totals based on conditions specified

SELECT
    dept,
    salary,
    SUM(salary) OVER (PARTITION BY dept ORDER BY salary) AS cum_sum_dept,
    SUM(salary) OVER (ORDER BY dept, salary) AS cum_sum_all,
    SUM(salary) OVER () AS total
FROM employees
ORDER BY 1;

Note that if we omit the ORDER BY clause then the query results in total salary

2.2. Running Average

To obtain running averages we simply replace the SUM function with AVG function in the above query

SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (PARTITION BY dept ORDER BY salary),2) AS run_avg_dept,
    ROUND(AVG(salary) OVER (ORDER BY dept, salary),2) AS run_avg_all,
    ROUND(AVG(salary) OVER (),2) AS run_avg_accross
FROM employees
ORDER BY 1;

Output:

3. WINDOW Functions: ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, NTILE

3.1. ROW_NUMBER, RANK, DENSE_RANK

ROW_NUMBER() displays the number of a given row and has the following features:

  • It starts with 1

  • Numbers the rows according to the ORDER BY part of the window statement.

  • It does not require you to specify a variable within the parentheses

  • If PARTITION BY clause is used, then numbering begins from 1 for each partition

RANK() is similar to ROW_NUMBER() , i.e. it numbers the rows; however, it differs from ROW_NUMBER() when it comes to numbering the rows with the same values. For instance, If we were to number the rows based on salary for each department, in other words we order by salary, we might have employees with the same salaries for within certain departments. In this case, ROW_NUMBER() gives them different numbers, whereas RANK() gives them the same number, i.e. rank.

DENSE_RANK() is the same as RANK() , ie. it also ranks the rows. The difference lies on how they treat the trailing row after ranking the rows with the identical values. RANK() and DENSE_RANK() would give identical rows the same rank, then RANK() would skip the proceeding value(s) and continue to rank the rows, whereas after ranking the identical rows with the same rank DENSE_RANK() wouldn't skip any value(s). It is best to describe what each one does with an example.

In the following query, we order salaries in descending order per deparment and number each value using ROW_NUMBER(), RANK() , and DENSE_RANK(). ROW_NUMBER() does its job and numbers each row per department without any constraints. RANK() , and DENSE_RANK() on the other hand ranks the rows and considers values being equal. For instance, in the Ad Tech team two employees have the same salaries rows 2 and 3; therefore have the same ranks. The following 4th row, however, is ranked as 4 by the RANK() and as 3 by the DENSE_RANK() function:

SELECT 
  dept,
  salary,
  ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS row_number,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rank
FROM employees;

Output:

In the above table, we also see that the same phenomenon is also observed for the Ads Visuals team as well.

There is also PERCENT_RANK function which assigns a rank to each row within each partition according to ORDER BY , as a percentage. It allows us to draw conclusions as 80% of the employees have less than $170000 annual salary.

SELECT 
  dept,
  salary,
  PERCENT_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS perc_rank
FROM employees_full
WHERE dept = 'R&D';

/*
dept    salary  perc_rank
R&D	185000	0
R&D	170000	0.2
R&D	165000	0.4
R&D	160000	0.6
R&D	150000	0.8
R&D	150000	0.8
*/

3.2. LEAD, LAG

Comparing rows with adjacent rows can be valuable, i.e. calculating differences between rows, especially when data is logically ordered. You can employ LAG or LEAD functions to generate new columns that retrieve values from other rows. Simply specify the column to retrieve from and the desired number of rows to pull. LAG function retrieves from preceding rows, while LEAD function retrieves from subsequent rows.

SELECT 
    dept,
    salary,
    LAG(salary, 1) OVER (PARTITION BY dept ORDER BY salary) AS lag1,
    LAG(salary, 2) OVER (PARTITION BY dept ORDER BY salary) AS lag2,
    LEAD(salary, 1) OVER (PARTITION BY dept ORDER BY salary) AS lead1,
    LEAD(salary, 2) OVER (PARTITION BY dept ORDER BY salary ) AS lead2
FROM employees;

Output:

We don't need to specify the row number for neither of the functions, if we were to compare two consequent rows:

LAG(salary, 1) = LAG(salary)

The following query calculates the salary differences of two adjacent rows

SELECT 
    dept,
    salary,
    LAG(salary) OVER (PARTITION BY dept ORDER BY salary) as previous_salary,
    salary - LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS difference
FROM employees;

Output:

3.3. NTILE

NTILE() function help us identify what quartile, quintile, percentile or other subdivisions a given row falls into.

  • It takes one argument: Number of tiles/bins/buckets

  • ORDER BY clause determines which column to be the subdivided

SELECT 
    dept,
    salary,
    NTILE(4) OVER (PARTITION BY dept ORDER BY salary) as quartile,
    NTILE(5) OVER (PARTITION BY dept ORDER BY salary) as quintile,
    NTILE(100) OVER (PARTITION BY dept ORDER BY salary) as percentile
FROM employees
ORDER BY dept, salary;

Output:

Observing the query results, it becomes evident that the percentile column doesn't compute precisely as one might anticipate. In a scenario with only two records and percentile measurements, one would expect one record to represent the 1st percentile and the other to signify the 100th percentile. However, when utilizing the NTILE function, what you would observe is one record assigned to the 1st percentile and another to the 2nd percentile. Therefore, If your dataset is very small, consider using only quartiles or other small tiles.

For other Window Functions, please check PostgreSQL page.

4. Adjusting Window Size

So far, in all of our calculations we take the dataset as a whole. From time to time, we might be interested in at looking at smaller windows, i.e. calculating 7 day moving average of streaming performance of a song or playlist. In this case, we need to change the window size, which is done after the ORDER BY clause as following:

ORDER BY <variable> ROWS BETWEEN window_start AND window_end

OVER(ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) -- moving monthly average in a year

4.1. Moving Averages

4.1.1. Excluding Current Row

To demonstrate, we will be calculating the moving average of 3 preceding salaries in the following query:

SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (ORDER BY salary ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING),2) AS moving_avg_salary
FROM employees;

ORDER BY salary ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING clause essentially sorts the salary values in ascending order then looks at the previous 3 salaries (excluding the current salary) and take the average of those 3 values. If there are less than 3 values, i.e. 2 values only, the function will take the average of those 2 values.

4.1.2. Including Current Row

If we would like to calculate the moving averages including the current row we will then slightly adjust the above query to include the CURRENT ROW and decrease the preceding row count by 1

SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS moving_avg_salary
FROM employees;

Last updated