SQL Window Functions
We will be using the following table for the examples in this section:
Output:
1
Becky A.
Software Engineer
7
R&D
150000
2
Dan B.
Software Engineer
7
R&D
150000
3
Sharon F.
Software Engineer
7
R&D
160000
4
Dan M.
Software Engineer
7
R&D
165000
5
Malcom S.
Software Engineer
7
R&D
170000
6
Tylar S.
Data Scientist
7
Ad Tech
150000
7
Sherman D.
3D Artist
6
Ads Visuals
120000
8
Jakob J.
3D Artist
6
Ads Visuals
120000
9
Lillia A.
Data Scientist
5
Ad Tech
170000
10
Brandon J.
Data Scientist
5
Ad Tech
170000
11
Scott K.
Manager
5
Ads Visuals
175000
12
Shirlee M.
Manager
3
R&D
185000
13
Yancy I.
3D Artist
3
Ads Visuals
85000
14
Oliver P.
3D Artist
3
Ads Visuals
80000
15
Sedar S.
Data Scientist
3
Ad Tech
165000
1. What is a Window Function?
Definition by PostgreSQL
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
2. WINDOWs with REGULAR/AGGREGATE Functions
The following query creates several new columns and display metrics, including total number of employees, total salary paid to employees, average salary, and so on, next to each row of employee salary column with respect to their departments.
Alternatively, we can also create the window and use it in the SELECT clause. In this case, we need to create the window after FROM or WHERE (if exists) clauses:
Output:
2.1. Running Total - Cumulative Sum
You can find the running total or cumulative sum easily with the Window Functions. The following query yields three different running totals based on conditions specified
Note that if we omit the ORDER BY clause then the query results in total salary. Additionally, depending on the Window Function's setup, employees with the same salary will have the same cumulative sum value.
2.2. Running Average
To obtain running averages we simply replace the SUM function with AVG function in the above query:
Output:
3. WINDOW Functions: ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, NTILE
3.1. ROW_NUMBER, RANK, DENSE_RANK
ROW_NUMBER()
displays the number of a given row and has the following features:
It starts with 1
Numbers the rows according to the
ORDER BY
part of the window statement.It does not require you to specify a variable within the parentheses
If
PARTITION BY
clause is used, then numbering begins from 1 for each partition
RANK()
is similar to ROW_NUMBER()
, i.e. it numbers the rows; however, it differs from ROW_NUMBER()
when it comes to numbering the rows with the same values. For instance, If we were to number the rows based on salary for each department, in other words we order by salary
, we might have employees with the same salaries for within certain departments. In this case, ROW_NUMBER()
gives them different numbers, whereas RANK()
gives them the same number, i.e. rank.
DENSE_RANK()
is the same as RANK()
, ie. it also ranks the rows. The difference lies on how they treat the trailing row after ranking the rows with the identical values. RANK()
and DENSE_RANK()
would give identical rows the same rank, then RANK()
would skip the proceeding value(s) and continue to rank the rows, whereas after ranking the identical rows with the same rank DENSE_RANK()
wouldn't skip any value(s). It is best to describe what each one does with an example.
In the following query, we order salaries in descending order per deparment and number each value using ROW_NUMBER()
, RANK()
, and DENSE_RANK()
. ROW_NUMBER()
does its job and numbers each row per department without any constraints. RANK()
, and DENSE_RANK()
on the other hand ranks the rows and considers values being equal. For instance, in the Ad Tech team two employees have the same salaries rows 2 and 3; therefore have the same ranks. The following 4th row, however, is ranked as 4 by the RANK()
and as 3 by the DENSE_RANK()
function:
Output:
In the above table, we also see that the same phenomenon is also observed for the Ads Visuals team as well.
There is also PERCENT_RANK
function which assigns a rank to each row within each partition according to ORDER BY , as a percentage. It allows us to draw conclusions as 80% of the employees have less than $170000 annual salary.
3.2. LEAD, LAG
Comparing rows with adjacent rows can be valuable, i.e. calculating differences between rows, especially when data is logically ordered. You can employ LAG
or LEAD
functions to generate new columns that retrieve values from other rows. Simply specify the column to retrieve from and the desired number of rows to pull. LAG
function retrieves from preceding rows, while LEAD
function retrieves from subsequent rows.
Output:
We don't need to specify the row number for neither of the functions, if we were to compare two consequent rows:
LAG(salary, 1) = LAG(salary)
The following query calculates the salary differences of two adjacent rows
Output:
3.3. NTILE
NTILE()
function helps us identify what quartile, quintile, percentile or other subdivisions a given row falls into.
It takes one argument: Number of tiles/bins/buckets
ORDER BY
clause determines which column to be the subdivided
Output:
Observing the query results, it becomes evident that the percentile column doesn't compute precisely as one might anticipate. In a scenario with only two records and percentile measurements, one would expect one record to represent the 1st percentile and the other to signify the 100th percentile. However, when utilizing the NTILE function, what you would observe is one record assigned to the 1st percentile and another to the 2nd percentile. Therefore, If your dataset is very small, consider using only quartiles or other small tiles.
For other Window Functions, please check PostgreSQL page.
4. Adjusting Window Size
So far, in all of our calculations we take the dataset as a whole. From time to time, we might be interested in at looking at smaller windows, i.e. calculating 7 day moving average of streaming performance of a song or playlist. In this case, we need to change the window size, which is done after the ORDER BY
clause as following:
ORDER BY
<variable>
ROWS
BETWEEN
window_start
AND
window_end
OVER(ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) -- moving monthly average in a year
4.1. Moving Averages
4.1.1. Excluding Current Row
To demonstrate, we will be calculating the moving average of 3 preceding salaries in the following query:
ORDER BY salary ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING clause essentially sorts the salary values in ascending order then looks at the previous 3 salaries (excluding the current salary) and take the average of those 3 values. If there are less than 3 values, i.e. 2 values only, the function will take the average of those 2 values.
4.1.2. Including Current Row
If we would like to calculate the moving averages including the current row we will then slightly adjust the above query to include the CURRENT ROW and decrease the preceding row count by 1
Last updated