# SQL Window Functions

We will be using the following table for the examples in this section:

<details>

<summary>Employees Table</summary>

```sql
CREATE TABLE IF NOT EXISTS employees (
emp_id INTEGER PRIMARY KEY,
emp_name VARCHAR(100) NOT NULL,
emp_role VARCHAR(50) NOT NULL,
years_employed SMALLINT,
dept VARCHAR(50) NOT NULL,
salary INTEGER
);

INSERT INTO employees_full VALUES (1,'Becky A.', 'Software Engineer', 7, 'R&D',150000);
INSERT INTO employees_full VALUES (2,'Dan B.', 'Software Engineer', 7, 'R&D',150000);
INSERT INTO employees_full VALUES (3,'Sharon F.', 'Software Engineer', 7, 'R&D',160000);
INSERT INTO employees_full VALUES (4,'Dan M.', 'Software Engineer', 7, 'R&D',165000);
INSERT INTO employees_full VALUES (5,'Malcom S.', 'Software Engineer', 7, 'R&D',170000);
INSERT INTO employees_full VALUES (6,'Tylar S.', 'Data Scientist', 7, 'Ad Tech',150000);
INSERT INTO employees_full VALUES (7,'Sherman D.', '3D Artist', 6, 'Ads Visuals',120000);
INSERT INTO employees_full VALUES (8,'Jakob J.', '3D Artist', 6, 'Ads Visuals',120000);
INSERT INTO employees_full VALUES (9,'Lillia A.', 'Data Scientist', 5, 'Ad Tech',170000);
INSERT INTO employees_full VALUES (10,'Brandon J.', 'Data Scientist', 5, 'Ad Tech',170000);
INSERT INTO employees_full VALUES (11,'Scott K.', 'Manager', 5, 'Ad Tech',175000);
INSERT INTO employees_full VALUES (12,'Shirlee M.', 'Manager', 3, 'R&D',185000);
INSERT INTO employees_full VALUES (13,'Yancy I.', '3D Artist', 3, 'Ads Visuals',85000);
INSERT INTO employees_full VALUES (14,'Oliver P.', '3D Artist', 3, 'Ads Visuals',80000);
INSERT INTO employees_full VALUES (15,'Sedar S.', 'Data Scientist', 3, 'Ad Tech',165000);
```

</details>

Output:

<table><thead><tr><th width="108">emp_id</th><th>emp_name</th><th>emp_role</th><th width="155">years_employed</th><th width="125">dept</th><th>salary</th></tr></thead><tbody><tr><td>1</td><td>Becky A.</td><td>Software Engineer</td><td>7</td><td>R&#x26;D</td><td>150000</td></tr><tr><td>2</td><td>Dan B.</td><td>Software Engineer</td><td>7</td><td>R&#x26;D</td><td>150000</td></tr><tr><td>3</td><td>Sharon F.</td><td>Software Engineer</td><td>7</td><td>R&#x26;D</td><td>160000</td></tr><tr><td>4</td><td>Dan M.</td><td>Software Engineer</td><td>7</td><td>R&#x26;D</td><td>165000</td></tr><tr><td>5</td><td>Malcom S.</td><td>Software Engineer</td><td>7</td><td>R&#x26;D</td><td>170000</td></tr><tr><td>6</td><td>Tylar S.</td><td>Data Scientist</td><td>7</td><td>Ad Tech</td><td>150000</td></tr><tr><td>7</td><td>Sherman D.</td><td>3D Artist</td><td>6</td><td>Ads Visuals</td><td>120000</td></tr><tr><td>8</td><td>Jakob J.</td><td>3D Artist</td><td>6</td><td>Ads Visuals</td><td>120000</td></tr><tr><td>9</td><td>Lillia A.</td><td>Data Scientist</td><td>5</td><td>Ad Tech</td><td>170000</td></tr><tr><td>10</td><td>Brandon J.</td><td>Data Scientist</td><td>5</td><td>Ad Tech</td><td>170000</td></tr><tr><td>11</td><td>Scott K.</td><td>Manager</td><td>5</td><td>Ads Visuals</td><td>175000</td></tr><tr><td>12</td><td>Shirlee M.</td><td>Manager</td><td>3</td><td>R&#x26;D</td><td>185000</td></tr><tr><td>13</td><td>Yancy I.</td><td>3D Artist</td><td>3</td><td>Ads Visuals</td><td>85000</td></tr><tr><td>14</td><td>Oliver P.</td><td>3D Artist</td><td>3</td><td>Ads Visuals</td><td>80000</td></tr><tr><td>15</td><td>Sedar S.</td><td>Data Scientist</td><td>3</td><td>Ad Tech</td><td>165000</td></tr></tbody></table>

## 1. What is a Window Function?

Definition by [PostgreSQL](https://www.postgresql.org/docs/9.1/tutorial-window.html)&#x20;

> A ***window function*** performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

## 2. WINDOWs with REGULAR/AGGREGATE Functions

```sql
-- syntax: RegFunc(columnX) OVER (PARTITION BY columnY ORDER BY columnZ)
-- example:
COUNT(*) OVER (PARTITION BY column1 ORDER BY column2)

```

The following query creates several new columns and display metrics, including total number of employees, total salary paid to employees, average salary, and so on, next to each row of employee salary column with respect to their departments.

```sql
SELECT
    dept,
    emp_id,
    salary,
    COUNT(emp_id) OVER (PARTITION BY dept) AS num_employees_by_dept,
    SUM(salary) OVER (PARTITION BY dept) AS total_salary_by_dept,
    ROUND(AVG(salary) OVER (PARTITION BY dept),2) AS avg_salary_by_dept,
    MIN(salary) OVER (PARTITION BY dept) AS min_salary_by_dept,
    MAX(salary) OVER (PARTITION BY dept) AS max_salary_by_dept
FROM employees
ORDER BY dept;
```

Alternatively, we can also create the window and use it in the SELECT clause. In this case, we need to create the window after FROM or WHERE (if exists) clauses:

```sql
SELECT
    dept,
    emp_id,
    salary,
    COUNT(emp_id) OVER W AS num_employees_by_dept,
    SUM(salary) OVER W AS total_salary_by_dept,
    ROUND(AVG(salary) OVER W,2) AS avg_salary_by_dept,
    MIN(salary) OVER W AS min_salary_by_dept,
    MAX(salary) OVER W AS max_salary_by_dept
FROM employees
WINDOW W AS (PARTITION BY dept)
ORDER BY dept;
```

Output:

<figure><img src="/files/oIY3S1VFJ5EdFn8nnVvh" alt=""><figcaption></figcaption></figure>

### 2.1. Running Total - Cumulative Sum&#x20;

You can find the running total or cumulative sum easily with the Window Functions. The following query yields three different running totals based on conditions specified&#x20;

```sql
SELECT
    dept,
    salary,
    SUM(salary) OVER (PARTITION BY dept ORDER BY salary) AS cum_sum_dept,
    SUM(salary) OVER (ORDER BY dept, salary) AS cum_sum_all,
    SUM(salary) OVER () AS total
FROM employees_full
ORDER BY 1;
```

<figure><img src="/files/eVHQkiXrmekW2wQi54WO" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
Note that if we omit the ORDER BY clause then the query results in total salary. Additionally, depending on the Window Function's setup, employees with the same salary will have the same cumulative sum value.
{% endhint %}

### 2.2. Running Average

To obtain running averages we simply replace the SUM function with AVG function in the above query:

```sql
SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (PARTITION BY dept ORDER BY salary),2) AS run_avg_dept,
    ROUND(AVG(salary) OVER (ORDER BY dept, salary),2) AS run_avg_all,
    ROUND(AVG(salary) OVER (),2) AS run_avg_accross
FROM employees
ORDER BY 1;
```

Output:

<figure><img src="/files/3o9VeGKLqFST0ZQoPfx0" alt="" width="563"><figcaption></figcaption></figure>

## 3. WINDOW Functions: ROW\_NUMBER, RANK, DENSE\_RANK, LEAD, LAG, NTILE

### 3.1. ROW\_NUMBER, RANK, DENSE\_RANK

`ROW_NUMBER()` displays the number of a given row and has the following features:

* It starts with 1 &#x20;
* Numbers the rows according to the `ORDER BY` part of the window statement. &#x20;
* It does not require you to specify a variable within the parentheses
* If `PARTITION BY` clause is used, then numbering begins from 1 for each partition

`RANK()` is similar to `ROW_NUMBER()` , i.e. it numbers the rows; however, it differs from `ROW_NUMBER()` when it comes to numbering the rows with the same values. For instance, If we were to number the rows based on salary for each department, in other words we order by `salary`, we might have employees with the same salaries for within certain departments. In this case, `ROW_NUMBER()` gives them different numbers, whereas `RANK()` gives them the same number, i.e. rank.

`DENSE_RANK()` is the same as `RANK()` , ie. it also ranks the rows. The difference lies on how they treat the trailing row after ranking the rows with the identical values.  `RANK()`  and  `DENSE_RANK()` would give identical rows the same rank, then  `RANK()` would skip the proceeding value(s) and continue to rank the rows, whereas after ranking the identical rows with the same rank `DENSE_RANK()` wouldn't skip any value(s). It is best to describe what each one does with an example.&#x20;

In the following query, we order salaries in descending order per deparment and number each value using `ROW_NUMBER()`, `RANK()` , and `DENSE_RANK()`. `ROW_NUMBER()` does its job and numbers each row per department without any constraints. `RANK()` , and `DENSE_RANK()` on the other hand ranks the rows and considers values being equal. For instance, in the Ad Tech team two employees have the same salaries rows 2 and 3; therefore have the same ranks. The following 4th row, however, is ranked as 4 by the  `RANK()` and as 3 by the `DENSE_RANK()` function:

```sql
SELECT 
  dept,
  salary,
  ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS row_number,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rank
FROM employees;
```

Output:

<figure><img src="/files/7ayC5ljn9hbFhINMhYFZ" alt=""><figcaption></figcaption></figure>

In the above table, we also see that the same phenomenon is also observed for the Ads Visuals team as well.

{% hint style="info" %}
There is also `PERCENT_RANK`  function which assigns a rank to each row within each partition according to ORDER BY , as a percentage. It allows us to draw conclusions as 80% of the employees have less than $170000 annual salary.

&#x20;

```sql
SELECT 
  dept,
  salary,
  PERCENT_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS perc_rank
FROM employees
WHERE dept = 'R&D';

/*
dept    salary  perc_rank
R&D	185000	0
R&D	170000	0.2
R&D	165000	0.4
R&D	160000	0.6
R&D	150000	0.8
R&D	150000	0.8
*/
```

{% endhint %}

### 3.2. LEAD, LAG

Comparing rows with adjacent rows can be valuable, i.e. calculating differences between rows, especially when data is logically ordered. You can employ `LAG` or `LEAD` functions to generate new columns that retrieve values from other rows. Simply specify the column to retrieve from and the desired number of rows to pull. `LAG` function retrieves from preceding rows, while `LEAD` function retrieves from subsequent rows.

```sql
SELECT 
    dept,
    salary,
    LAG(salary, 1) OVER (PARTITION BY dept ORDER BY salary) AS lag1,
    LAG(salary, 2) OVER (PARTITION BY dept ORDER BY salary) AS lag2,
    LEAD(salary, 1) OVER (PARTITION BY dept ORDER BY salary) AS lead1,
    LEAD(salary, 2) OVER (PARTITION BY dept ORDER BY salary ) AS lead2
FROM employees;
```

Output:

<figure><img src="/files/DEiBbjX9b3ML7h1Dt0NK" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
We don't need to specify the row number for neither of the functions, if we were to compare two consequent rows:

**LAG(salary, 1) = LAG(salary)**
{% endhint %}

The following query calculates the salary differences of two adjacent rows

```sql
SELECT 
    dept,
    salary,
    LAG(salary) OVER (PARTITION BY dept ORDER BY salary) as previous_salary,
    salary - LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS difference
FROM employees;
```

Output:

<figure><img src="/files/K41pkmqYtu8y51WiCMBx" alt=""><figcaption></figcaption></figure>

### 3.3. NTILE <a href="#ntile" id="ntile"></a>

`NTILE()` function helps us identify what quartile, quintile, percentile or other subdivisions a given row falls into. &#x20;

* It takes one argument: Number of tiles/bins/buckets&#x20;
* `ORDER BY` clause determines which column to be the subdivided

```sql
SELECT 
    dept,
    salary,
    NTILE(4) OVER (PARTITION BY dept ORDER BY salary) as quartile,
    NTILE(5) OVER (PARTITION BY dept ORDER BY salary) as quintile,
    NTILE(100) OVER (PARTITION BY dept ORDER BY salary) as percentile
FROM employees
ORDER BY dept, salary;
```

Output:

<figure><img src="/files/FkB9jR6aw0yEniT3GNoN" alt=""><figcaption></figcaption></figure>

Observing the query results, it becomes evident that the percentile column doesn't compute precisely as one might anticipate. In a scenario with only two records and percentile measurements, one would expect one record to represent the 1st percentile and the other to signify the 100th percentile. However, when utilizing the NTILE function, what you would observe is one record assigned to the 1st percentile and another to the 2nd percentile. Therefore, If your dataset is very small, consider using only quartiles or other small tiles.

For other Window Functions, please check [PostgreSQL page](https://www.postgresql.org/docs/9.1/functions-window.html).

## 4. Adjusting Window Size

So far, in all of our calculations we take the dataset as a whole. From time to time, we might be interested in at looking at smaller windows, i.e. calculating 7 day moving average of streaming performance of a song or playlist. In this case, we need to change the window size, which is done after the `ORDER BY` clause as following:

**`ORDER BY`**` ``<variable>`` `**`ROWS`` ``BETWEEN`**`window_start`**`AND`**`window_end`

**`OVER(ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) -- moving monthly average in a year`**

### 4.1. Moving Averages

#### 4.1.1. Excluding Current Row

To demonstrate, we will be calculating the moving average of 3 preceding salaries in the following query:&#x20;

```sql
SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (ORDER BY salary ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING),2) AS moving_avg_salary
FROM employees;
```

**ORDER BY salary** **ROWS** **BETWEEN** **3 PRECEDING** **AND** **1** **PRECEDING** clause essentially sorts the salary values in ascending order then looks at the previous 3 salaries (excluding the current salary) and take the average of those 3 values. If there are less than 3 values, i.e. 2 values only, the function will take the average of those 2 values.

<figure><img src="/files/3AOT0G8ZprIATY3LMXXt" alt="" width="375"><figcaption></figcaption></figure>

#### 4.1.2. Including Current Row

If we would like to calculate the moving averages including the current row we will then slightly  adjust the above query to include the CURRENT ROW and decrease the preceding row count by 1

```sql
SELECT
    dept,
    salary,
    ROUND(AVG(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS moving_avg_salary
FROM employees;
```

<figure><img src="/files/ZB1dKi8u4K0WixqDx953" alt="" width="375"><figcaption></figcaption></figure>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://dshub.gitbook.io/ds-hub/sql/sql-window-functions.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
