# SQL Descriptive Stats

Descriptive statistics provide a way to summarize and understand the main characteristics of a data set. PostgreSQL offers several functions that help perform descriptive statistical analysis directly within SQL. Some of these functions are `PERCENTILE_DISC`, `PERCENTILE_CONT`, and `MODE`, along with more common statistical/[aggregate functions](https://dshub.gitbook.io/ds-hub/sql/sql-aggregate-functions) like `AVG`, `SUM`, `MIN`, and `MAX`.

## **Understanding Descriptive Statistics**

Descriptive statistics describe the main features of a collection of data quantitatively. They are used to summarize data sets, and they include measures such as:

1. **Count -** number of rows
2. **Mean** (Average) - average value of a numeric column
3. **Sum -** total of a numeric column
4. **Minimum and Maximum Values -** smallest and largest value of the selected column
5. **Percentiles -** are measures that divide a dataset into 100 equal parts
6. **Mode - v**alue that appears most frequently

### **Setting Up Your Data**

Let's consider a simple table named `sales` that contains sales data:

```sql
CREATE TABLE sales (
    id SERIAL PRIMARY KEY,
    amount NUMERIC
);

INSERT INTO sales (amount) VALUES
(10.0), (20.0), (10.0), (50.0), (30.0), (20.0), (40.0), (50.0), (30.0), (60.0);
```

**Number of Rows**

To `COUNT()` function yields the number of rows in a table.&#x20;

**Example:**

```sql
SELECT COUNT(*) AS total_rows
FROM sales;
```

This query returns the total number of rows in the `sales` table.

**Mean (Average)**

The `AVG()` function calculates the mean of a numeric column.

**Example:**

```sql
SELECT AVG(amount) AS average_sale
FROM sales;
```

This query returns the average value of the `amount` column in the `sales` table.

**Sum**

The `SUM()` function calculates the total sum of a numeric column.

**Example:**

```sql
SELECT SUM(amount) AS total_sales
FROM sales;
```

This query returns the total sum of the `amount` column in the `sales` table.

**Minimum and Maximum Values**

The `MIN()` and `MAX()` functions return the smallest and largest values in a column, respectively.

**Example:**

```sql
SELECT MIN(amount) AS smallest_sale, MAX(amount) AS largest_sale
FROM sales;
```

This query returns the smallest and largest values in the `amount` column in the `sales` table.

**Percentiles**

Percentiles are measures that divide a dataset into 100 equal parts. PostgreSQL provides two functions for calculating percentiles: `PERCENTILE_DISC` and `PERCENTILE_CONT`.

* **PERCENTILE\_DISC**: Discrete percentile calculation.
* **PERCENTILE\_CONT**: Continuous percentile calculation.

Both functions are used with the `WITHIN GROUP` clause.

**Example:**

```sql
SELECT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) AS median_disc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_cont
FROM sales;
```

This query calculates the median (50th percentile) of the `amount` column using both discrete and continuous methods.

{% hint style="info" %}
The `PERCENTILE_DISC()` function returns a value from the input dataset that is the closest to the percentile requested. The value returned will actually exist in the set.&#x20;

The `PERCENTILE_CONT()` function returns an interpolated value between multiple values based on the distribution. The value returned may or may not exist in the set.&#x20;
{% endhint %}

When to Use Which?

* **PERCENTILE\_DISC**: Use when the exact value from the dataset is important, such as when working with categorical data or when you need an actual observation.
* **PERCENTILE\_CONT**: Use when a more precise value is needed, such as when working with continuous data, and the percentile may not correspond directly to an actual observation in the dataset.

**Mode**

The mode is the value that appears most frequently in a dataset. Similar to percentile functions, the `MODE()` function is also used with the `WITHIN GROUP` clause.

**Example:**

```sql
SELECT MODE() WITHIN GROUP (ORDER BY amount) AS mode_amount
FROM sales;
```

This query returns the mode of the `amount` column in the `sales` table.

## **Combining Descriptive Statistics**

You can combine multiple descriptive statistics in a single query to get a comprehensive summary of your data.

**Example:**

```sql
SELECT
    AVG(amount) AS average_sale,
    SUM(amount) AS total_sales,
    MIN(amount) AS smallest_sale,
    MAX(amount) AS largest_sale,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) AS median_disc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_cont,
    MODE() WITHIN GROUP (ORDER BY amount) AS mode_amount
FROM sales;
```

This query provides a complete summary of the `amount` column, including the average, total sum, minimum, maximum, median (both discrete and continuous), and mode.

## **Conclusion**

Descriptive statistics in PostgreSQL can be efficiently performed using built-in SQL functions. These functions help you summarize and understand your data directly within the database. By utilizing functions like `AVG`, `SUM`, `MIN`, `MAX`, `PERCENTILE_DISC`, `PERCENTILE_CONT`, and `MODE`, you can perform a comprehensive statistical analysis of your data sets. Understanding and applying these functions will enhance your data analysis capabilities in PostgreSQL.
