SQL Descriptive Stats
Descriptive statistics provide a way to summarize and understand the main characteristics of a data set. PostgreSQL offers several functions that help perform descriptive statistical analysis directly within SQL. Some of these functions are PERCENTILE_DISC, PERCENTILE_CONT, and MODE, along with more common statistical/aggregate functions like AVG, SUM, MIN, and MAX.
Understanding Descriptive Statistics
Descriptive statistics describe the main features of a collection of data quantitatively. They are used to summarize data sets, and they include measures such as:
Count - number of rows
Mean (Average) - average value of a numeric column
Sum - total of a numeric column
Minimum and Maximum Values - smallest and largest value of the selected column
Percentiles - values at or below which a given percentage of the data falls (for example, the 50th percentile is the median)
Mode - value that appears most frequently
Setting Up Your Data
Let's consider a simple table named sales that contains sales data:
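A minimal sketch of such a table follows. Only the table name sales and the amount column come from the text; the id and sale_date columns, the column types, and the sample rows are assumptions made for illustration.

```sql
-- Hypothetical schema: only "sales" and "amount" are named in the text.
-- The id and sale_date columns and all rows below are illustrative.
CREATE TABLE sales (
    id        SERIAL PRIMARY KEY,
    sale_date DATE NOT NULL,
    amount    NUMERIC(10, 2) NOT NULL
);

-- Illustrative rows, including a repeated amount so that a mode exists.
INSERT INTO sales (sale_date, amount) VALUES
    ('2024-01-01', 100.00),
    ('2024-01-02', 150.00),
    ('2024-01-03', 150.00),
    ('2024-01-04', 200.00),
    ('2024-01-05', 300.00);
```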
Number of Rows
The COUNT() function returns the number of rows in a table.
Example:
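A minimal sketch of a row count, assuming the sales table described above (the alias total_rows is illustrative):

```sql
-- COUNT(*) counts every row, including rows that contain NULLs.
SELECT COUNT(*) AS total_rows
FROM sales;
```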
This query returns the total number of rows in the sales table.
Mean (Average)
The AVG() function calculates the mean of a numeric column.
Example:
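A minimal sketch, assuming sales has a numeric amount column (the alias average_amount is illustrative):

```sql
-- AVG ignores NULL values when computing the mean.
SELECT AVG(amount) AS average_amount
FROM sales;
```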
This query returns the average value of the amount column in the sales table.
Sum
The SUM() function calculates the total of a numeric column.
Example:
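A minimal sketch, again assuming a numeric amount column (the alias total_amount is illustrative):

```sql
-- SUM adds up all non-NULL values in the column.
SELECT SUM(amount) AS total_amount
FROM sales;
```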
This query returns the total of the amount column in the sales table.
Minimum and Maximum Values
The MIN() and MAX() functions return the smallest and largest values in a column, respectively.
Example:
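A minimal sketch that retrieves both values in one pass over the table (the aliases are illustrative):

```sql
-- Both aggregates can be computed in the same SELECT.
SELECT
    MIN(amount) AS min_amount,
    MAX(amount) AS max_amount
FROM sales;
```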
This query returns the smallest and largest values in the amount column in the sales table.
Percentiles
Percentiles divide an ordered dataset into 100 equal parts: the pth percentile is the value at or below which p percent of the data falls. PostgreSQL provides two functions for calculating percentiles: PERCENTILE_DISC and PERCENTILE_CONT.
PERCENTILE_DISC: Discrete percentile calculation.
PERCENTILE_CONT: Continuous percentile calculation.
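Both are ordered-set aggregate functions, so they are used with the WITHIN GROUP (ORDER BY ...) clause, which specifies how the input values are sorted. The percentile is passed as a fraction between 0 and 1.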
Example:
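A minimal sketch, assuming the sales table with a numeric amount column (the aliases median_disc and median_cont are illustrative):

```sql
-- 0.5 is the fraction for the 50th percentile (the median).
SELECT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) AS median_disc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_cont
FROM sales;
```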
This query calculates the median (50th percentile) of the amount column using both discrete and continuous methods.
The PERCENTILE_DISC() function returns the first value in the sorted input whose position in the ordering equals or exceeds the requested fraction. The value returned always exists in the set.
The PERCENTILE_CONT() function returns a value linearly interpolated between the two adjacent values in the sorted input, if the requested fraction falls between them. The value returned may or may not exist in the set.
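The difference is easiest to see on an even number of values. The sketch below uses an inline list of four integers rather than the sales table; the names t, x, median_disc, and median_cont are illustrative.

```sql
-- Four values: the median falls between 2 and 3.
SELECT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY x) AS median_disc,  -- 2
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) AS median_cont   -- 2.5
FROM (VALUES (1), (2), (3), (4)) AS t(x);
```

Here the discrete version returns 2, the first value whose cumulative position reaches 50%, while the continuous version interpolates halfway between 2 and 3.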
When to Use Which?
PERCENTILE_DISC: Use when the result must be an actual observation from the dataset, or when the column is not numeric (it works with any sortable data type).
PERCENTILE_CONT: Use with continuous numeric data when an interpolated result is acceptable, since the percentile may not correspond directly to an actual observation in the dataset.
Mode
The mode is the value that appears most frequently in a dataset. Similar to the percentile functions, the MODE() function is used with the WITHIN GROUP clause. If several values are equally frequent, PostgreSQL returns the first of them in the sort order.
Example:
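A minimal sketch, assuming the sales table with an amount column (the alias mode_amount is illustrative):

```sql
-- MODE takes no direct argument; the column goes in the ORDER BY.
SELECT MODE() WITHIN GROUP (ORDER BY amount) AS mode_amount
FROM sales;
```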
This query returns the mode of the amount column in the sales table.
Combining Descriptive Statistics
You can combine multiple descriptive statistics in a single query to get a comprehensive summary of your data.
Example:
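A minimal sketch of a combined summary, assuming the sales table with a numeric amount column (all aliases are illustrative):

```sql
-- Regular and ordered-set aggregates can be mixed in one SELECT.
SELECT
    AVG(amount)                                         AS average_amount,
    SUM(amount)                                         AS total_amount,
    MIN(amount)                                         AS min_amount,
    MAX(amount)                                         AS max_amount,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) AS median_disc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_cont,
    MODE() WITHIN GROUP (ORDER BY amount)               AS mode_amount
FROM sales;
```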
This query provides a complete summary of the amount column, including the average, total sum, minimum, maximum, median (both discrete and continuous), and mode.
Conclusion
Descriptive statistics in PostgreSQL can be efficiently performed using built-in SQL functions. These functions help you summarize and understand your data directly within the database. By utilizing functions like AVG, SUM, MIN, MAX, PERCENTILE_DISC, PERCENTILE_CONT, and MODE, you can perform a comprehensive statistical analysis of your data sets. Understanding and applying these functions will enhance your data analysis capabilities in PostgreSQL.