Aggregations occur only vertically. If you perform a calculation across rows, you will simply do basic arithmetic:
# Aggregation
SELECT
SUM(sneakers_sales) AS total_sneakers_sales_2022
FROM sales
WHERE sales_year = 2022
# Arithmetic
SELECT
sneakers_sales + running_shoes_sales AS shoe_sales_by_month
FROM sales
WHERE sales_year = 2022
Note: You would not need to worry as much about the presence of NULLs with SUM as you would with COUNT, as SUM treats nulls as 0.
AVG can only be used with numerical data and ignores NULLs completely.
The difference can be easier to understood in the context of an example:
-- Query 1
SELECT
AVG(low) as avg_low1,
SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price;
-- NOTE: avg_low1 is not equal to avg_low2
-- Query 2
SELECT
AVG(low) as avg_low1,
SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price
WHERE low IS NOT NULL;
-- NOTE: avg_low1 is equal to avg_low2
-- Query 3
SELECT
AVG(low) as avg_low1,
SUM(low)/COUNT(low) as avg_low2 -- Counting only rows in column 'low'
FROM aapl_historical_stock_price
-- NOTE: avg_low1 is equal to avg_low2
COUNT(*) vs COUNT(1)
You might have seen (or may see) usage of COUNT(1) but not COUNT(*)? Is there a difference between the two? NO, they’re exactly the same. (You can even use COUNT(999) and will get the same answer! Inside the COUNT() function the numbers are not treated as columns like they do with GROUP BY, HAVING, ORDER BY clauses). However, It is a better practice to use COUNT(*), as the COUNT(1) may cause confusion and it’s more commonly used
SELECT
MAX(close - open) AS highest_single_day_increase
FROM aapl_historical_stock_price;
2. CASE
Goes in the SELECT clause (can also be in GROUP BY)
Must include: WHEN, THEN, and END. ELSE is optional.
Multiple conditional statements may be added between WHEN and THEN using AND and OR.
Multiple WHEN statements, as well as an ELSE statement can be used to deal with any unaddressed conditions.
Create CASE statements that don't overlap.
SELECT
close,
CASE
WHEN close > 500 THEN 'over 500'
WHEN close > 250 AND close <= 500 THEN '251-500'
WHEN close > 100 AND close <= 250 THEN '101-250'
ELSE '100 or under'
END AS close_price_group
FROM aapl_historical_stock_price
A. CASE and AGGREGATION
The following query shows the number of songs breakdown by the a group's type. If the group consists of only one singer it is considered as 'Single', if it is a band then it is 'Band', and if it is a collaborative work of various artists, then it is categorized and 'Collaboration'.
SELECT
CASE
WHEN group_name NOT ILIKE '%BAND%'
AND group_name NOT ILIKE '%feat%'
AND group_name NOT ILIKE '%with%'
AND group_name NOT ILIKE '%and%'
AND group_name != 'Foster The People'
AND group_name != 'The Black Eyed Peas'
AND group_name != 'LMFAO'
AND group_name != 'OneRepublic'
AND group_name != 'Hot Chelle Rae'
AND group_name != 'Thompson Square'
THEN 'Single'
WHEN group_name ILIKE '%BAND%'
OR group_name = 'Foster The People'
OR group_name = 'The Black Eyed Peas'
OR group_name = 'LMFAO'
OR group_name = 'OneRepublic'
OR group_name = 'Hot Chelle Rae'
OR group_name = 'Thompson Square'
THEN 'Band'
ELSE 'Collaborations'
END AS group_type,
COUNT(DISTINCT group_name) AS num_artists
FROM billboard_top_100_year_end t
WHERE year=2011
GROUP BY 1;
Similar task with Apple's stock prices:
SELECT
CASE
WHEN close > 500 THEN 'over 500'
WHEN close > 250 AND close <= 500 THEN '251-500'
WHEN close > 100 AND close <= 250 THEN '101-250'
ELSE '100 or under'
END AS close_price_group,
COUNT(*) AS count
FROM aapl_historical_stock_price
GROUP BY 1;
3. PIVOTING
Pivoting is displaying a dataset horizontally that is originally in a vertical structure, or vice-versa.
3.1. VERTICAL to HORIZONTAL (Rows to Columns)
Horizontally re-orienting, i.e. pivoting, the above query :
SELECT
COUNT(CASE WHEN close > 500 THEN 1 ELSE NULL END) AS cnt_over500,
COUNT(CASE WHEN close > 250 AND close <= 500 THEN 1 ELSE NULL END) AS cnt_251_500,
COUNT(CASE WHEN close > 100 AND close <= 250 THEN 1 ELSE NULL END) AS cnt_101_250,
COUNT(CASE WHEN close <= 100 THEN 1 ELSE NULL END) AS cnt_100_under
FROM aapl_historical_stock_price;
Transitioning from horizontal to vertical orientation can be a substantially more difficult depending on the circumstances.
Another example school names. We will be displaying the number of schools with respect to their first letter in their names:
SELECT
CASE
WHEN school_name < 'n' THEN 'A-M'
WHEN school_name >= 'n' THEN 'N-Z'
ELSE NULL
END AS school_name_group,
COUNT(*) AS school_cnt
FROM usa_schools
GROUP BY 1
ORDER BY 1;
We can pivot the query as:
SELECT
COUNT(CASE WHEN LEFT(school_name,1) BETWEEN 'A' AND 'M' THEN 'A-M' ELSE NULL END) AS cnt_am,
COUNT(CASE WHEN LEFT(school_name,1) BETWEEN 'N' AND 'Z' THEN 'N-Z' ELSE NULL END) AS cnt_nz
FROM usa_schools;
3.2. HORIZONTAL TO VERTICAL (Columns to Rows)
Considering the structure of horizontal tables, we might not be able to answer higher level questions like what is the average sales of a certain group of products or the average number of earthquakes for a given time period. Let s put the latter case in an example. Consider we have a table called earthquakesand we would like to find the average number of earthquakes occurred on a year between 2000 and 2012:
We can answer the question as follows given the current structure of the table:
SELECT
ROUND(AVG(year_2000),2) AS avg_eq_cnt_2000,
ROUND(AVG(year_2001),2) AS avg_eq_cnt_2001,
ROUND(AVG(year_2002),2) AS avg_eq_cnt_2002,
ROUND(AVG(year_2003),2) AS avg_eq_cnt_2003,
ROUND(AVG(year_2004),2) AS avg_eq_cnt_2004,
ROUND(AVG(year_2005),2) AS avg_eq_cnt_2005,
ROUND(AVG(year_2006),2) AS avg_eq_cnt_2006,
ROUND(AVG(year_2007),2) AS avg_eq_cnt_2007,
ROUND(AVG(year_2008),2) AS avg_eq_cnt_2008,
ROUND(AVG(year_2009),2) AS avg_eq_cnt_2009,
ROUND(AVG(year_2010),2) AS avg_eq_cnt_2010,
ROUND(AVG(year_2011),2) AS avg_eq_cnt_2011,
ROUND(AVG(year_2012),2) AS avg_eq_cnt_2012
FROM earthquakes
WHERE magnitude <> 'No Magnitude';
Output:
Though the output is not too bad, it is not very easy to follow either. The better way could be having the data in only 3 columns showing the year, magnitude, and number of earthquakes in separate rows. Though it takes more effort to do such conversion, the outcome is worth it. The horizontal to vertical pivoting in our case will take three steps:
First we need to create a new table that lists all of the columns from the original table, i.e. earthquakes, as rows.
-- STEP 1
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year);
Then cross join it with the original table to create an expanded view. Notice that each row in the earthquakes table is replicated 13 times; the resulting table has 130 rows (13 yr x 10 magnitude categories) and 130rows x 16cols = 2080 records
SELECT years.*,
e.*
FROM earthquakes e
-- STEP 2
CROSS JOIN (
-- STEP 1
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years;
Finally, we use a CASE statement that pulls data from the correct column in the original table given the value in the year column:
SELECT
years.*,
e.magnitude,
-- STEP 3
CASE year -- important!
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL
END AS earthquake_cnt
FROM earthquakes e
-- STEP 2
CROSS JOIN (
-- STEP 1
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years
And our table is ready! Below is an excerpt from the query:
Using a VIEW or by storing our new table in database, we can answer the question. above. The following query stores the new table as earthquakes_vertical:
SELECT year, magnitude, earthquake_cnt INTO earthquakes_vertical
FROM (
SELECT years.*,
e.magnitude,
-- STEP 3
CASE year -- important!
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL
END AS earthquake_cnt
FROM earthquakes e
-- STEP 2
CROSS JOIN (
-- STEP 1
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years
) t;
It is time to answer the question:
SELECT
year,
ROUND(AVG(earthquake_cnt),2) avg_earthquake_cnt
FROM earthquakes_vertical
WHERE magnitude <> 'No Magnitude'
GROUP BY 1
ORDER BY 1;
Output:
The same result, but in a better format!
4. AGGREAGATION vs ARITHMETIC vs PIVOTING
Let's combine what we've seen so far. The following query displays the number of players in each state, based on their year in college, in terms of FR, SO, JR, and SR, in separate columns and another column that sums the the total number of players. Results are ordered by states that have the most players to the least number of players.
SELECT
state,
COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_cnt,
COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_cnt,
COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_cnt,
COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_cnt,
(COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) +
COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) +
COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) +
COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END)) AS total_players
FROM college_football_players
GROUP BY 1
ORDER BY 6 DESC;
The arithmetic shown in the last column is redundant, simply because counting all rows based on the states will suffice. Therefore, last calculation can be replaced by the following line as long as we aggregate on state:
COUNT(*) AS total_players
The query becomes:
SELECT
state,
COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_cnt,
COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_cnt,
COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_cnt,
COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_cnt,
COUNT(*) AS total_players
FROM college_football_players
GROUP BY 1
ORDER BY total_players DESC;
5. DISTINCT Values
In real-world scenarios, you will probably need to perform several preliminary queries to be able find out the best way for answering your initial question. Looking at the unique values on each column can help identify how you might want to group or filter the data. The following query counts the number of unique months per year:
SELECT
year,
COUNT(DISTINCT month) as cnt_unique_mths
FROM aapl_historical_stock_price
GROUP BY 1 ORDER BY 1;
DISTINCT is only added once in theSELECTclause, i.e. there is no need to add it for each column. The results will show all of the unique values for those columns
6. CASTING: Changing Data Types
Can achieve this with two different type of syntax:
CAST(column_name AS integer)
column_name::integer
which produce the same result.
replace integer with any other data type that would make sense for that column—all values in a given column must fit with the new data types.
column_name::varchar
7. JOINs vs Subquery
-- JOIN
SELECT *
FROM playlist p1
JOIN ( SELECT DISTINCT date
FROM playlist
ORDER BY date DESC
LIMIT 5
) p2
ON p1.date = p2.date
-- Subquery
SELECT *
FROM playlist
WHERE date IN (SELECT DISTINCT date
FROM playlist
ORDER BY date DESC
LIMIT 5)
8. COALESCE: Dealing with NULL values:
The COALESCE function is used in SQL to return the first non-null value in a list of expressions. This is particularly useful for dealing with scenarios where data may be incomplete, and you want to use a backup value when nulls are present.
Syntax:
-- replace NULLs with numeric or string data
COALESCE(column2Bupdated, value2add) -- (sales, 0) or (content, 'N/A')F
Example usage:
SELECT
COALESCE(column1, column2, 'default_value') AS result
FROM my_table;
In this example, result will be the value of column1 if it's not null, otherwise column2, and if both are null, it will be 'default value'. COALESCE is used especially for numerical datasets and when performing Outer JOINs to replace nulls with 0s due to unmatched rows.