SQL Tips

1. AGGREGATION

Aggregations occur only vertically. If you perform a calculation across rows, you will simply do basic arithmetic:

# Aggregation
SELECT 
    SUM(sneakers_sales) AS total_sneakers_sales_2022
FROM sales
WHERE sales_year = 2022 


# Arithmetic
SELECT 
    sneakers_sales + running_shoes_sales AS shoe_sales_by_month
FROM sales
WHERE sales_year = 2022 

Note: You would not need to worry as much about the presence of NULLs with SUM as you would with COUNT, as SUM treats nulls as 0.

AVG can only be used with numerical data and ignores NULLs completely.

The difference can be easier to understood in the context of an example:

-- Query 1
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price;
-- NOTE: avg_low1 is not equal to avg_low2

-- Query 2
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price
WHERE low IS NOT NULL;
-- NOTE: avg_low1 is equal to avg_low2

-- Query 3
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(low) as avg_low2 -- Counting only rows in column 'low'
FROM aapl_historical_stock_price
-- NOTE: avg_low1 is equal to avg_low2

COUNT(*) vs COUNT(1)

You might have seen (or may see) usage of COUNT(1) but not COUNT(*)? Is there a difference between the two? NO, they’re exactly the same. (You can even use COUNT(999) and will get the same answer! Inside the COUNT() function the numbers are not treated as columns like they do with GROUP BY, HAVING, ORDER BY clauses). However, It is a better practice to use COUNT(*), as the COUNT(1) may cause confusion and it’s more commonly used

SELECT 
    MAX(close - open) AS highest_single_day_increase 
FROM aapl_historical_stock_price;

2. CASE

  1. Goes in the SELECT clause (can also be in GROUP BY)

  2. Must include: WHEN, THEN, and END. ELSE is optional.

  3. Multiple conditional statements may be added between WHEN and THEN using AND and OR.

  4. Multiple WHEN statements, as well as an ELSE statement can be used to deal with any unaddressed conditions.

Create CASE statements that don't overlap.

SELECT 
     close,
     CASE 
          WHEN close > 500 THEN 'over 500'
          WHEN close > 250 AND close <= 500 THEN '251-500'
          WHEN close > 100 AND close <= 250 THEN '101-250'
          ELSE '100 or under' 
     END AS close_price_group
FROM aapl_historical_stock_price

A. CASE and AGGREGATION

The following query shows the number of songs breakdown by the a group's type. If the group consists of only one singer it is considered as 'Single', if it is a band then it is 'Band', and if it is a collaborative work of various artists, then it is categorized and 'Collaboration'.

SELECT
CASE
    WHEN group_name NOT ILIKE '%BAND%'
    AND group_name NOT ILIKE '%feat%'
    AND group_name NOT ILIKE '%with%'
    AND group_name NOT ILIKE '%and%'
    AND group_name != 'Foster The People' 
    AND group_name != 'The Black Eyed Peas' 
    AND group_name != 'LMFAO' 
    AND group_name != 'OneRepublic' 
    AND group_name != 'Hot Chelle Rae' 
    AND group_name != 'Thompson Square' 
    THEN 'Single'
    WHEN group_name ILIKE '%BAND%' 
    OR group_name = 'Foster The People' 
    OR group_name = 'The Black Eyed Peas' 
    OR group_name = 'LMFAO' 
    OR group_name = 'OneRepublic' 
    OR group_name = 'Hot Chelle Rae' 
    OR group_name = 'Thompson Square' 
    THEN 'Band'
    ELSE 'Collaborations'
END AS group_type,
COUNT(DISTINCT group_name) AS num_artists
FROM billboard_top_100_year_end t
WHERE year=2011
GROUP BY 1;

Similar task with Apple's stock prices:

SELECT 
CASE 
    WHEN close > 500 THEN 'over 500' 
    WHEN close > 250 AND close <= 500 THEN '251-500' 
    WHEN close > 100 AND close <= 250 THEN '101-250' 
    ELSE '100 or under' 
END AS close_price_group, 
COUNT(*) AS count
FROM aapl_historical_stock_price 
GROUP BY 1;

3. PIVOTING

Pivoting is displaying a dataset horizontally that is originally in a vertical structure, or vice-versa.

3.1. VERTICAL to HORIZONTAL (Rows to Columns)

Horizontally re-orienting, i.e. pivoting, the above query :

SELECT 
    COUNT(CASE WHEN close > 500 THEN 1 ELSE NULL END) AS cnt_over500,
    COUNT(CASE WHEN close > 250 AND close <= 500 THEN 1 ELSE NULL END) AS cnt_251_500,
    COUNT(CASE WHEN close > 100 AND close <= 250 THEN 1 ELSE NULL END) AS cnt_101_250,
    COUNT(CASE WHEN close <= 100 THEN 1 ELSE NULL END) AS cnt_100_under
FROM aapl_historical_stock_price;

Transitioning from horizontal to vertical orientation can be a substantially more difficult depending on the circumstances.

Another example school names. We will be displaying the number of schools with respect to their first letter in their names:

SELECT 
    CASE 
        WHEN school_name < 'n' THEN 'A-M' 
        WHEN school_name >= 'n' THEN 'N-Z' 
        ELSE NULL 
    END AS school_name_group, 
    COUNT(*) AS school_cnt 
FROM usa_schools 
GROUP BY 1 
ORDER BY 1;

We can pivot the query as:

SELECT 
COUNT(CASE WHEN LEFT(school_name,1) BETWEEN 'A' AND 'M' THEN 'A-M' ELSE NULL END) AS cnt_am, 
COUNT(CASE WHEN LEFT(school_name,1) BETWEEN 'N' AND 'Z' THEN 'N-Z' ELSE NULL END) AS cnt_nz 
FROM usa_schools;

3.2. HORIZONTAL TO VERTICAL (Columns to Rows)

Considering the structure of horizontal tables, we might not be able to answer higher level questions like what is the average sales of a certain group of products or the average number of earthquakes for a given time period. Let s put the latter case in an example. Consider we have a table called earthquakes and we would like to find the average number of earthquakes occurred on a year between 2000 and 2012:

Earthquakes Table
CREATE TABLE IF NOT EXISTS earthquakes (
magnitude VARCHAR(15),
year_2000 INTEGER,
year_2001 INTEGER,
year_2002 INTEGER,
year_2003 INTEGER,
year_2004 INTEGER,
year_2005 INTEGER,
year_2006 INTEGER,
year_2007 INTEGER,
year_2008 INTEGER,
year_2009 INTEGER,
year_2010 INTEGER,
year_2011 INTEGER,
year_2012 INTEGER,
id SERIAL PRIMARY KEY
);

INSERT INTO earthquakes VALUES ('8.0 to 9.9', 1, 1, 0, 1, 2, 1, 2, 4, 0, 1, 1, 1, 2, 1);
INSERT INTO earthquakes VALUES ('7.0 to 7.9', 14, 15, 13, 14, 14, 10, 9, 14, 12, 16, 23, 19, 12, 2);
INSERT INTO earthquakes VALUES ('6.0 to 6.9', 146, 121, 127, 140, 141, 140, 142, 178, 168, 144, 150, 185, 108, 3);
INSERT INTO earthquakes VALUES ('5.0 to 5.9', 1344, 1224, 1201, 1203, 1515, 1693, 1712, 2074, 1768, 1896, 2209, 2276, 1401, 4);
INSERT INTO earthquakes VALUES ('4.0 to 4.9', 8008, 7991, 8541, 8462, 10888, 13917, 12838, 12078, 12291, 6805, 10164, 13315, 9534, 5);
INSERT INTO earthquakes VALUES ('3.0 to 3.9', 4827, 6266, 7068, 7624, 7932, 9191, 9990, 9889, 11735, 2905, 4341, 2791, 2453, 6);
INSERT INTO earthquakes VALUES ('2.0 to 2.9', 3765, 4164, 6419, 7727, 6316, 4636, 4027, 3597, 3860, 3014, 4626, 3643, 3111, 7);
INSERT INTO earthquakes VALUES ('1.0 to 1.9', 1026, 944, 1137, 2506, 1344, 26, 18, 42, 21, 26, 39, 47, 43, 8);
INSERT INTO earthquakes VALUES ('0.1 to 0.9', 5, 1, 10, 134, 103, 0, 2, 2, 0, 1, 0, 1, 0, 9);
INSERT INTO earthquakes VALUES ('No Magnitude', 3120, 2807, 2938, 3608, 2939, 864, 828, 1807, 1922, 17, 24, 11, 3, 10);

We can answer the question as follows given the current structure of the table:

SELECT
 ROUND(AVG(year_2000),2) AS avg_eq_cnt_2000,
 ROUND(AVG(year_2001),2) AS avg_eq_cnt_2001,
 ROUND(AVG(year_2002),2) AS avg_eq_cnt_2002,
 ROUND(AVG(year_2003),2) AS avg_eq_cnt_2003,
 ROUND(AVG(year_2004),2) AS avg_eq_cnt_2004,
 ROUND(AVG(year_2005),2) AS avg_eq_cnt_2005,
 ROUND(AVG(year_2006),2) AS avg_eq_cnt_2006,
 ROUND(AVG(year_2007),2) AS avg_eq_cnt_2007,
 ROUND(AVG(year_2008),2) AS avg_eq_cnt_2008,
 ROUND(AVG(year_2009),2) AS avg_eq_cnt_2009,
 ROUND(AVG(year_2010),2) AS avg_eq_cnt_2010,
 ROUND(AVG(year_2011),2) AS avg_eq_cnt_2011,
 ROUND(AVG(year_2012),2) AS avg_eq_cnt_2012
FROM earthquakes
WHERE magnitude <> 'No Magnitude';

Output:

Though the output is not too bad, it is not very easy to follow either. The better way could be having the data in only 3 columns showing the year, magnitude, and number of earthquakes in separate rows. Though it takes more effort to do such conversion, the outcome is worth it. The horizontal to vertical pivoting in our case will take three steps:

  1. First we need to create a new table that lists all of the columns from the original table, i.e. earthquakes, as rows.

-- STEP 1
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
             (2007),(2008),(2009),(2010),(2011),(2012)) v(year);
  1. Then cross join it with the original table to create an expanded view. Notice that each row in the earthquakes table is replicated 13 times; the resulting table has 130 rows (13 yr x 10 magnitude categories) and 130rows x 16cols = 2080 records

SELECT years.*,
       e.*
FROM earthquakes e
-- STEP 2
CROSS JOIN (
  -- STEP 1
  SELECT year
  FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
  (2007),(2008),(2009),(2010),(2011),(2012)) v(year)
       ) years;
  1. Finally, we use a CASE statement that pulls data from the correct column in the original table given the value in the year column:

SELECT 
  years.*,
  e.magnitude,
  -- STEP 3
  CASE year -- important!
         WHEN 2000 THEN year_2000
         WHEN 2001 THEN year_2001
         WHEN 2002 THEN year_2002
         WHEN 2003 THEN year_2003
         WHEN 2004 THEN year_2004
         WHEN 2005 THEN year_2005
         WHEN 2006 THEN year_2006
         WHEN 2007 THEN year_2007
         WHEN 2008 THEN year_2008
         WHEN 2009 THEN year_2009
         WHEN 2010 THEN year_2010
         WHEN 2011 THEN year_2011
         WHEN 2012 THEN year_2012
         ELSE NULL 
  END AS earthquake_cnt
FROM earthquakes e
-- STEP 2
CROSS JOIN (
    -- STEP 1
    SELECT year
    FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
           (2007),(2008),(2009),(2010),(2011),(2012)) v(year)
       ) years

And our table is ready! Below is an excerpt from the query:

Using a VIEW or by storing our new table in database, we can answer the question. above. The following query stores the new table as earthquakes_vertical:

SELECT year, magnitude, earthquake_cnt INTO earthquakes_vertical 
FROM (
	SELECT years.*,
       e.magnitude,
	   -- STEP 3
       CASE year -- important!
         WHEN 2000 THEN year_2000
         WHEN 2001 THEN year_2001
         WHEN 2002 THEN year_2002
         WHEN 2003 THEN year_2003
         WHEN 2004 THEN year_2004
         WHEN 2005 THEN year_2005
         WHEN 2006 THEN year_2006
         WHEN 2007 THEN year_2007
         WHEN 2008 THEN year_2008
         WHEN 2009 THEN year_2009
         WHEN 2010 THEN year_2010
         WHEN 2011 THEN year_2011
         WHEN 2012 THEN year_2012
         ELSE NULL 
		END AS earthquake_cnt
  FROM earthquakes e
 -- STEP 2
 CROSS JOIN (
	 -- STEP 1
       SELECT year
         FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
                      (2007),(2008),(2009),(2010),(2011),(2012)) v(year)
       ) years
) t;

It is time to answer the question:

SELECT 
	year, 
	ROUND(AVG(earthquake_cnt),2) avg_earthquake_cnt
FROM earthquakes_vertical 
WHERE magnitude <> 'No Magnitude'
GROUP BY 1
ORDER BY 1;

Output:

The same result, but in a better format!

4. AGGREAGATION vs ARITHMETIC vs PIVOTING

Let's combine what we've seen so far. The following query displays the number of players in each state, based on their year in college, in terms of FR, SO, JR, and SR, in separate columns and another column that sums the the total number of players. Results are ordered by states that have the most players to the least number of players.

SELECT 
    state, 
    COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_cnt, 
    COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_cnt, 
    COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_cnt, 
    COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_cnt, 
    (COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) + 
    COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) + 
    COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) + 
    COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END)) AS total_players 
FROM college_football_players 
GROUP BY 1 
ORDER BY 6 DESC;

The arithmetic shown in the last column is redundant, simply because counting all rows based on the states will suffice. Therefore, last calculation can be replaced by the following line as long as we aggregate on state:

COUNT(*) AS total_players

The query becomes:

SELECT 
    state, 
    COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_cnt, 
    COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_cnt, 
    COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_cnt, 
    COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_cnt, 
    COUNT(*) AS total_players
FROM college_football_players 
GROUP BY 1 
ORDER BY total_players DESC;

5. DISTINCT Values

In real-world scenarios, you will probably need to perform several preliminary queries to be able find out the best way for answering your initial question. Looking at the unique values on each column can help identify how you might want to group or filter the data. The following query counts the number of unique months per year:

SELECT 
    year, 
    COUNT(DISTINCT month) as cnt_unique_mths 
FROM aapl_historical_stock_price 
GROUP BY 1 ORDER BY 1;

DISTINCT is only added once in the SELECT clause, i.e. there is no need to add it for each column. The results will show all of the unique values for those columns

6. CASTING: Changing Data Types

Can achieve this with two different type of syntax:

  • CAST(column_name AS integer)

  • column_name::integer

which produce the same result.

replace integer with any other data type that would make sense for that column—all values in a given column must fit with the new data types.

column_name::varchar

7. JOINs vs Subquery

-- JOIN
SELECT *
FROM playlist p1
JOIN ( SELECT DISTINCT date
          FROM playlist
          ORDER BY date DESC
          LIMIT 5
       ) p2
ON p1.date = p2.date

-- Subquery
SELECT *
FROM playlist
WHERE date IN (SELECT DISTINCT date
                 FROM playlist
                ORDER BY date DESC
                LIMIT 5)

8. COALESCE: Dealing with NULL values:

The COALESCE function is used in SQL to return the first non-null value in a list of expressions. This is particularly useful for dealing with scenarios where data may be incomplete, and you want to use a backup value when nulls are present.

Syntax:

-- replace NULLs with numeric or string data
COALESCE(column2Bupdated, value2add) -- (sales, 0) or (content, 'N/A')F

Example usage:

SELECT 
    COALESCE(column1, column2, 'default_value') AS result
FROM my_table;

In this example, result will be the value of column1 if it's not null, otherwise column2, and if both are null, it will be 'default value'. COALESCE is used especially for numerical datasets and when performing Outer JOINs to replace nulls with 0s due to unmatched rows.

Last updated