SQL Tips

1. AGGREGATION

Aggregations occur only vertically. If you perform a calculation across rows, you will simply do basic arithmetic:

# Aggregation
SELECT 
    SUM(sneakers_sales) AS total_sneakers_sales_2022
FROM sales
WHERE sales_year = 2022 


# Arithmetic
SELECT 
    sneakers_sales + running_shoes_sales AS shoe_sales_by_month
FROM sales
WHERE sales_year = 2022 

Note: You would not need to worry as much about the presence of NULLs with SUM as you would with COUNT, as SUM treats nulls as 0.

AVG can only be used with numerical data and ignores NULLs completely.

The difference can be easier to understood in the context of an example:

-- Query 1
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price;
-- NOTE: avg_low1 is not equal to avg_low2

-- Query 2
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(*) as avg_low2 -- Counting all rows
FROM aapl_historical_stock_price
WHERE low IS NOT NULL;
-- NOTE: avg_low1 is equal to avg_low2

-- Query 3
SELECT 
    AVG(low) as avg_low1, 
    SUM(low)/COUNT(low) as avg_low2 -- Counting only rows in column 'low'
FROM aapl_historical_stock_price
-- NOTE: avg_low1 is equal to avg_low2

COUNT(*) vs COUNT(1)

You might have seen (or may see) usage of COUNT(1) but not COUNT(*)? Is there a difference between the two? NO, they’re exactly the same. (You can even use COUNT(999) and will get the same answer! Inside the COUNT() function the numbers are not treated as columns like they do with GROUP BY, HAVING, ORDER BY clauses). However, It is a better practice to use COUNT(*), as the COUNT(1) may cause confusion and it’s more commonly used

2. CASE

  1. Goes in the SELECT clause (can also be in GROUP BY)

  2. Must include: WHEN, THEN, and END. ELSE is optional.

  3. Multiple conditional statements may be added between WHEN and THEN using AND and OR.

  4. Multiple WHEN statements, as well as an ELSE statement can be used to deal with any unaddressed conditions.

Create CASE statements that don't overlap.

A. CASE and AGGREGATION

The following query shows the number of songs breakdown by the a group's type. If the group consists of only one singer it is considered as 'Single', if it is a band then it is 'Band', and if it is a collaborative work of various artists, then it is categorized and 'Collaboration'.

Similar task with Apple's stock prices:

3. PIVOTING

Pivoting is displaying a dataset horizontally that is originally in a vertical structure, or vice-versa.

3.1. VERTICAL to HORIZONTAL (Rows to Columns)

Horizontally re-orienting, i.e. pivoting, the above query :

Transitioning from horizontal to vertical orientation can be a substantially more difficult depending on the circumstances.

Another example school names. We will be displaying the number of schools with respect to their first letter in their names:

We can pivot the query as:

3.2. HORIZONTAL TO VERTICAL (Columns to Rows)

Considering the structure of horizontal tables, we might not be able to answer higher level questions like what is the average sales of a certain group of products or the average number of earthquakes for a given time period. Let s put the latter case in an example. Consider we have a table called earthquakes and we would like to find the average number of earthquakes occurred on a year between 2000 and 2012:

Earthquakes Table

We can answer the question as follows given the current structure of the table:

Output:

Though the output is not too bad, it is not very easy to follow either. The better way could be having the data in only 3 columns showing the year, magnitude, and number of earthquakes in separate rows. Though it takes more effort to do such conversion, the outcome is worth it. The horizontal to vertical pivoting in our case will take three steps:

  1. First we need to create a new table that lists all of the columns from the original table, i.e. earthquakes, as rows.

  1. Then cross join it with the original table to create an expanded view. Notice that each row in the earthquakes table is replicated 13 times; the resulting table has 130 rows (13 yr x 10 magnitude categories) and 130rows x 16cols = 2080 records

  1. Finally, we use a CASE statement that pulls data from the correct column in the original table given the value in the year column:

And our table is ready! Below is an excerpt from the query:

Using a VIEW or by storing our new table in database, we can answer the question. above. The following query stores the new table as earthquakes_vertical:

It is time to answer the question:

Output:

The same result, but in a better format!

4. AGGREAGATION vs ARITHMETIC vs PIVOTING

Let's combine what we've seen so far. The following query displays the number of players in each state, based on their year in college, in terms of FR, SO, JR, and SR, in separate columns and another column that sums the the total number of players. Results are ordered by states that have the most players to the least number of players.

The arithmetic shown in the last column is redundant, simply because counting all rows based on the states will suffice. Therefore, last calculation can be replaced by the following line as long as we aggregate on state:

COUNT(*) AS total_players

The query becomes:

5. DISTINCT Values

In real-world scenarios, you will probably need to perform several preliminary queries to be able find out the best way for answering your initial question. Looking at the unique values on each column can help identify how you might want to group or filter the data. The following query counts the number of unique months per year:

DISTINCT is only added once in the SELECT clause, i.e. there is no need to add it for each column. The results will show all of the unique values for those columns

6. CASTING: Changing Data Types

Can achieve this with two different type of syntax:

  • CAST(column_name AS integer)

  • column_name::integer

which produce the same result.

replace integer with any other data type that would make sense for that column—all values in a given column must fit with the new data types.

column_name::varchar

7. JOINs vs Subquery

8. COALESCE: Dealing with NULL values:

The COALESCE function is used in SQL to return the first non-null value in a list of expressions. This is particularly useful for dealing with scenarios where data may be incomplete, and you want to use a backup value when nulls are present.

Syntax:

Example usage:

In this example, result will be the value of column1 if it's not null, otherwise column2, and if both are null, it will be 'default value'. COALESCE is used especially for numerical datasets and when performing Outer JOINs to replace nulls with 0s due to unmatched rows.

Last updated

Was this helpful?