Aggregations occur only vertically. If you perform a calculation across rows, you will simply do basic arithmetic:
# AggregationSELECTSUM(sneakers_sales) AS total_sneakers_sales_2022FROM salesWHERE sales_year =2022# ArithmeticSELECT sneakers_sales + running_shoes_sales AS shoe_sales_by_monthFROM salesWHERE sales_year =2022
Note: You would not need to worry as much about the presence of NULLs with SUM as you would with COUNT, as SUM treats nulls as 0.
AVG can only be used with numerical data and ignores NULLs completely.
The difference can be easier to understood in the context of an example:
-- Query 1SELECTAVG(low) as avg_low1, SUM(low)/COUNT(*) as avg_low2 -- Counting all rowsFROM aapl_historical_stock_price;-- NOTE: avg_low1 is not equal to avg_low2-- Query 2SELECTAVG(low) as avg_low1, SUM(low)/COUNT(*) as avg_low2 -- Counting all rowsFROM aapl_historical_stock_priceWHERE low IS NOT NULL;-- NOTE: avg_low1 is equal to avg_low2-- Query 3SELECTAVG(low) as avg_low1, SUM(low)/COUNT(low) as avg_low2 -- Counting only rows in column 'low'FROM aapl_historical_stock_price-- NOTE: avg_low1 is equal to avg_low2
COUNT(*) vs COUNT(1)
You might have seen (or may see) usage of COUNT(1) but not COUNT(*)? Is there a difference between the two? NO, they’re exactly the same. (You can even use COUNT(999) and will get the same answer! Inside the COUNT() function the numbers are not treated as columns like they do with GROUP BY, HAVING, ORDER BY clauses). However, It is a better practice to use COUNT(*), as the COUNT(1) may cause confusion and it’s more commonly used
SELECTMAX(close-open) AS highest_single_day_increase FROM aapl_historical_stock_price;
2. CASE
Goes in the SELECT clause (can also be in GROUP BY)
Must include: WHEN, THEN, and END. ELSE is optional.
Multiple conditional statements may be added between WHEN and THEN using AND and OR.
Multiple WHEN statements, as well as an ELSE statement can be used to deal with any unaddressed conditions.
Create CASE statements that don't overlap.
SELECTclose,CASEWHENclose>500THEN'over 500'WHENclose>250ANDclose<=500THEN'251-500'WHENclose>100ANDclose<=250THEN'101-250'ELSE'100 or under'ENDAS close_price_groupFROM aapl_historical_stock_price
A. CASE and AGGREGATION
The following query shows the number of songs breakdown by the a group's type. If the group consists of only one singer it is considered as 'Single', if it is a band then it is 'Band', and if it is a collaborative work of various artists, then it is categorized and 'Collaboration'.
SELECTCASEWHEN group_name NOT ILIKE '%BAND%'AND group_name NOT ILIKE '%feat%'AND group_name NOT ILIKE '%with%'AND group_name NOT ILIKE '%and%'AND group_name !='Foster The People'AND group_name !='The Black Eyed Peas'AND group_name !='LMFAO'AND group_name !='OneRepublic'AND group_name !='Hot Chelle Rae'AND group_name !='Thompson Square'THEN'Single'WHEN group_name ILIKE '%BAND%'OR group_name ='Foster The People'OR group_name ='The Black Eyed Peas'OR group_name ='LMFAO'OR group_name ='OneRepublic'OR group_name ='Hot Chelle Rae'OR group_name ='Thompson Square'THEN'Band'ELSE'Collaborations'ENDAS group_type,COUNT(DISTINCT group_name) AS num_artistsFROM billboard_top_100_year_end tWHEREyear=2011GROUP BY1;
Similar task with Apple's stock prices:
SELECTCASEWHENclose>500THEN'over 500'WHENclose>250ANDclose<=500THEN'251-500'WHENclose>100ANDclose<=250THEN'101-250'ELSE'100 or under'ENDAS close_price_group, COUNT(*) AS countFROM aapl_historical_stock_price GROUP BY1;
3. PIVOTING
Pivoting is displaying a dataset horizontally that is originally in a vertical structure, or vice-versa.
3.1. VERTICAL to HORIZONTAL (Rows to Columns)
Horizontally re-orienting, i.e. pivoting, the above query :
SELECTCOUNT(CASEWHENclose>500THEN1ELSENULLEND) AS cnt_over500,COUNT(CASEWHENclose>250ANDclose<=500THEN1ELSENULLEND) AS cnt_251_500,COUNT(CASEWHENclose>100ANDclose<=250THEN1ELSENULLEND) AS cnt_101_250,COUNT(CASEWHENclose<=100THEN1ELSENULLEND) AS cnt_100_underFROM aapl_historical_stock_price;
Transitioning from horizontal to vertical orientation can be a substantially more difficult depending on the circumstances.
Another example school names. We will be displaying the number of schools with respect to their first letter in their names:
SELECTCASEWHEN school_name <'n'THEN'A-M'WHEN school_name >='n'THEN'N-Z'ELSENULLENDAS school_name_group, COUNT(*) AS school_cnt FROM usa_schools GROUP BY1ORDER BY1;
We can pivot the query as:
SELECTCOUNT(CASEWHENLEFT(school_name,1) BETWEEN'A'AND'M'THEN'A-M'ELSENULLEND) AS cnt_am, COUNT(CASEWHENLEFT(school_name,1) BETWEEN'N'AND'Z'THEN'N-Z'ELSENULLEND) AS cnt_nz FROM usa_schools;
3.2. HORIZONTAL TO VERTICAL (Columns to Rows)
Considering the structure of horizontal tables, we might not be able to answer higher level questions like what is the average sales of a certain group of products or the average number of earthquakes for a given time period. Let s put the latter case in an example. Consider we have a table called earthquakesand we would like to find the average number of earthquakes occurred on a year between 2000 and 2012:
We can answer the question as follows given the current structure of the table:
SELECTROUND(AVG(year_2000),2) AS avg_eq_cnt_2000,ROUND(AVG(year_2001),2) AS avg_eq_cnt_2001,ROUND(AVG(year_2002),2) AS avg_eq_cnt_2002,ROUND(AVG(year_2003),2) AS avg_eq_cnt_2003,ROUND(AVG(year_2004),2) AS avg_eq_cnt_2004,ROUND(AVG(year_2005),2) AS avg_eq_cnt_2005,ROUND(AVG(year_2006),2) AS avg_eq_cnt_2006,ROUND(AVG(year_2007),2) AS avg_eq_cnt_2007,ROUND(AVG(year_2008),2) AS avg_eq_cnt_2008,ROUND(AVG(year_2009),2) AS avg_eq_cnt_2009,ROUND(AVG(year_2010),2) AS avg_eq_cnt_2010,ROUND(AVG(year_2011),2) AS avg_eq_cnt_2011,ROUND(AVG(year_2012),2) AS avg_eq_cnt_2012FROM earthquakesWHERE magnitude <>'No Magnitude';
Output:
Though the output is not too bad, it is not very easy to follow either. The better way could be having the data in only 3 columns showing the year, magnitude, and number of earthquakes in separate rows. Though it takes more effort to do such conversion, the outcome is worth it. The horizontal to vertical pivoting in our case will take three steps:
First we need to create a new table that lists all of the columns from the original table, i.e. earthquakes, as rows.
Then cross join it with the original table to create an expanded view. Notice that each row in the earthquakes table is replicated 13 times; the resulting table has 130 rows (13 yr x 10 magnitude categories) and 130rows x 16cols = 2080 records
And our table is ready! Below is an excerpt from the query:
Using a VIEW or by storing our new table in database, we can answer the question. above. The following query stores the new table as earthquakes_vertical:
SELECTyear, ROUND(AVG(earthquake_cnt),2) avg_earthquake_cntFROM earthquakes_vertical WHERE magnitude <>'No Magnitude'GROUP BY1ORDER BY1;
Output:
The same result, but in a better format!
4. AGGREAGATION vs ARITHMETIC vs PIVOTING
Let's combine what we've seen so far. The following query displays the number of players in each state, based on their year in college, in terms of FR, SO, JR, and SR, in separate columns and another column that sums the the total number of players. Results are ordered by states that have the most players to the least number of players.
SELECTstate, COUNT(CASEWHENyear='FR'THEN1ELSENULLEND) AS fr_cnt, COUNT(CASEWHENyear='SO'THEN1ELSENULLEND) AS so_cnt, COUNT(CASEWHENyear='JR'THEN1ELSENULLEND) AS jr_cnt, COUNT(CASEWHENyear='SR'THEN1ELSENULLEND) AS sr_cnt, (COUNT(CASEWHENyear='FR'THEN1ELSENULLEND) +COUNT(CASEWHENyear='SO'THEN1ELSENULLEND) +COUNT(CASEWHENyear='JR'THEN1ELSENULLEND) +COUNT(CASEWHENyear='SR'THEN1ELSENULLEND)) AS total_players FROM college_football_players GROUP BY1ORDER BY6DESC;
The arithmetic shown in the last column is redundant, simply because counting all rows based on the states will suffice. Therefore, last calculation can be replaced by the following line as long as we aggregate on state:
COUNT(*) AS total_players
The query becomes:
SELECTstate, COUNT(CASEWHENyear='FR'THEN1ELSENULLEND) AS fr_cnt, COUNT(CASEWHENyear='SO'THEN1ELSENULLEND) AS so_cnt, COUNT(CASEWHENyear='JR'THEN1ELSENULLEND) AS jr_cnt, COUNT(CASEWHENyear='SR'THEN1ELSENULLEND) AS sr_cnt, COUNT(*) AS total_playersFROM college_football_players GROUP BY1ORDER BY total_players DESC;
5. DISTINCT Values
In real-world scenarios, you will probably need to perform several preliminary queries to be able find out the best way for answering your initial question. Looking at the unique values on each column can help identify how you might want to group or filter the data. The following query counts the number of unique months per year:
SELECTyear, COUNT(DISTINCTmonth) as cnt_unique_mths FROM aapl_historical_stock_price GROUP BY1ORDER BY1;
DISTINCT is only added once in theSELECTclause, i.e. there is no need to add it for each column. The results will show all of the unique values for those columns
6. CASTING: Changing Data Types
Can achieve this with two different type of syntax:
CAST(column_name AS integer)
column_name::integer
which produce the same result.
replace integer with any other data type that would make sense for that column—all values in a given column must fit with the new data types.
The COALESCE function is used in SQL to return the first non-null value in a list of expressions. This is particularly useful for dealing with scenarios where data may be incomplete, and you want to use a backup value when nulls are present.
Syntax:
-- replace NULLs with numeric or string dataCOALESCE(column2Bupdated, value2add) -- (sales, 0) or (content, 'N/A')F
Example usage:
SELECTCOALESCE(column1, column2, 'default_value') AS resultFROM my_table;
In this example, result will be the value of column1 if it's not null, otherwise column2, and if both are null, it will be 'default value'. COALESCE is used especially for numerical datasets and when performing Outer JOINs to replace nulls with 0s due to unmatched rows.