SQL Flashcards

(176 cards)

1
Q

SELECT

A

(SQL DML Clause) Selects data from specified columns.

SELECT
*
FROM customers

The above query will select all columns from the customer table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

SELECT DISTINCT

A

(SQL DML Clause) Removes duplicates & ensures that each value only appears once in the results.

SELECT DISTINCT Country
FROM customers

Select each distinct country from the customers table. If ‘Germany’ or another country shows up more than once, it will only be selected once.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

TOP

A

(SQL DML Clause) Used to limit your data. Restrict the number of rows.

SELECT TOP 3
FROM customers

Above query will select the top 3 customers (i.e., it will select customers with the customer id 1, 2, and 3)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

FROM

A

(SQL DML Clause) Specifies where the data comes from (i.e., which table)

SELECT
*
FROM Customers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

WHERE

A

(SQL DML Clause) Filters data based on a condition.

SELECT *
FROM customers
WHERE country =’Mexico’;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

GROUP BY

A

(SQL DML Clause) Splits the result set into groups of values. Commonly used with aggregate functions.

The GROUP BY clause is used to arrange data into groups with the help of aggregate functions.

The way that it works is, if a particular column has the same values in different rows then it will amalgamate these rows into a group

Syntax:
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s)

SELECT
COUNT(customerID),
Country
FROM Customers
GROUP BY Country

Output:
3 Argentina
2 Austria
2 Belgium

Group by vs Order by:
ORDER BY example:
SELECT
first_name, last_name
FROM actor
ORDER BY last_name, first_name;

Output
Christian Akroyd
Debbie Akroyd
Cuba Allen
Kim Allen
Meryl Allen
Jim Astaire

GROUP BY example:
SELECT count(film_actor.actor_id) AS num_of_films_appeared_in,
actor.actor_id,
actor.first_name,
actor.last_name
FROM actor
INNER JOIN film_actor
ON actor.actor_id = film_actor.actor_id
GROUP BY film_actor.actor_id;

Output:
2 1 Penelope Guiness
25 2 Nick Wahlberg
22 3 Ed Chase

GROUP BY & ORDER BY together:
SELECT count(film_actor.actor_id) AS num_off_films_appeared_in,
actor.actor_id,
actor.first_name,
actor.last_name
FROM actor INNER JOIN film_actor
ON actor.actor_rid = film_actor.actor_id
GROUP BY film.actor.actor_id
ORDER BY num_of_films_appeared_in DESC;

Output:
42 107 Gina Degeneres
41 102 Walter Torn
40 198 Mary Keitel
39 181 Matthew Carrey

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Aggregate Functions

A

(SQL DML Clause) Aggregate functions include MIN(), MAX(), COUNT(), SUM(), AVG()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

HAVING

A

(SQL DML Clause) Used to filter aggregated data. *Can only use the HAVING function with the GROUP BY clause.

Syntax:
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s)

SELECT
COUNT(CustomerID),
Country
FROM Customers
GROUP BY Country
HAVING COUNT(Customer_ID) > 5

Output:
9 Brazil
11 France
11 Germany

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

ORDER BY

A

(SQL DML Clause) Sorts data. Use either ASC (lowest to highest value) or DESC (highest to lowest value). If you don’t specify ASC or DESC, ASC will be used; ASC is the default.

SELECT
*
FROM Products
ORDER BY Price

Output:
product_id product_name supplier_id category_id unit price
33 Geitost 15 4 500g 2.5

Group by vs Order by
ORDER BY example:
SELECT
first_name, last_name
FROM actor
ORDER BY last_name, first_name;

Output
Christian Akroyd
Debbie Akroyd
Cuba Allen
Kim Allen
Meryl Allen
Jim Astaire

GROUP BY example:
SELECT count(film_actor.actor_id) AS num_of_films_appeared_in,
actor.actor_id,
actor.first_name,
actor.last_name
FROM actor
INNER JOIN film_actor
ON actor.actor_id = film_actor.actor_id
GROUP BY film_actor.actor_id;

Output:
2 1 Penelope Guiness
25 2 Nick Wahlberg
22 3 Ed Chase

GROUP BY & ORDER BY together:
SELECT count(film_actor.actor_id) AS num_off_films_appeared_in,
actor.actor_id,
actor.first_name,
actor.last_name
FROM actor INNER JOIN film_actor
ON actor.actor_rid = film_actor.actor_id
GROUP BY film.actor.actor_id
ORDER BY num_of_films_appeared_in DESC;

Output:
42 107 Gina Degeneres
41 102 Walter Torn
40 198 Mary Keitel
39 181 Matthew Carrey

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

COUNT

A

(SQL DML Clause) Count the total number of something.

SELECT
COUNT(*)
FROM gold.dim_customers

Output:
18484

SELECT
*
FROM gold.dim_customers

Output:
All info from table (18484 rows, so 18484 customers IF each row represents one customer)

SELECT COUNT(DISTINCT (customerID) AS unique_customers
FROM gold.dim_customers

Output
Unique Customers
18484

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

SQL DML Clauses in Execute Order

A

1) FROM – Identify source tables and joins
2) WHERE – Filter individual rows (raw data)
3) GROUP BY – Put the rows in buckets
4) Aggregate functions – Calculate aggregations (SUM, COUNT, AVG, MIN, & MAX)
5) HAVING – Filter groups based on aggregated results
6) SELECT – Return chosen columns or expressions (can now include aggregates)
7) DISTINCT – Remove duplicate rows from the final result set.
8) ORDER BY – Sort the results
9) TOP/LIMIT – Return only a subset of rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

SQL DML Commands

A

INSERT
UPDATE
DELETE
TRUNCATE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

INSERT

A

(SQL DML Command)
Manipulate data (i.e., insert new rows or update rows).
2 Methods:
1) Manually
2) Use another table

Syntax:
1) Manually:
INSERT INTO table_name (col1, col2, col3, etc.)
VALUES (value1, value2, value3, etc.)

Using another table:
INSERT INTO destination_table (col1, col2, col3, etc.)
SELECT col1, col2, etc.
FROM source_table

Examples:

Manually:
INSERT INTO Customers (customer_name, contact_name, address, city, postal_code, country)
VALUES (‘cardinal’, ‘Tom B. Erichsen’, ‘871 Stanley St.’, ‘Frederick’, ‘20987’, ‘USA’)

Using another table:
INSERT INTO High_Value_Orders (order_id, customer_id, order_total)
SELECT order_id, customer_id, total_amount
FROM All_Orders

INSERT INTO New_Customers
SELECT
*
FROM Old_Customers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

UPDATE

A

(SQL DML Command)
Change the content of already existing rows

Syntax:
UPDATE table_name
SET col1 = value1, col2 = value2, etc.
WHERE condition;

Example:

UPDATE Customers
SET contact_name = ‘Alfred Schmidt’, City = ‘Frankfurt’
WHERE customer_id = 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

DELETE

A

(SQL DML Command)
Remove already existing row(s) inside your table.

Syntax:
DELETE
FROM table_name
WHERE condition

Example:
DELETE
FROM Customers
WHERE customer_name = ‘Alfred Futterkiste’

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

TRUNCATE

A

(SQL DML Command)
Use if you want to delete EVERYTHING from a table.
Clears all records from a table without removing its structure.

TRUNCATE is faster than DELETE, so use TRUNCATE when you’re working with large tables.

Example:
TRUNCATE TABLE Categories
This will delete the table categories

TRUNCATE TABLE users
Empties the users table while keeping its schema intact.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

AND

A

(SQL Logical Operator)
All conditions must be true

Syntax:
SELECT col1, col2, col3, etc.
FROM table_name
WHERE condition1 AND condition 2 AND condition3, etc.

Example:
SELECT
*
FROM Customers
WHERE country = ‘Spain’ AND customer_name LIKE ‘G%’

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

OR

A

(SQL Logical Operator)
At least one condition must be true

Syntax:
SELECT col1, col2, etc.
FROM table_name
WHERE condition1 OR condition2 OR condition3

Example:
SELECT
*
FROM Customers
WHERE country = ‘Germany’ OR country = ‘Spain’;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

NOT

A

(SQL Logical Operator)
Excludes matching values

Syntax:
SELECT col1, col2, etc.
FROM table_name
WHERE NOT condition;

Example:
SELECT
*
FROM Customers
WHERE NOT country = ‘Spain’;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

LIKE

A

(SQL Logical Operator)

Syntax:
SELECT col1, col2
FROM table_name
WHERE col LIKE pattern;

Example 1: Pull all records from Customer table where the customer’s name starts with ‘a’.
SELECT
*
FROM Customers
WHERE customer_name LIKE ‘a%’;

Example 2: Pull all records from the Customer table where where the city contains the letter ‘n’.
SELECT
*
FROM Customers
WHERE city LIKE ‘%L%’;

Example 3: Pull all records from the Customer table where the customer’s name ends with an ‘a’
SELECT
*
FROM Customers
WHERE CustomerName LIKE ‘%a’

Example 4: Return all customers that start with ‘La’
SELECT
*
FROM Customers
WHERE CustomerName LIKE ‘La%’;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

BETWEEN

A

(SQL Logical Operator)
Checks if a value is within a range.

Syntax:
SELECT column_name
FROM table_name
WHERE column_name BETWEEN value1 AND value2

Example:
SELECT
*
FROM Products
WHERE Price BETWEEN 10 AND 20

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

IN

A

(SQL Logical Operator)

Syntax:
SELECT column_name
FROM table_name
WHERE column_name IN (value1, value2, etc.);

Example:
SELECT
*
FROM Customers
WHERE country IN (‘Germany’, ‘France’, ‘UK’);

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

CREATE

A

(SQL DDL Command)
Defines new database objects like tables, indexes, or entire databases.

-CREATE DATABASE my_database: Creates a new database
-CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR (50), email VARCHAR (100): Defines a table with columns and data types.
-CREATE INDEX idx_email ON users (email): Generates an index for faster lookups.

Note: When making a new object, it will be empty (e.g., if you make a new table, it will be an empty table.)

Syntax:
CREATE TABLE table_name (
col1 datatype,
col2 datatype,
col3 datatype,
etc.
);

Example:
CREATE TABLE Persons (
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

ALTER

A

SQL DDL Command. Modifies existing database structures without affecting stored data.

So, it can edit the table (i.e., add a column, change the data type, etc.)

Syntax:
ALTER TABLE table_name
ADD column_name datatype;

Example:
ALTER TABLE Customers
ADD Email varchar(255);

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
DROP
Deletes database objects permanently. It can remove information from a table. -DROP TABLE users: Removes the users table and its data. -DROP INDEX idx_email ON users: Deletes an index. NOTE: If you delete a column, you will remove all of the data inside that column as well. Syntax: DROP TABLE table_name; Example: DROP TABLE Shippers;
26
SQL Logical Operators
AND BETWEEN IN LIKE NOT OR
27
SQL DDL Commands
CREATE ALTER DROP
28
SQL JOINs
No Join Inner Join Left Join Right Join Full Join
29
SMART Goals
Specific, Measurable, Actionable, Relevant, Time-Bound Specific: Do you currently use data to drive decisions in your business? If so, what kind(s) of data do you collect, and how do you use it? Measurable: Do you know what percentage of sales is from your top-selling products? Actionable: Are there business decisions or changes that you would make if you had the right information? For example, if you had information about how umbrella sales change with the weather, how would you use it? Relevant: How often do you review data from your business? Time-bound: Can you describe how data helped you make good decisions for your store(s) this past year?
30
Database Management System (DBMS)
Software that manages all of the requests that are put through to the database. (Requests may include people using SQL to request info from the database, applications using SQL to request info from the database, SQL requests to create data visuals from the database, etc.)
31
Server
The hardware. Like a powerful PC, lives 24/7. (physical machine where database(s) live) can host multiple databases (i.e., containers of your data)
32
Server, DBMS, SQL, Database
Server>DBMS>SQL>Database Server: Physical machine where the database lives DBMS: Software that manages the database SQL: Language used to speak to the database SQL: Language used to speak to the database
33
5 Types of Databases
* Relational Database: spreadsheets (i.e., columns and rows) w/ relationships linking them to each other. * Key-Value Database: Pairs of keys and value. (Think of it like a dictionary where you have a word (key) and definition (value)). * Column-Based Database: Group data by columns (as opposed to by row). This is an advanced database that handles large quantities of data. Main purpose is to search for data. * Graph Database: The relationship between objects. Main idea: how to connect my data points? Document Database: Data stored as entire documents. Structure of data is less important.
34
*Relational Database
spreadsheets (i.e., columns and rows) w/ relationships linking them to each other. NOTE: The relational database is called a SQL database (e.g., Microsoft SQL Server, MySQL, PostgreSQL)
35
Key-Value Database
Pairs of keys and value. (Think of it like a dictionary where you have a word (key) and definition (value)). A key-value database is called a No SQL database. Key-value databases include Redis, Amazon Dynamo DB)
36
Column-Based Database
Group data by columns (as opposed to by row). This is an advanced database that handles large quantities of data. Main purpose is to search for data. A column-based database is called a No SQL database. Column-based databases include Apache Cassandra, Amazon Redshift
37
Graph Database
The relationship between objects. Main idea: how to connect my data points? A graph database is called a No SQL database. Graph databases include Neo4j
38
Document Database
Data stored as entire documents. Structure of data is less important. A document database is called a No SQL database. Document databases include MangoDB
39
Server>Database>Schema
* Servers (physical machine where database(s) live) can host multiple databases (i.e., containers of your data) > * Each database can have multiple schemas (i.e., a logical container in which you can group up related objects)> ○ Example: You have hundreds of tables, put all tables that have to do with order in one schema & all orders that have to do with customers in another schema). * Each schema can have multiple objects (e.g., tables). ○ A table organizes data into columns (e.g.., ID, name, score, birthdate, etc.). (Each column may be called a field.) ○ The data is stored in each row. (A row may be called a record.) ○ Primary Key: every table has this. Necessary to have one unique identifier (e.g., each customer has a unique ID like 1 or 2 or 3). ○ Cell: singular value § Data Types: □ Numeric: Integer (1, 2, 30), Decimal (3.14, 100.50) □ String/Text: Char or Varchar ® Char: fixed (e.g., if you define it as 5 characters, it will go and reserve 5 characters from the space) ® Varchar: use when you want things to be more dynamic ® Date & Time: ◊ Date: '2025-10-30' Time: '09:30:00'
40
Data Definition Language (DDL)
used to define the way data is structured in a database. DDL consists of SQL commands that define and modify database structures. It sets up structures like tables and schemas, modifies their design, and removes them when they're no longer needed. When setting up a database, DDL commands define tables, indexes, and relationships. These commands modify structures and remove database objects. ○ Create: Defines new database objects like tables, indexes, or entire databases. § CREATE DATABASE my_database: Creates a new database § CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR (50), email VARCHAR (100): Defines a table with columns and data types. § CREATE INDEX idx_email ON users (email): Generates an index for faster lookups. ○ Alter: Modifies existing database structures without affecting stored data. § ALTER TABLE users ADD COLUMN phone VARCHAR (20): Adds a new column to the users table. § ALTER TABLE users MODIFY COLUMN name VARCHAR (100): Changes a column's data type. ○ Drop: Deletes database objects permanently. § DROP TABLE users: Removes the users table and its data. § DROP INDEX idx_email ON users: Deletes an index. ○ Truncate: Clears all records from a table without removing its structure. § TRUNCATE TABLE users: Empties the users table while keeping its schema intact.
41
Data Manipulation Language (DML)
used to manage data within a database. It's a component of SQL used to insert, retrieve, modify, and delete records within a database structure. a. INSERT data into a database/adds new rows to a table (e.g., you have an app that is generating a lot of data, you would use SQL to "insert" that data into the database). b. UPDATE already existing data in table/modifies existing data within a table c. DELETE already existing data from table/removes specific rows from a table. d. SELECT: Retrieves data from one or more tables (e.g., selects records from the customers table)
42
Data Query Language (DQL)
Responsible for reading/querying data from the database. In SQL, this corresponds to the SELECT
43
Data Control Language (DCL)
Responsible for the administrative tasks of controlling the database itself (e.g., granting and revoking database permissions for users).
44
DDL, DML, DQL, DCL
DDL: Data Definition Language. Defines database objects like tables and indexes DML: Data Manipulation Language. Manages data within those structures DQL: Data Query Language. Queries data from the database. DCL: Data Control Language. Responsible for admin tasks related to the database (i.e., granting and revoking database permissions for users)
45
GROUP BY vs. ORDER BY (examples)
ORDER BY example: SELECT first_name, last_name FROM actor ORDER BY last_name, first_name; Output Christian Akroyd Debbie Akroyd Cuba Allen Kim Allen Meryl Allen Jim Astaire GROUP BY example: SELECT count(film_actor.actor_id) AS num_of_films_appeared_in, actor.actor_id, actor.first_name, actor.last_name FROM actor INNER JOIN film_actor ON actor.actor_id = film_actor.actor_id GROUP BY film_actor.actor_id; Output: 2 1 Penelope Guiness 25 2 Nick Wahlberg 22 3 Ed Chase GROUP BY & ORDER BY together: SELECT count(film_actor.actor_id) AS num_off_films_appeared_in, actor.actor_id, actor.first_name, actor.last_name FROM actor INNER JOIN film_actor ON actor.actor_rid = film_actor.actor_id GROUP BY film.actor.actor_id ORDER BY num_of_films_appeared_in DESC; Output: # films actor_id first_name last_name 42 107 Gina Degeneres 41 102 Walter Torn 40 198 Mary Keitel 39 181 Matthew Carrey
46
47
SQL Set Operators
UNION: Merges all unique rows from two or more SELECT statements, eliminating duplicates. UNION ALL: Merges all rows from two or more SELECT statements, keeping duplicates. INTERSECT: Returns only the rows that appear in both SELECT statements. EXCEPT: Returns rows from the first SELECT statement that don't appear in the second.
48
UNION
A SQL Set Operator Merges all unique rows from two or more SELECT statements, eliminating duplicates. For example, suppose we have two tables, employees and contractors, each with similar columns such as contractors, department, and salary. For learning purposes, let’s consider these two dummy tables: name department salary Alice Marketing 65000 Bob Sales 70000 Carol Engineering 80000 John HR 55000 Employees name department salary David Marketing 60000 Eva Sales 68000 Carol Engineering 75000 We can combine the results from both tables using the following command: -- Using INTERSECT to find common employees SELECT name, department FROM employees INTERSECT SELECT name, department FROM contractors; This query selects the name, department, and salary columns from both the employees and contractors tables and combines them into a single result set. The UNION operator automatically removes duplicate rows from the final result set. name department salary Alice Marketing 65000 Bob Sales 70000 Carol Engineering 80000 John HR 55000 David Marketing 60000 Eva Sales 68000 Notice that Carol, who appears in both tables, is only listed once in the result. If we wanted to keep both instances of Carol (with her different salaries), we would use UNION ALL. NOTE: the UNION operator does not remove NULL values. If a column contains NULL values in one result set and non-NULL values in the corresponding column of another result set, the NULL values will be retained in the final result set produced by the UNION operator. If we want to include NULL values in the result set and prevent their removal by the UNION operator, we can use the UNION ALL operator instead. This operator combines the results of multiple SELECT queries, including all rows from each result set, regardless of whether they are duplicates or contain NULL values.
49
50
INTERSECT
The INTERSECT operator returns only the rows that appear in both result sets. Think of it as finding the people who belong to both groups. -- Using INTERSECT to find common employees SELECT name, department FROM employees INTERSECT SELECT name, department FROM contractors; Output: name department Carol Engineering Note: The INTERSECT operator handles NULL values based on standard comparison rules, considering NULL values equal when comparing corresponding columns. It also results in an empty set when dealing with empty result sets. In other words, if a NULL value is present in one result set and the corresponding column in the other result set contains a non-NULL value, the rows are not considered equal – they will not be included in the intersection result. Additionally, If one of the result sets provided to the INTERSECT operator is empty (i.e., it contains no rows), the overall result of the INTERSECT operation will also be empty since there are no common rows between an empty set and any other set.
51
EXCEPT
SQL Set Operator The EXCEPT operator retrieves the rows present in the first result set but not in the second. For example, let’s say we execute the following query: -- Using EXCEPT to find employees who are not contractors SELECT name, department, salary FROM employees EXCEPT SELECT name, department, salary FROM contractors; The name, department, and salary columns are selected from the employees table and return only the rows that do not exist in the contractors table. name department salary Alice Marketing 65000 Bob Sales 700 John HR 55000 Note: The EXCEPT operator also follows standard comparison rules for handling NULL values. Its behavior with empty result sets results in an empty set if the first result set is empty or includes all rows from the first result set if the second result set is empty.
52
Rules of Set Operators
Set Operators: Union, Intersect, Except Rule #1: ORDER BY can only be used once Rule #2: Same number of columns Rule #3: Matching data types Rule #4: Same order of columns Rule #5: First query controls aliases Rule #6: Map correct columns
53
Find and return records with even values. Find and return records with odd values.
To find and return values with even or odd values: check the remainder when you divide by 2. -If the remainder is 0, it's an even number. -If not, it's an odd number. --Find rows where a specified column has even values (MS SQL Server): SELECT * FROM table_name WHERE column_name % 2 = 0; Find rows where a specified column has odd values (MS SQL Server): SELECT * FROM table_name WHERE column_name % 2 <> 0;
54
PRACTICE: Query the two cities in STATION with the shortest and longest CITY names, as well as their respective lengths (i.e.: number of characters in the name). If there is more than one smallest or largest city, choose the one that comes first when ordered alphabetically. The STATION table is described as follows: Field Type ID Number CITY
( SELECT CITY, LENGTH(CITY) AS NAME_LENGTH FROM STATION ORDER BY LENGTH(CITY) ASC, CITY ASC LIMIT 1 ) UNION ( SELECT CITY, LENGTH(CITY) AS NAME_LENGTH FROM STATION ORDER BY LENGTH(CITY) DESC, CITY ASC LIMIT 1 ); LENGTH(CITY): gets the number of characters in the city name. The first subquery: -Orders by LENGTH(CITY) ascending (shortest first), then by CITY alphabetically. -LIMIT 1 gets the shortest city name. The second subquery: -Orders by LENGTH(CITY) descending (longest first), then by CITY alphabetically. -LIMIT 1 gets the longest city name. -UNION ensures both rows appear in the result (duplicates are not an issue here since lengths differ).
55
PRACTICE: Query the list of CITY names from STATION which have vowels (i.e., a, e, i, o, and u) as both their first and last characters. Your result cannot contain duplicates.
SELECT DISTINCT CITY FROM STATION WHERE LOWER(LEFT(CITY, 1)) IN ('a', 'e', 'i', 'o', 'u') AND LOWER(RIGHT(CITY, 1)) IN ('a', 'e', 'i', 'o', 'u'); Explanation: DISTINCT: removes duplicates. LOWER(...): ensures the comparison is case-insensitive. LEFT(CITY, 1): gets the first character. RIGHT(CITY, 1): gets the last character. The IN clause checks whether the first and last characters are vowels.
56
LEFT()
LEFT(column_name, character_number) Example: LEFT(CITY, 1): Gets the first character of the City name Example 2: LEFT(CITY, 1) IN ('a', 'e', 'i', 'o', 'u') Output: Pulls any names starting with a vowel. Example 3: LOWER(LEFT(CITY, 1)) IN ('a', 'e', 'i', 'o', 'u') Output: Pulls any names starting with a vowel after ensuring that the comparison is case sensitive (i.e., makes all characters lowercase so as to not inadvertently exclude pertinent values).
56
RIGHT()
RIGHT(column_name, character_number) Example 1: RIGHT(CITY, 1) Output: Gets the last character of the city name Example 2: RIGHT(CITY, 1) IN ('a', 'e', 'i', 'o', 'u') Output: Pulls any names ending in vowel. Example 3: LOWER(RIGHT(CITY, 1) IN ('a', 'e', 'i', 'o', 'u') ) Output: Pulls any names starting with a vowel after ensuring that the comparison is case sensitive (i.e., makes all characters lowercase so as to not inadvertently exclude pertinent values).
57
LOWER()
lowercases values Example: LOWER(RIGHT(CITY, 1) IN ('a', 'e', 'i', 'o', 'u') ) Output: Pulls any names starting with a vowel after ensuring that the comparison is case sensitive (i.e., makes all characters lowercase so as to not inadvertently exclude pertinent values).
58
SQL Logic Checklist: AND vs OR
AND can be too strict. In the below example query, using AND (i.e., AND NOT LOWER) only produces results in which the city name does not start with a vowel or end with a vowel. Whereas, using OR (i.e., OR NOT LOWER) gives an output of city names that do not start with a vowel AND city names that do not end with a vowel. Example: Query the list of CITY names from STATION that either do not start with vowels or do not end with vowels. Your result cannot contain duplicates. Solution: SELECT DISTINCT CITY FROM STATION WHERE NOT LOWER(LEFT(CITY, 1)) IN ('a', 'e', 'i', 'o', 'u') OR NOT LOWER(RIGHT(CITY, 1)) IN ('a', 'e', 'i', 'o', 'u') ORDER BY CITY; SQL Logic Checklist: AND vs OR 1. Rephrase the condition in plain English. "I want cities that do not start with a vowel OR do not end with a vowel." → Clearly calls for OR, not AND. 2. Use a simple Venn diagram in your mind. AND = overlap (both conditions must be true). ✅ Great for narrowing results OR = union (either condition can be true). ✅ Great for widening results
59
SQL Practice: Query the Name of any student in STUDENTS who scored higher than Marks. Order your output by the last three characters of each name. If two or more students both have names ending in the same last three characters (i.e.: Bobby, Robby, etc.), secondary sort them by ascending ID.
SELECT Name FROM STUDENTS WHERE Marks > 75 ORDER BY LOWER(RIGHT(Name, 3)), ID;
60
SQL Practice: Query the following two values from the STATION table: The sum of all values in LAT_N rounded to a scale of decimal places. The sum of all values in LONG_W rounded to a scale of decimal places.
SELECT ROUND(SUM(LAT_N), 2) AS sum_lat_n, ROUND(SUM(LONG_W), 2) AS sum_long_w FROM STATION
61
ROUND()
The ROUND() function rounds a number to a specified number of decimal places. Syntax: ROUND(number, decimals, operation/function) Parameter Description -number Required. The number to be rounded. -decimals Required. The number of decimal places to round number to. -operation/function Optional. It decides the rounding direction for values exactly in the middle of two numbers. By default, its value is 0. Any other value causes the function to truncate or cut off the input without rounding. Example 1: Round the number to 2 decimal places, and also use the operation parameter: SELECT ROUND(235.415, 2, 1) AS RoundValue; Output: 235.410 Example 2: Round the number to -1 decimal place: SELECT ROUND(235.415, -1) AS RoundValue; Output: 240.000 Example 3: SELECT ROUND(SUM(LAT_N), 2) AS sum_lat_n, ROUND(SUM(LONG_W), 2) AS sum_long_w FROM STATION Output: 42850.04 47381.48
62
TRUNCATE() function
The TRUNCATE() function truncates a number to the specified number of decimal places. Syntax: TRUNCATE(number, decimals) Example: SELECT TRUNCATE(345.156, 0); Output: 345
63
SQL Practice: Query the greatest value of the Northern Latitudes (LAT_N) from STATION that is less than . Truncate your answer to decimal places.
Option 1 (using aggregate; more concise/cleaner): SELECT TRUNCATE(MAX(LAT_N), 4) AS rounded_lat_n FROM STATION WHERE LAT_N < 137.2345; Option 2: SELECT TRUNCATE(LAT_N, 4) AS rounded_lat_n FROM STATION WHERE LAT_N < 137.2345 ORDER BY LAT_N DESC LIMIT 1;
64
SQL Practice: Query the smallest Northern Latitude (LAT_N) from STATION that is greater than . Round your answer to decimal places.
SELECT ROUND(LAT_N, 4) FROM STATION WHERE LAT_N > 38.7780 ORDER BY LAT_N ASC LIMIT 1;
65
ABS()
The ABS() function returns the absolute (positive) value of a number. Why is this needed? Any distance metric (e.g., the Manhattan Distance), must always be non‑negative. Why can't a distance metric be negative? -Distance measures magnitude only (i.e., how much), not direction (i.e., which way). -A negative number would imply "distance traveled backward," which has no meaning in this context. -Without ABS, you’re just summing signed differences, which can cancel out and give zero or even negative "distance," which is mathematically wrong. Syntax: ABS(number) Parameter Description number Required. A numeric value Example: SELECT ABS(-243.5); Output: 243.5
66
Define Manhattan Distance & Solve the following prompt: Consider P1(a, b) and P2 (c,d) to be two points on a 2D plane. -a happens to equal the minimum value in Northern Latitude (LAT_N in STATION). -b happens to equal the minimum value in Western Longitude (LONG_W in STATION). -c happens to equal the maximum value in Northern Latitude (LAT_N in STATION). -d happens to equal the maximum value in Western Longitude (LONG_W in STATION). Query the Manhattan Distance between points P1 and P2 and round it to a scale of 4 decimal places. Manhattan Distance Definition: The distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|.
Shortest distance between 2 points. (Based on the grid-like system of Manhattan.) d = (b1- c1) + (b2 - c2) SQL Query to Solve the Provided Prompt: /* x1 or a: MIN(LAT_N) x2 or c: (MAX(LAT_N) y1 or b: MIN(LONG_W) y2 or d: MAX(LONG_W) P1 (x1, y1) P1 (a, b) (MIN(LAT_N), MIN(LONG_W)) AS p1 P2(X2, Y2) P2 (c,d) ((MAX(LAT_N), MAX(LONG_W))) AS p2 Manhattan Distance: (x1 - x2) + (y1 - y2) (MIN(LAT_N) - (MAX(LAT_N)) + (MIN(LONG_W) - MAX(LONG_W)) */ SELECT ROUND((MIN(LAT_N) - MAX(LAT_N)) + (MIN(LONG_W) - MAX(LONG_W)), 4) AS manhattan_distance_p1_p2 FROM STATION;
67
Define the Euclidean Distance Formula, Explain why the Formula is written in the fashion it is, & Solve the following prompt: Consider P1 (a, c) and P2(b,d) to be two points on a 2D plane where (a,b) are the respective minimum and maximum values of Northern Latitude (LAT_N) and (c,d) are the respective minimum and maximum values of Western Longitude (LONG_W) in STATION. Query the Euclidean Distance between points P1 and P2 and format your answer to display 4 decimal digits.
Euclidean Distance: The distance between two points/the straight line distance. Point A is (x1, y1) on the graph & Point B is (x2, y2) on the graph. Euclidean Distance Formula: d = √(x2 - x1)² + (y2 - y1)² Why we square before adding: Distances in different dimensions may have opposite signs: Example: moving right (positive x) but down (negative y). -Without squaring, differences could cancel out: (5−0)+(0−5)=0, implying zero distance, which is wrong. -Squaring: √(5)² + (-5)² = √25+25= 7.07, which correctly shows the actual straight-line distance. Reminder: -5²=25 (Squaring a negative number always results in a positive number.) Conclusion regarding the Euclidean Distance Formula: -Square → makes all contributions positive and weighted equally. Square root → converts the result back to the original distance scale (not "squared units"). Query for Provided Prompt: /* P1 (a, c) (x1, y1) P2 (b, d) (x2, y2) a or x1: MIN(LAT_N) c or y1: MIN(LONG_W) b or x2: MAX(LAT_N) d or y2: MAX(LONG_W) Euclidean Distance: The distance between two points/the straight line distance. Point A is (x1, y1) on the graph & Point B is (x2, y2) on the graph. Euclidean Distance Formula: d = √(x2 - x1)² + (y2 - y1)² SQRT(number) POWER(base number, exponent) */ SELECT ROUND( SQRT( POWER( MAX(LAT_N) - MIN(LAT_N) , 2) + POWER( MAX(LONG_W) - MIN(LONG_W) , 2) ) , 4) AS euclidian_distance FROM STATION;
68
SQRT()
SQL, MySQL The SQRT() function returns the square root of a number. Syntax: SQRT(number) Parameter Description number required. A number to calculate the square root of. Must be greater than 0. Example: SELECT SQRT(13)
69
POWER()
SQL, MySQL The POWER() function returns the value of a number raised to the power of another number. Note: This function is equal to the POW() function. Both work in MySQL. Syntax: POWER (x, y) Parameter Description x Required. A number (the base). y Required. A number (the exponent). Example: Return 8 raised to the third power. SELECT POWER( 8, 3) AKA 8^3
70
Window Functions -What they do -5 common window functions -Example of a non-window aggregation vs a window aggregation -Example of a running total
Perform calculations (e.g., aggregation) on a specific subset of data, without losing the level of details of rows. That is, they keep every row without collapsing the dataset. 5 common windows functions: -Partitioned Aggregate -Running Total/Cumulative Sum -ROW_NUMBER -RANK() vs DENSE_RANK -Row-to Row Comparisons: LAG()/LEAD() Example 1: Dataset: emp_id department salary 1 Sales 5000 2 Sales 6000 3 Sales 7000 4 IT 8000 5 IT 9000 Using a Normal Aggregate (GROUP BY) SELECT department, AVG(salary) AS dept_avg_salary FROM employees GROUP BY department; Result: department dept_avg_salary Sales 6000 IT 8500 ❌ Problem: -You lost individual employee rows. -You only have one row per department. vs. Using a Window Function: SELECT emp_id, department, salary, AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary FROM employees; Result: emp_id department salary dept_avg_salary 1 Sales 5000 6000 2 Sales 6000 6000 3 Sales 7000 6000 4 IT 8000 8500 5 IT 9000 8500 ✅ Benefit: -Each row keeps its details (emp_id, salary) -We add an aggregate column without collapsing rows. Example 2: Same Value for All Rows Example If you omit PARTITION BY: SELECT emp_id, salary, SUM(salary) OVER () AS total_salary FROM employees; Results: emp_id salary total_salary 1 5000 35000 2 6000 35000 3 7000 35000 4 8000 35000 5 9000 35000 💡 This is still useful because now you can compute, for example, each employee's share of the total salary. salary / SUM(salary) OVER () AS pct_of_total Example 3: Running total Starting Data emp_id department salary 1 Sales 5000 2 Sales 6000 3 Sales 7000 4 Sales 8000 2️⃣ Running Total Query SELECT emp_id, salary, SUM(salary) OVER (ORDER BY emp_id) AS running_total FROM employees WHERE department = 'Sales'; 3️⃣ How the Window Works The window grows as we move down the rows (because of ORDER BY emp_id), and the sum is recalculated for each row. Step (Row) Window of Rows Considered SUM(salary) Row 1 (5000) 5000 Row 2 (5000, 6000) 11000 Row 3 (5000, 6000, 7000) 18000 Row 4 (5000, 6000, 7000, 8000) 26000 4️⃣ Final Output emp_id salary running_total 1 5000 5000 2 6000 11000 3 7000 18000 4 8000 26000
71
Partitioned Aggregate
Windows function. Adds a group-level metric to every row without collapsing rows. Example: Per Group Average: SELECT emp_id, department, salary, AVG(salary) OVER (PARTITION BY department) AS dept_avg FROM employees; Results (input → output): emp_id dept salary dept_avg 1 Sales 5000 6000 2 Sales 7000 6000 3 IT 8000 8500 4 IT 9000 8500
72
Running Total/Cumulative Sum
Windows function. Adds a cumulative sum in a defined order. Example: SELECT order_date, revenue, SUM(revenue) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_rev FROM sales_by_day; Results: date revenue running_rev 2025‑01‑01 100 100 2025‑01‑02 150 250 2025‑01‑03 200 450
73
ROW_NUMBER()
Windows function. Unique sequential number per partition (no ties). Example: SELECT emp_id, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn FROM employees; Results: emp_id dept salary rn 2 Sales 7000 1 1 Sales 5000 2 4 IT 9000 1 3 IT 8000 2 Use for picking “top 1 per group” (filter rn = 1).
74
????RANK() vs DENSE RANK()
Windows function. Rank within a partition; tie handling differs. Example: SELECT emp_id, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS drnk FROM employees; Results (ties at 7000): salary RANK DENSE_RANK 9000 1 1 7000 2 2 7000 2 2 5000 4 3 RANK skips numbers after ties; DENSE_RANK does not.
75
???LAG()/LEAD()
Windows function. Row-to-Row Comparisons. Peek previous/next row’s value (within a partition & order). Example: SELECT order_date, revenue, LAG(revenue, 1) OVER (ORDER BY order_date) AS prev_rev, LEAD(revenue, 1) OVER (ORDER BY order_date) AS next_rev, revenue - LAG(revenue, 1) OVER (ORDER BY order_date) AS delta_from_prev FROM sales_by_day; Results: o_date revenue prev_rev next_rev delta_from_prev 01‑01 100 NULL 150 NULL 01‑02 150 100 200 50 01‑03 200 150 NULL 50
76
???Handy Patterns & Notes for Windows Functions
Percent of total: revenue * 1.0 / SUM(revenue) OVER () AS pct_of_total Top‑N per group: use ROW_NUMBER() and filter (WHERE rn <= N). Reset windows by group: add PARTITION BY department (or your group key). Order matters: ORDER BY inside OVER() changes row context; add it for running totals, ranks, LAG/LEAD. Frame clauses: Be explicit for cumulative metrics: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
77
GROUP BY Functions
COUNT(expression) SUM(expression) AVG(expression) MIN(expression) MAX(expression)
78
WINDOW Functions -Define & -Name all types.
Window Functions perform calculations within a window. Aggregate Window Functions: COUNT(expression) SUM(expression) AVG(expression) MIN(expression) MAX(expression) Rank Window Functions: *ROW_NUMBER() *RANK() *DENSE_RANK() *CUME_DIST() *PERCENT_RANK() NTILE(number) Value (Analytics) Window Functions: LEAD(expression, offset, default) LAG(expression, offset, default) FIRST_VALUE(expression) NOTES: -All functions with an asterisk * are empty. No number or expression should be entered within its parentheses. -Expression: an argument that you pass to a function. Example of function & expression: AVG(Sales) OVER(PARTITION BY Category OVER OrderDate ROWS UNBOUNDED PRECEEDING) -AVG: Window Function -Sales: Expression NOTES cont'd: A function expression can be: -empty, e.g., RANK() OVER (ORDER BY OrderDate) -use a column header, e.g., AVG(Sales) OVER (ORDER BY OrderDate) -use a number, e.g., NTILE(2) OVER (ORDER BY OrderDate) -have multiple arguments, e.g., LEAD(Sales, 2, 10) OVER (ORDER BY OrderDate) -use conditional logic, e.g., SUM(CASE WHEN Sales>100 THEN 1 ELSE 0 END) OVER (ORDER BY OrderDate) NOTES cont'd: -We use "OVER()" to 1) tell SQL that we're using a window function, otherwise SQL will think we're using a regular GROUP BY function, e.g., AVG(Sales) OVER( & 2) define a window/subset of data. -Everything inside the OVER() function is optional, i.e., PARTITION BY, ORDER BY, the window frame (i.e., ROWS UNBOUNDED PRECEEDING)
79
GROUP BY Rule
All columns in SELECT must be included in GROUP BY So, for example, the below query would result in an error message. Prompt: Find the total sales for each product & provide other details like order ID and order date. SELECT OrderID, OrderDate, ProductID, SUM(Sales) TotalSales FROM Sales.Orders GROUP BY ProductID Solution: Use a Window function instead: SELECT OrderID, OrderDate, ProductID, SUM(Sales) OVER(PARTITION BY ProductID) TotalSalesByProduct FROM Sales.Orders Output: (No header) OrderID OrderDate ProductID TotalSalesByProduct 1 1 25-01-01 101 140 2 3 25-25-01-10 101 140 6 2 25-01-05 102 105
80
PARTITION BY
Operates like GROUP BY. It's just the wording that's used for Window functions. Example: Prompt: Find the total sales for each product & provide other details like order ID and order date. SELECT OrderID, OrderDate, ProductID, SUM(Sales) OVER(PARTITION BY ProductID) TotalSalesByProduct FROM Sales.Orders Output: (No header) OrderID OrderDate ProductID TotalSalesByProduct 1 1 25-01-01 101 140 2 3 25-25-01-10 101 140 6 2 25-01-05 102 105 So, PARTITION BY divides the rows into groups, based on the columns. Example: Data set: Month Product Sales Jan Bottle 20 Jan Caps 10 Jan Bottle 30 Feb Gloves 5 Feb Caps 70 Feb Gloves 40 SUM(Sales) OVER(): Calculation is done on the entire dataset. Month Product Sales Sum Jan Bottle 20 175 Jan Caps 10 175 Jan Bottle 30 175 Feb Gloves 5 175 Feb Caps 70 175 Feb Gloves 40 175 vs. SUM(Sales) OVER(PARTITION BY Month) : Calculation is done individually on each window. Month Product Sales Sum Window 1: Jan Bottle 20 175 Jan Caps 10 175 Jan Bottle 30 175 Window 2: Feb Gloves 5 175 Feb Caps 70 175 Feb Gloves 40 175 NOTE: You can partition the data in 3 main ways. 1) Without a PARTITION BY Total sales across all rows (i.e., the entire result set) SUM(Sales) OVER() 2) PARTITION BY a single column Total sales for each product SUM(Sales) OVER (PARTITION BY Product) 3) PARTITION BY combined rows Total sales for each combination of product and order status SUM(Sales) OVER (PARTITION BY Product, OrderStatus) Example of PARTITION BY a single column vs combined rows: PARTITION BY a single column: PARTITION BY Product SELECT OrderID, Product, OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY Product) AS TotalSalesPerProduct FROM Orders; Output: OrderID Product OrderStatus Sales TotalSalesPerProduct 1 Apple Shipped 100 300 2 Apple Pending 50 300 3 Apple Shipped 150 300 4 Banana Shipped 200 400 5 Banana Pending 75 400 6 Banana Pending 125 400 Explanation: Here, we’re grouping rows only by Product: -All "Apple" rows are treated as one partition → total = 100 + 50 + 150 = 300 -All "Banana" rows are treated as one partition → total = 200 + 75 + 125 = 400 PARTITION BY combined rows: PARTITION BY Product, OrderStatus SELECT OrderID, Product, OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY Product, OrderStatus) AS TotalSalesPerProductAndStatus FROM Orders; Output: OrderID Product OrderStatus Sales TotalSalesPerProductAndStatus 1 Apple Shipped 100 250 2 Apple Pending 50 50 3 Apple Shipped 150 250 4 Banana Shipped 200 200 5 Banana Pending 75 200 6 Banana Pending 125 200 Explanation: Now, we’re grouping rows by the combination of Product AND OrderStatus: -Apple + Shipped → 100 + 150 = 250 -Apple + Pending → 50 -Banana + Shipped → 200 -Banana + Pending → 75 + 125 = 200 NOTE: I believe the above example shows a combination of a fact (sales) AND dimensions (order status, product)
81
GROUP BY vs Window Functions
GROUP BY: Simple data analysis (aggregations) Window: Advanced data analysis (aggregations + details)
82
Syntax of Window Functions
Window Function + Partition Clause AND/OR Order Clause AND/OR Frame clause -Window Function: Calculation used on the window -Partition Clause: Divides the dataset into windows (partitions) -Order Clause: Sorts the data in a window -Frame Clause: Define a subset of rows in a window NOTES: -You must use ORDER BY function for the rank and value window functions, i.e., ---rank window functions: ROW_NUMBER, RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ---value window functions: LEAD, LAG, FIRST_VALUE. -You don't need the ORDER BY function for the aggregate window functions (i.e., COUNT, SUM, MIN, MAX). Example: AVG(Sales) OVER (PARTITION BY Category ORER BY OrderDate ROWS UNBOUNDED PRECEDING)
83
Partition Clause of Window Functions
Divides the dataset into windows (partitions) -UNBOUNDED FOLLOWING: the last row within a window. -UNBOUNDED PRECEDING: the first row within a window -N-PRECEDING: The n-th row before the current row. Example frames: ROWS BETWEEN CURRENT ROW n FOLLOWING -n should be a number. -The frame shifts down after each result. ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -"UNBOUNDED FOLLOWING" is the last row. ROWS BETWEEN n PRECEEDING AND CURRENT ROW -n should be a number. ROWS BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW -"UNBOUNDED PRECEEDING" is the first row in the data set. ROWS BETWEEN n PRECEEDING AND n FOLLOWING -n should be a number. ROWS BETWEEN UNBOUNDED PRECEEDING AND UNBOUND FOLLOWING -Includes all data points.
84
RANK ()
Window Function, one of the rank functions. NOTES: - RANK() expression is always blank, i.e., don't enter anything in the parentheses. -You must use ORDER BY function for the rank and value window functions, i.e., ---rank window functions: ROW_NUMBER, RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ---value window functions: LEAD, LAG, FIRST_VALUE. Example 1: RANK() OVER (PARTITION BY Month ORDER BY Sales DESC) Original Dataset: Month Product Sales Jan Bottle 20 Jan Caps 10 Jan Bottle 30 Feb Gloves 5 Feb Caps 70 Feb Gloves 40 Output: Month Product Sales Rank Jan Bottle 30 1 Jan Caps 20 2 Jan Bottle 10 3 ------ Feb Gloves 70 1 Feb Caps 40 2 Feb Gloves 5 3 Example 2: Prompt: Rank each order based on their sales from highest to lowest & provide details such as order ID and order date. SELECT OrderID, OrderDate, Sales RANK() OVER (ORDER BY Sales DESC) RankSales FROM Sales.Orders Output: (Blank) OrderID OrderDate Sales RankSales 1 8 25-02-18 90 1 2 4 25-1-20 60 2 3 10 25-03-15 60 2 5 6 25-02-05 50 4
85
Window Function, Frame Clause
Defines a subset of rows within each window that is relevant for the chosen calculation. frame, visual with written explanation: { [ () ] } Entire Data>PARTITION BY: Window 1 & Window 2>Frame: frame within window 1 & frame within window 2 {}: Entire Dataset []: Partition BY (): Frame Syntax: AVG(Sales) OVER (PARTITION BY Category ORDER BY OrderDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING) Frame Types: ROWS or RANGE (above syntax example uses ROWS) Frame Boundary (lower value): CURRENT ROW, N PRECEEDING, or UNBOUNDED PRECEDING (The above syntax example uses CURRENT ROW) Frame Boundary (higher value): CURRENT ROW, N FOLLOWING, or UNBOUNDED FOLLOWING (The above syntax example uses UNBOUNDED PRECEDING) NOTES: -You must use ORDER BY if you want to use the frame clause (i.e., ROWS or RANGE) -The lower values must be BEFORE the higher boundary.
86
CURRENT ROW in the window function frame clause
Example: SUM(Sales) OVER(ORDER BY Month ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS Result In a window frame like ROWS BETWEEN CURRENT ROW AND …, “CURRENT ROW” means “the row currently being evaluated in the window,” not “the first row of the data set.” A few key points: -It’s relative and re-evaluated per row. For each output row, the “current row” is that row itself. So the frame starts at different places as you move down the result. -Partitions matter. “Current row” is defined within the window’s PARTITION BY (if any). It’s not the first row of the whole table—only the first of its partition when the evaluation is on that first row. -Order matters. The “current row” is with respect to the window’s ORDER BY. Only when you’re evaluating the first ordered row of a partition does “current row” coincide with the first row. -ROWS vs RANGE. With ROWS, CURRENT ROW refers to that single physical row. With RANGE, CURRENT ROW typically expands to include all “peer” rows that tie on the ORDER BY value.
87
Window Function, Frame Clause Example: SUM(Sales) OVER(ORDER BY Month ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS Result
Example: SUM(Sales) OVER(ORDER BY Month ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS Result Original Dataset: Month Sales Jan 20 Feb 10 Mar 30 Apr 5 Jun 70 Step 1: Month Sales Result Current Row: Jan 20 60 Feb 10 2. Following: Mar 30 Apr 5 Jun 70 Step 2: Frame shifts down a row. Month Sales Result Jan 20 60 Current Row: Feb 10 45 Mar 30 2. Following: Apr 5 Jun 70 *Step 3: Frame shifts down another row. Month Sales Result Jan 20 60 Feb 10 45 Current Row: Mar 30 105 Apr 5 2. Following: Jun 70 *Step 4: Frame shifts down another row Month Sales Result Jan 20 60 Feb 10 45 Mar 30 105 Current Row: Apr 5 75 Jun 70 2. Following: *Step 5: Frame shifts down another row Month Sales Result Jan 20 60 Feb 10 45 Mar 30 105 Apr 5 75 Current Row: Jun 70 70 2. Following:
88
Window Function, Frame Clause Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) Month Sales Result Jan 20 Current Row: Feb 10 115 Mar 30 Apr 5 UNBOUNDED FOLLOWING: Jun 70
89
Window Function, Frame Clause Example: SUM(Sales) OVER(ORDER BY Month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
Example: SUM(Sales) OVER(ORDER BY Month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -UNBOUNDED PRECEDING: the first row within a window Step 1: Month Sales Result UNBOUND PROCEDING Jan 20 Feb 10 CURRENT ROW: Mar 30 60 Apr 5 Jun 70 Step 2: Frame shifts down. Note that the frame will always be fixed at the first row, because the frame specifies UNBOUNDED PRECEDING. Month Sales Result UNBOUND PRECEDING Jan 20 Feb 10 Mar 30 60 CURRENT ROW: Apr 5 65 Jun 70
90
Window Function, Frame Clause Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) Month Sales Result Jan 20 PRECEDING Feb 10 CURRENT ROW: Mar 30 45 1 FOLLOWING: Apr 5 Jun 70
91
Window Function, Frame Clause Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING)
Example: SUM(Sales) OVER (ORDER BY Month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING) UNBOUNDED FOLLOWING: the last row within a window. UNBOUNDED PRECEDING: the first row within a window Month Sales Result UNBOUNDED PRECEDING Jan 20 135 Feb 10 135 CURRENT ROW Mar 30 135 Apr 5 135 UNBOUNDED FOLLOWING Jun 70 135
92
Window Function, Frame Clause Example: SELECT OrderID, OrderDate, OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) TotalSales FROM Sales.Orders
Output: (Blank) OrderID OrderDate OrderStatus Sales TotalSales 1 1 25-01-01 Delivered 10 55 2 3 21-01-10 Delivered 20 95 3 5 25-02-01 Delivered 25 105 4 6 25-02-05 Delivered 50 80 5 7 25-02-15 Delivered 30 30 6 2 25-01-05 Shipped 15 165 Notes: -Row 4 is just 50+ 30 & Row 5 is just 30, because row 5 is the last record of the current partition.
93
Window Function, Compact Frame
Longhand: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING Shorthand: ROWS 2 FOLLOWING NOTES: -Compact frames only work with preceding or unbounded preceding. (It does not work with unbounded following or following.) -Compact frame does not seem to be a standard term in SQL. (Only saw it referenced by Baraa and nowhere else.) Longhand example: SELECT OrderID, OrderDate OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) TotalSales FROM Sales.Orders Shorthand example (with same results as above longhand example): SELECT OrderID, OrderDate, OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate ROWS 2 PRECEDING) TotalSales FROM Sales.Orders
94
Window Function, Default Frame
SQL uses a default frame if ORDER BY is used without a frame. Note: Even though no frame is specified in the below example, you can tell there is a hidden frame. Example: SELECT OrderID, OrderDate, OrderStatus, Sales SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate) TotalSales FROM Sales.Orders; SQL Background Process: 1. The data is partitioned by OrderStatus (i.e., by Delivered and Shipped). 2. SQL orders the rows in each respective partition by OrderDate ASC 3. Default Frame: UNBOUNDED PRECEDING AND CURRENT ROW. UNBOUNDED PRECEDING: First row in each partition. Output: (Blank) OrderID OrderDate OrderStatus Sales TotalSales Start of Frame1: 1 1 25-01-01 Delivered 10 10 2 3 21-01-10 Delivered 20 30 3 5 25-02-01 Delivered 25 55 4 6 25-02-05 Delivered 50 105 End of Frame1: 5 7 25-02-15 Delivered 30 135 Start of Frame2 6 2 25-01-05 Shipped 15 15 7 4 25-01-20 Shipped 60 75
95
4 Rules of Window Functions
1) Window functions can only be used in the SELECT and ORDER BY clauses. Example, SELECT clause: SELECT OrderID, OrderDate, OrderStatus, Sales, SUM(Sales) OVER (PARITION BY OrderStatus) TotalSales FROM Sales.Orders Example, ORDER BY clause: SELECT OrderID, OrderDate, OrderStatus, Sales, SUM(Sales) OVER(PARTITION BY OrderStatus) TotalSales FROM Sales.Orders ORDER BY SUM(Sales) OVER (PARTITION BY OrderStatus) Note: Window functions can't be used to filter data. The following results in an error message: WHERE SUM(Sales) OVER (PARTITION BY OrderStatus) > 100 2) Nesting window functions is not allowed. The below example would result in an error message. SELECT OrderID, OrderDate, OrderStatus, Sales, SUM( SUM(Sales) OVER (PARITION BY OrderStatus) )TotalSales FROM Sales.Orders 3) SQL executes window functions AFTER the WHERE clause. So, SQL will execute WHERE ProductID IN (101, 102) BEFORE it executes SUM(Sales) OVER (PARITION BY OrderStatus) TotalSales SELECT OrderID, OrderDate, OrderStatus, Sales, SUM(Sales) OVER (PARTITION BY OrderStatus) TotalSales FROM Sales.Orders WHERE ProductID IN (101, 202) Step 1: SQL will pull all requested columns where product ID is 101 or 102 Step 2: SQL will partition the results by OrderStatus (i.e., Delivered vs Shipped) Step 3: SQL will sum the sales for each partition. Output: (Blank) OrderID OrderDate OrderStatus Sales TotalSales 1 3 25-01-10 Delivered 20 60 2 7 21-02-15 Delivered 30 60 3 1 25-01-01 Delivered 10 60 4 2 25-01-05 Shipped 15 185 5 8 25-02-18 Shipped 90 185 6 9 25-03-10 Shipped 20 185 7 10 25-03-15 Shipped 60 185 4) A Window function can be used together with the GROUP BY in the same query only if the same columns are used. Example: -- Rank customers based on their total sales SELECT CustomerID, SUM(Sales) TotalSales, RANK() OVER(ORDER BY SUM(Sales) DESC) RankCustomers FROM Sales.Orders GROUP BY CustomerID Output: (Blank) ColumnID TotalSales RankCustomers 1 3 125 1 2 1 110 2 3 4 90 3 4 2 55 4
96
GROUP BY (not a window function)
Use GROUP BY for simple aggregations. (Use window functions for complex aggregations.) Adding more columns to the GROUP BY function splits the data into more granular groups, changing the number of rows you get back. You can think of GROUP BY as defining "buckets" into which SQL groups rows BEFORE applying aggregate functions like SUM or COUNT or AVG. More columns = more buckets. Example Dataset: OrderID Customer Product OrderStatus Sales 1 Alice Laptop Shipped 1000 2 Alice Mouse Shipped 20 3 Bob Laptop Pending 1000 4 Bob Mouse Shipped 25 5 Alice Laptop Shipped 900 Query A: Group by one value SELECT Customer, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer; Output Customer TotalSales Alice 1920 Bob 1025 Here, all rows for the same customer are grouped together, ignoring product or order status. Query B: Group by two values SELECT Customer, Product, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer, Product; Output Customer Product TotalSales Alice Laptop 1900 Alice Mouse 20 Bob Laptop 1000 Bob Mouse 25 Now we split the buckets further, grouping by both customer and product. Instead of 2 rows, we get 4 rows because each combination is a separate group. Query C: Group by even more values SELECT Customer, Product, OrderStatus, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer, Product, OrderStatus; Output Customer Product OrderStatus TotalSales Alice Laptop Pending 900 Alice Laptop Shipped 1000 Alice Mouse Shipped 20 Bob Laptop Pending 1000 Bob Mouse Shipped 25 Here, grouping becomes even more granular because we're differentiating based on status too. Key rules to remember: -Only non-aggregated columns need to appear in GROUP BY. -Columns that are wrapped in an aggregate function do not need to be listed there. Key Takeaways -GROUP BY does not filter data, it aggregates rows into groups. -Every column in SELECT that’s not inside an aggregate function must appear in the GROUP BY, otherwise SQL doesn’t know how to collapse multiple rows into one. -Adding more columns in GROUP BY splits the groups into smaller subsets, producing more rows. -If you want to filter groups, you do that with: ---WHERE (filters before grouping) ----HAVING (filters after grouping)
97
WHERE vs. HAVING
WHERE and HAVING can filter results, but they’re applied at different stages of query processing. -WHERE filters rows before grouping or aggregation happens. -HAVING filters groups after aggregation has been calculated. Tip: Think of WHERE as row-level filtering, while HAVING is group-level filtering. How SQL processes a query (order of execution): 1) FROM -- Identify source tables and joins 2) WHERE -- Filter individual rows (raw data) 3) GROUP BY -- Put the rows in buckets 4) Aggregate functions -- Calculate aggregations (SUM, COUNT, AVG, MIN, & MAX) 5) HAVING -- Filter groups based on aggregated results 6) SELECT -- Return chosen columns or expressions (can now include aggregates) 7) DISTINCT -- Remove duplicate rows from the final result set. 8) ORDER BY -- Sort the results 9) TOP/LIMIT -- Return only a subset of rows 7) ORDER BY – sort the output Example 1 (WHERE) & 2 (HAVING) Dataset: OrderID Customer Product OrderStatus Sales 1 Alice Laptop Shipped 1000 2 Alice Mouse Shipped 20 3 Bob Laptop Pending 1000 4 Bob Mouse Shipped 25 5 Alice Laptop Pending 900 Example 1: WHERE SELECT Customer, Product, SUM(Sales) AS TotalSales FROM Sales.Orders WHERE OrderStatus = 'Shipped' -- row-level filter GROUP BY Customer, Product ORDER BY Customer, Product; Output (only shipped rows are summed): Customer Product TotalSales Alice Laptop 1000 Alice Mouse 20 Bob Mouse 25 Takeaways: -WHERE filters rows before grouping -Only rows where OrderStatus='Shipped' are included in the aggregation. -Alice’s Pending laptop sale (900) is excluded, so her Laptop total is 1000. vs. Example 2: HAVING SELECT Customer, Product, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer, Product HAVING SUM(CASE WHEN OrderStatus = 'Shipped' THEN 1 ELSE 0 END) > 0 -- group-level filter ORDER BY Customer, Product; Output (groups with any shipped row are kept; totals include all rows in the group): Customer Product TotalSales Alice Laptop 1900 Alice Mouse 20 Bob Mouse 25 Takeaways: -HAVING keeps groups that have at least one shipped row -All rows are aggregated; we then keep groups where there is any shipped row. -The totals, however, include all statuses because WHERE did not filter them out. -Because the group (Alice, Laptop) has at least one Shipped row, it’s kept—but the SUM(Sales) adds both the shipped 1000 and the pending 900, yielding 1900. -Bob’s Laptop group has no shipped rows, so that group is removed entirely. Why the WHERE and HAVING queries differ: -WHERE removes non-shipped rows before grouping → shipped-only totals. -HAVING can test for “shipped-ness” after grouping (using a conditional aggregate) → keeps groups that have shipped rows, but totals may still include non-shipped rows unless you also make the SUM conditional. Using WHERE and HAVING together in a query SELECT Customer, Product, SUM(Sales) AS TotalSales FROM Sales.Orders WHERE OrderStatus = 'Shipped' -- filter individual rows GROUP BY Customer, Product HAVING SUM(Sales) > 500; -- filter aggregated groups Output: Customer Product TotalSales Alice Laptop 1000 Alice Mouse 20 Bob Mouse 25 Here: -WHERE ensures only shipped orders are grouped. -HAVING removes any group below a total of 500 (so “Mouse” rows would actually drop if we set HAVING > 500). ✅ Summary Clause Filters data at… Can use aggregates? WHERE Row level (before GROUP) ❌ No HAVING Group level (after GROUP) ✅ Yes
98
You should never put an aggregated function in the GROUP BY clause, only the SELECTED values. Why?
It doesn’t need to be listed in GROUP BY—and in fact, you should not put it there when it’s inside an aggregate function like SUM(). Here’s why: 1️⃣ Two categories of columns in a SELECT with GROUP BY: When you write a SELECT with GROUP BY, every column falls into one of two buckets: -Bucket A: Grouping columns – these appear in GROUP BY because you want one row per unique combination of these values. -Bucket B: Aggregate expressions – these calculate something across each group (e.g., SUM(Sales), COUNT(*), MAX(Date)). SQL says: Every selected column must be either be: -in the GROUP BY (defining the group), or -wrapped in an aggregate function (defining a calculation over that group). Sales is not listed in GROUP BY in the below example because we’re not grouping by individual sales rows, we’re summing them up. If you added Sales to the group, you’d undo the aggregation because you'd group by individual sales values instead of combining them. 2️⃣ Example: What happens if you (incorrectly) add Sales? SELECT Customer, Sales, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer, Sales; Output: Customer Sales TotalSales Alice 1000 1000 Alice 900 900 Alice 20 20 Bob 1000 1000 Bob 25 25 Here, each distinct Sales value becomes its own group. The SUM doesn’t actually combine Alice’s orders into 1920 because you forced SQL to keep them separate. This defeats the purpose of aggregation. ✅ Correct approach: SELECT Customer, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer; Now Sales is aggregated, not grouped by individual amounts. Output: Customer TotalSales Alice 1920 Bob 1025 Key rules to remember: -Only non-aggregated columns need to appear in GROUP BY. Columns that are wrapped in an aggregate function do not (and SHOULD NOT) be listed there.
99
How SQL processes a query
A simplified logical order of execution: 1) FROM -- Identify source tables and joins 2) WHERE -- Filter individual rows (raw data) 3) GROUP BY -- Put the rows in buckets 4) Aggregate functions -- Calculate aggregations (SUM, COUNT, AVG, MIN, & MAX) 5) HAVING -- Filter groups based on aggregated results 6) SELECT -- Return chosen columns or expressions (can now include aggregates) 7) DISTINCT -- Remove duplicate rows from the final result set. 8) ORDER BY -- Sort the results 9) TOP/LIMIT -- Return only a subset of rows
100
Explain when it's better to use WHERE and when it's better to use HAVING to filter data AND Identify which of the below prompts should use WHERE and which should use HAVING in each respective query. And explain why. Prompt 1: "Show me total sales per customer, but only for orders that were shipped." Prompt 2: "Show me customers who have more than $2000 in total sales across all their orders."
-Use WHERE when you need to filter individual rows BEFORE aggregation. -Use HAVING when you need to filter AFTER aggregation, especially based on the results of aggregate functions (e.g., SUM, COUNT, AVG, MIN, MAX). Prompt 1: "Show me total sales per customer, but only for orders that were shipped." Answer: Use WHERE Prompt 2: "Show me customers who have more than $2000 in total sales across all their orders." Answer: Use HAVING Prompt 1: "Show me total sales per customer, but only for orders that were shipped." SELECT Customer, SUM(Sales) AS TotalSales FROM Sales.Orders WHERE OrderStatus = 'Shipped' -- filter rows first GROUP BY Customer; Why WHERE? -We know which rows we want before grouping (OrderStatus='Shipped'), so filtering early reduces unnecessary work and avoids counting irrelevant rows. Prompt 2: "Show me customers who have more than $2000 in total sales across all their orders." SELECT Customer, SUM(Sales) AS TotalSales FROM Sales.Orders GROUP BY Customer HAVING SUM(Sales) > 2000; -- filter after aggregation Why HAVING? -We can’t know which customers meet this condition until after we’ve grouped and summed their sales. WHERE cannot filter on SUM(Sales) because that value doesn’t exist until after grouping.
101
Standard SQL
a lot of analysts start with Standard SQL and then adjust the dialect they use based on what database they are working with. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects. As a junior data analyst, it is important to know that there are slight differences between dialects. But by mastering Standard SQL, which is the dialect you will be working with in this program, you will be prepared to use SQL in any database.
102
Databases & SQL Dialects
A database is a computer program that can store and process large amounts of data. There are many different vendors of databases. PostgreSQL, MySQL, Oracle, and SQL Server are all database products by different vendors. SQL is the programming language used to talk to these databases, and each database product has its own variant of SQL. We call these variants SQL dialects.
103
NoSQL databases
As the amount of data in databases grew, NoSQL databases started gaining popularity. MongoDB, Cassandra, and Redis are among the most widely used NoSQL tools. Interestingly, most of these tools also include the capabilities for retrieving data with SQL. For that reason, some people prefer to call it “Not only SQL” rather than “NoSQL.”
104
SQL Dialects
Standard SQL. MySQL Microsoft SQL Server/MS SQL/SQL Server PostgreSQL Transact-SQL, often abbreviated “T-SQL,” is the dialect of SQL used in the Microsoft SQL Server database. Simply put, the SQL Server’s version of SQL has its own proper name. No other database has a proper name for their SQL dialect. What about PL/SQL and PL/pgSQL? These are languages with which to write your own functions. When you write a piece of SQL code, you usually use the constructions and functions that are predefined in your database. However, all databases allow you to create your own functions or stored procedures. These are called user-defined functions (UDF) and user-defined stored procedures (USP). The language in PostgreSQL is called PL/pgSQL. PostgreSQL’s version was inspired by Oracle’s version; that’s why the names are similar.
105
Data Measurements
Byte (B): 8 bits Example (with approximate size): 1 character in a string (1 byte) Kilobyte (KB): 1024 bytes Example (with approximate size): A page of text (4 kilobytes) Megabyte (MB): 1024 Kilobytes Example (with approximate size): 1 song in MP3 format (2-3 megabytes) Gigabyte (GB) 1024 Megabytes Example (with approximate size): 300 songs in MP3 format (1 gigabyte) Terabyte (TB): 1024 Gigabytes Example (with approximate size): 500 hours of HD video (1 terabyte) Petabyte (PB): 1024 Terabytes Example (with approximate size): 10 billion Facebook photos (1 petabyte) Exabyte (EB): 1024 Petabytes Example (with approximate size): 500 million hours of HD video (1 exabyte) Zettabyte (ZB): 1024 Exabytes Example (with approximate size): All the data on the internet in 2019 (4.5 ZB)
106
BigQuery, FROM
Anytime an uploaded local dataset name is typed in the FROM section of a BigQuery query, the dataset and data table file path name will always be preceded by the project name. Here is a template of the file path structure: personal project name.customer_data.customer_address Where, your personally customized project name will be inserted before the dataset name. If you don't insert your personal project name, it may cause an error to occur in your query. Keep this technique in mind for future lessons anytime you are typing a local file path name in the FROM section of a query. Example: SELECT   DISTINCT fuel_type FROM   your project name.cars.car_info LIMIT 1000 FROM clause of the syntax above, you will need to begin the Table ID line with your personalized project name, period, the dataset name, period, and end with the table name.
107
BigQuery, SUBSTR() function
The SUBSTRING function in SQL is used to extract a portion of a string (a substring) from a larger string. Example: SUBSTR(country, 1, 2) country: the column 1: start with the first letter 2: pull two letters Output: US (not USA)
108
BigQuery, CAST()
Can be used to convert anything from one data type to another. Example: CAST(purchase_price AS FLOAT64) BigQuery stores FLOAT as "FLOAT64". That isn't the case in other SQL platforms. (It just means that were casting numbers in the 64-bit system as FLOATs)
109
BigQuery, FLOAT
A number that contains a decimal
110
Typecasting
Converting data from one type to another.
111
BigQuery, CONCAT()
Adds strings together to create new text strings that can be used as unique keys.
112
BigQuery, COALESCE()
Can be used to return non-null values in a list. Example: COALESCE(product, product_code) AS product_info product: The first column to check product_code: The column to check if the first column value is null In the above example, we wanted a list of product names, but if the name isn't available, we want the product code.
113
BigQuery, INSERT INTO
Adds new data to a database
114
BigQuery, UPDATE
Change existing data in a database
115
BigQuery, DELETE
Remove data from a database
116
Verification
A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable.
117
Change Log/Changelog
A file containing a chronologically ordered list of modifications made to a project.
118
BigQuery, Example of a CASE statement to clean a dataset
SELECT Customer_id, CASE WHEN first_name = 'Tnoy' THEN 'Tony' ELSE first_name END AS cleaned_name FROM project-id.customer_data.customer_name
119
Remove Duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet
120
BigQuery, CAST function Common Conversions
Numeric CAST function can convert to: - Integer - Numeric (number) - Big number - Floating integer - String String CAST function can convert to: - Boolean - Integer - Numeric (number) - Big number - Floating integer - String - Bytes - Date - Date time - Time - Timestamp Date CAST function can convert to: - String - Date - Date time - Timestamp
121
ANSI
American National Standards Institute
122
BigQuery, CAST function
Syntax: CAST(expression AS typename) Example: SELECT CAST(MyCount AS STRING) FROM MyTable
123
SAFE_CAST
Using the CAST function in a query that fails returns an error in BigQuery. To avoid errors in the event of a failed query, use the SAFE_CAST function instead. The SAFE_CAST function returns a value of Null instead of an error when a query fails. Syntax (same as for CAST): SELECT SAFE_CAST(MyDate AS STRING) FROM MyTable
124
BigQuery, INSERT INTO function
a method you can use to import data from one table to another is to use the INSERT INTO command together with a SELECT statement. Syntax: INSERT INTO [destination_table_name] SELECT [column names, separated by commas, or * for all columns] FROM [source_table_name] WHERE [condition]
125
BigQuery, Concatenate function
Join together two or more text strings, such as combining street addresses and primary contacts in a business’ vendor database. You can add multiple items by separating them with commas. Syntax: =CONCATENATE(item 1, item 2) Where appropriate, such as when you’re combining a customer’s first and last name, you should add a space between the items you’re combining by typing quotation marks space quotation marks [“ ”] between the items. Separate this information by a comma as well. This would change the formula to: =CONCATENATE(item 1, " ", item 2).
126
BigQuery, Concat function
In SQL, use the CONCAT function to join strings together to create new text strings. You might combine data simply to improve the readability of reports (such as combining a customer’s first and last name when generating a customer list). Or, you might combine data to generate a unique identifier for the rows in a table. Syntax: In SQL, use the CONCAT function to join strings together to create new text strings. You might combine data simply to improve the readability of reports (such as combining a customer’s first and last name when generating a customer list). Or, you might combine data to generate a unique identifier for the rows in a table. Example: CONCAT('Google', '.com') Output: Google.com
127
BigQuery, Why would you CAST an integer as int64?
BigQuery stores numbers in a 64-bit memory system, which is why there's a 64 after integer in this case.
128
BigQuery, LIMIT
LIMIT function should appear at the end of the query to limit results (e.g., LIMIT 1000 or LIMIT 100 or LIMIT 10) Example: SELECT usertype, CONCAT(start_station_name, " to ", end_station_name) AS route, COUNT(*) AS num_trips, ROUND(AVG(CAST(tripduration AS int64)/60), 2) AS duration_minutes FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY start_station_name, end_station_name, usertype ORDER BY num_trips DESC LIMIT 10
129
BigQuery, CONCAT vs CONCATENATE
CONCAT: You can only combine two items (e.g., JeffBaker). NOTE: A space would be a third item, so JeffBaker is smooshed together. CONCATENATE: You can combine multiple items (e.g., May 30, 1955)
130
BigQuery, CONCAT_WS
CONCAT With Separator It's similar to CONCAT, but it includes a separator, such as a space or period, between the strings. Example: SELECT CONCAT_WS('.', 'www', 'your_company', 'com') as website FROM web_data; The output includes a period ('.') between each part of the website URL, ensuring that the URL is in correct, navigable format. Example: CONCAT_WS (' . ', 'www', 'google', 'com') Output: www.google.com
131
BigQuery, CONCAT with | |
In BigQuery, you can use the || operator to concatenate strings. For instance, if you're working with a dataset of book information and want to create a full title by combining the book's name and its edition, you could use ||, like so: SELECT book_name || ' - ' || edition AS full_book_title FROM library; Ouput: Pride & Prejudice-first edition Note: In some other SQL environments, you cannot use the || operator to concatenate strings. You must use + instead. For example, to concatenate the strings 'Google' and '.com' in Microsoft SQL server, you would use: SELECT 'Google' + '.com' Example: 'Google' || '.com' Output: Google.com
132
SQL environment
Traditional Relational Database Management Systems (RDBMS): BigQuery, MySQL, Postgre SQL, Microsoft SQL Server, Oracle Database, MariaDB, SQ Lite Cloud Data Warehouses & Analytics Platforms: Snowflake, Amazon Redshift, Azure Synapse Analytics Other Data Platforms with SQL Interfaces: Apache Hive, Presto/Trino
133
BigQuery, CONCATENATE vs CONCAT vs CONCAT_WS vs | |
CONCATENATE: Concatenate multiple strings. (more than 2 items) =CONCATENATE(item 1, " ", item 2) CONCAT: Concatenate strings to create new text strings. (2 items max) Example: =CONCAT('Google', '.com') Output: Google.com CONCAT_WS: Concatenate two or more strings together with a separator between each string Example: =CONCAT_WS(' . ', 'www', 'google', 'com') Output: www.google.com ||: Concatenate two or more strings together with the || operator Example: 'Google' || '.com' Output: Google.com
134
BigQuery, MOD()
MOD function calculates the remainder of a division operation. Syntax: MOD(x, y) X:The dividend (the number being divided) Y: The divisor (the number by which x is divided Example 1: MOD(10, 3) Output: 1 (10 divided by 3 is 3 with a remainder of 1) Example 2: MOD(19, 4) Output: 3 (19 divided by 4 is 4 with a remainder of 3) Use cases: -Determining Divisibility: checking if a number is divisible by another. -Data Distribution/Bucketing: Distributing data into groups based on a numeric pattern (e.g., MOD(item_id, bucket_count) assigns items to specific buckets). -Pattern Analysis: Identifying and analyzing repeating patterns in data, such as in time series.
135
Dividend
The number being divided. For example: 20/2 20 is the dividend.
136
Divisor
The number by which a number is divided. For example: 20/2 2 is the divisor.
137
Array
A programming concept. It's a collection of values. Example: An array named my_fruits has the values 'banana', 'apple' and 'orange' stored inside of it. Each value in the array has a position (called an index) which starts at 0. my_fruits = ['banana', 'apple', 'orange'] Name of the array: my_fruits Values: banana, apple, orange Indexes: 0 (banana), 1 (apple), and 2 (orange) Purpose? Arrays make it easier to work with groups of values compared to using a separate variable for each value. So, instead of creating 3 values (e.g., fruit1 = 'banana' fruit2 = 'apple' fruit3 = 'orange), you care an array (e.g., my_fruits = ['banana', 'apple', 'orange'] With an array, you can: -Store a collection of numbers, words, or objects -Access any value using its index (i.e., its position) -Read, update, insert, or remove any of the array values
138
INNER JOIN/JOIN
A function that returns records with matching values in both tables There must be key values in both tables. The records will only merge if there are matches in both tables.
139
LEFT JOIN/LEFT OUTER JOIN
A function that returns all the records from the left table (first mentioned) and only the matching records from the right table (second mentioned). Each row in the left table will appear in the results, even if there is no corresponding match in the right table.
140
RIGHT JOIN/RIGHT OUTER JOIN
A function that returns all records from the right table (second mentioned) and only the matching records from the left table (first mentioned). Each row in the right table will appear in the results, even if there is no corresponding match in the right table.
141
OUTER JOIN/FULL OUTER JOIN
A function that combines the RIGHT JOIN and LEFT JOIN to return all matching records in both tables. It will return all records in both tables.
142
JOIN
A SQL clause that is used to combine rows from two or more tables based on a related column. Joins help you join matching or related columns from different tables.
143
Common Joins
Inner Left Right Outer
144
Primary Keys
Reference columns in which each value is unique. You can only have one primary key per table.
145
Foreign Keys
Primary keys in other tables. You can have multiple foreign keys in a table.
146
How to tell which table is LEFT and which table is RIGHT in JOINs?
In English and SQL, we read from left to right. So, the table mentioned first is left and the table mentioned second is right.
147
Syntaxes for Column & TableAliases
Column Aliases: Use AS Ex// SELECT column_name AS alias_name FROM table_name; If using AS results in an error when running a query because the SQL database you are working with doesn't support it, you can leave it out. In the previous examples, the alternate syntax for aliasing a table: -FROM table_name alias_name -SELECT column_name alias_name Notes: -If the alias contains spaces, you need to put it inside single quotes (or double quotes) like this: column_name AS 'Alias Name' -If a query contains expressions, you can assign column aliases to the expressions. For example: SELECT first_name, last_name, salary * 1.1 AS new_salary FROM employees; Table Aliases: table_name AS table_alias table_name table_alias OR Notes: Assigning an alias to a table does not rename the table permanently. It just gives the table another name temporarily during the execution of a query.
148
Common Errors with Aliases
Since you assign aliases to columns in the SELECT clause, you can only reference the aliases in the clauses that are evaluated after the SELECT clause. The following query will result in an error: SELECT first_name, last_name, salary * 1.1 AS new_salary FROM employees WHERE new_salary > 5000 Error: Unknown column 'new_salary' in 'where clause' Code language: JavaScript (javascript) Why? In this SELECT statement, the database evaluates the clauses in the following order: FROM > WHERE > SELECT Code language: SQL (Structured Query Language) (sql) The database evaluates the WHERE clause before the SELECT clause. Therefore, at the time it evaluates the WHERE clause, the database doesn’t have the information of the new_salary column alias. So it issued an error. However, the following query works correctly: SELECT first_name, last_name, salary * 1.1 AS new_salary FROM employees ORDER BY new_salary; Code language: SQL (Structured Query Language) (sql) Try it In this example, the database evaluates the clauses of the query in the following order: FROM > SELECT > ORDER BY Code language: SQL (Structured Query Language) (sql) The database evaluates the SELECT clause before the ORDER BY clause. Therefore, at the time of evaluating the ORDER BY clause, the database has the information of the new_salary alias, which is the alias of the expression salary * 1.1. Hence, it works as expected
149
Subquery
A SQL query that is nested inside a larger query. Note: -All of the logic of the outer query depends on the logic of the inner query. -The inner most query always executes first. -Standardly subqueries are nested in FROM or WHERE clauses.
150
Name the parts of a query & subquery
Outer Query/Outer Select Inner Query/Inner Select
151
HAVING
Allows you to add a filter to your query instead of the underlying table that can only be used with aggregate functions.
152
CASE
Returns records with your conditions by allowing you to include if/then statements in y our query.
153
Subquery Rules
Subqueries must be enclosed within parentheses. A subquery can have one or more columns specified in the SELECT clause. Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator which allows you to specify multiple values in a WHERE clause. (e.g., SELECT * FROM customers WHERE country IN ('Germany', 'France', 'UK'); A subquery can’t be nested in a SET command. The SET command is used with UPDATE to specify which columns (and values) are to be updated in a table (e.g., UPDATE Customers SET ContactName = 'Alfred Schmidt', City = Frankfurt' WHERE CustomerID = 1; To return more than one row, subqueries must contain multiple value operators such as the IN command.
154
Why are subqueries useful?
Subqueries help you identify and work with smaller sections of data from a large dataset.
155
BigQuery SQL environment, <> & !=
not equal to
156
BigQuery, Calculations
Put them in the SELECT statement Example 1: SELECT Date, Region, Total_Bags, Small_Bags, (Small_Bags / Total_Bags) * 100 AS Small_Bags_Percent FROM `vaulted-algebra-458421-a4.avocado_data.avocado_prices` WHERE Total_Bags != 0 Example 2: SELECT Date, Region, Small_Bags, Large_Bags, XLarge_Bags, Total_Bags, Small_Bags + Large_Bags + XLarge_Bags AS Total_Bags_Calc FROM vaulted-algebra-458421-a4.avocado_data.avocado_prices
157
BigQuery, Extract
The extract command lets us pull one part of a given date to use. Example: Get the year from all of the start_times in a column. (The data source contains information about bike routes & rides.) SELECT EXTRACT(YEAR FROM starttime)
158
When you do & don't need GROUP BY
When you don't need GROUP BY: -If your query is only aggregates (no plain columns): SELECT SUM(order_total) FROM orders; -- total across all rows -Or if it’s only plain columns (no aggregates): SELECT customer_id, order_total FROM orders; When you need GROUP BY: -When you have both aggregates and plain columns. Why? When you use something like SUM(order_total), you’re no longer talking about one row — you’re asking SQL to look across many rows at once and collapse them down. SELECT SUM(order_total) FROM orders; Totally fine — you asked for one number from the entire table. BUT imagine you ask SQL for both at the same time: SELECT customer_id, SUM(order_total) FROM orders; One part (customer_id) is row-based: “give me the ID from this row.” The other part (SUM(order_total)) is aggregate-based: “look at all the rows and collapse them.” SQL pauses and says: ❓ “Wait — which customer_id should I show alongside that SUM? The first one? The last one? Pick one at random? I need you to be specific.” How GROUP BY resolves the conflict GROUP BY tells SQL: 👉 “Don’t get confused — I want you to calculate the aggregate separately for each distinct value of this column (or set of columns).” So: SELECT customer_id, SUM(order_total) FROM orders GROUP BY customer_id; Mental model Think of SQL rows as post-it notes on a wall. -Row mode → SQL is reading each post-it note/line in the table as-is. -Aggregate mode → you sweep all the notes/lines together and add something up. -Mix them? You need GROUP BY to say “stack notes by customer first, then total each stack.”
159
Temporary Table/Temp Table
A database table that is created and exists temporarily on a database server. They're automatically deleted when you end your session. Useful when you only need a table for a short time to complete analysis tasks like calculations. Use cases: Temporary tables in SQL (TEMP, TEMPORARY, or #tables depending on dialect) are like scratchpads. They exist only for the duration of your session/query and don’t clutter your database afterward. They’re useful when you want to organize intermediate steps in analysis or speed things up without permanently saving data. Here are some concrete business use cases: 1. Complex analysis in stages Say you need to analyze customer churn, but the raw data has multiple joins (transactions, support tickets, logins). Use case: First, create a temp table with “active customers last quarter.” Then, create another temp table with “customers who haven’t logged in since.” Finally, join them to calculate churn rate. This avoids writing one gigantic unreadable query. 2. Performance optimization On large datasets, re-running the same subquery repeatedly is expensive. Example: You want to see daily revenue and daily refund totals separately, then combine them. Instead of calculating sales and refunds in two separate subqueries, you: Store daily transactions in a temp table. Re-use that temp table to calculate both revenue and refunds. 👉 Faster and less costly (especially in BigQuery, where query costs are tied to data scanned). 3. Data quality checks You might need to debug or QA transformations. Example: Before rolling up sales by region, you create a temp table of transactions missing region codes. This lets analysts fix data issues before they affect final reports. 4. Iterative what-if analysis Business analysts often test scenarios. Example: Marketing wants to know: What if we gave a 10% discount to customers who ordered 3+ times last month? Step 1: Create a temp table of “loyal customers.” Step 2: Apply the discount logic in a temp table. Step 3: Compare revenue before/after discount. 5. Breaking down complex logic Temp tables make it easier to communicate analysis to non-technical teammates. Example: Instead of handing off a giant query with 15 nested CTEs, you: Build stepwise temp tables: customers_with_orders, customers_with_refunds, net_revenue_by_customer. Then give a simple final query that selects from the last temp table. This helps when working with finance or marketing teams who need to audit logic. 6. One-off “sandboxing” Sometimes business wants an ad hoc analysis. Example: Sales VP asks: “How many orders came from customers created during last year’s Black Friday campaign?” You build a temp table of “Black Friday customers” once, then slice/dice orders against it in multiple ways without rewriting that filter each time.
160
BigQuery, WITH
A type of temp table that you can query from multiple times. WITH clause approximates a temp table (i.e., it creates something that does the same thing as a temp table). It doesn't add a temp table to a database for others to see, but you can see your results and anyone who needs to review your work can see the code that led to your results. Example: WITH longest_used_bike AS ( SELECT bike_id, SUM(duration_minutes) AS trip_duration FROM bigquery-public-data.austin_bikeshare.bikeshare_trips GROUP BY bike_id ORDER BY trip_duration DESC LIMIT 1 ) After writing the above, you can run as many queries as you want on that temp table. For instance: Find the station where the longest-used bike leaves most often. SELECT trips.start_station_id, COUNT(*) AS trip_ct FROM longest_used_bike AS longest INNER JOIN bigquery-public-data.austin_bikeshare.bikeshare_trips AS trips ON longest.bike_id = trips.bike_id GROUP BY trips.start_station_id ORDER BY trip_ct DESC LIMIT 1 *Note: You need to run both the temp table and the above query together and then you'll get the below output. Output: start_station_id: 3798 trip_ct: 177
161
Temp tables
Temporary tables, or temp tables, store subsets of data from standard data tables for a certain period of time. Temp tables allow you to run calculations in temporary data tables without needing to make modifications to the primary tables in your database. Because they are temporary, they are automatically deleted at the end of your SQL session.
162
SELECT INTO
Using SELECT INTO is a good practice when you want to keep the database uncluttered & other people don't need to use the table. NOTE: BigQuery doesn't recognize SELECT INTO; you have to use it in a different RDBMS (relational database management system). Example: SELECT * INTO AfricaSales FROM GlobalSales WHERE Region = "Africa"
163
CREATE TABLE
If lots of people are using a table, use the CREATE TABLE statement instead of the SELECT INTO statement. CREATE TABLE adds the table into the database. Example: CREATE TABLE AfricaSales AS ( SELECT * FROM GlobalSales WHERE Region = "Africa" )
164
SELECT INTO vs CREATE TABLE
If lots of people are using a table, use the CREATE TABLE statement instead of the SELECT INTO statement. Using SELECT INTO is a good practice when you want to keep the database uncluttered & other people don't need to use the table.
165
RDBMS
Relational Database Management System (RDBMS), is a software system that organizes data into tables with rows and columns to create, manage, and interact with relational databases. It uses the Structured Query Language (SQL) to manipulate data and maintain relationships between different data points through keys, ensuring data integrity, consistency, and security.
166
Pre-Processing of the Data
Temp Tables can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data.
167
Staging/Data Staging
Temp Tables can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data.
168
Benefits of temp tables
They are automatically deleted from the database when you end your SQL session. They can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data. They can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data. They can store a filtered subset of the database. You don’t need to select and filter the data each time you work with it. In addition, using fewer SQL commands helps to keep your data clean.
169
BigQuery, DELETE TABLE vs DROP TABLE
Deleting a temporary table removes the rows of the table but leaves the table definition and columns ready to be used again. Dropping a temporary table not only removes the information contained in the rows of the table, but removes the table variable definitions (columns) themselves.
170
Global Temp Tables
A temporary table is only accessible to the connection that created that temporary table. It is not accessible to other connections. However, we can create temporary tables that are accessible to all the open connections. Such temporary tables are called global temporary tables. The name of the global temporary table starts with a double hash symbol (##). Let’s create a global temporary table that contains records of all female students from the student table. USE schooldb; SELECT name, age, gender INTO ##FemaleStudents FROM student WHERE gender = 'Female'
171
Table Variables vs Temp Tables
You should consider table variables as a first choice when working with relatively small data sets. They are easier to work with and they trigger fewer recompiles in the routines in which they’re used, compared to using temporary tables. If you are doing more complex processing on temporary data, or need to use more than reasonably small amounts of data in them, then local temporary tables are likely to be a better choice. Note: there might be situations where you don’t need to create permanent tables. Remember: storing data in SQL costs the company money and resources. If you don’t need a permanent table, you can create temporary tables instead. Temporary tables only exist within your session (or up to 24 hours depending on your SQL platform) and aren’t shareable or viewable by others. Temporary tables exist only for you during your session. Think of temporary tables like a scratch pad where you can scribble out your calculations before putting down your final answer.
172
Methods to Create a Temp Table (regular and global temp tables)
1) SELECT INTO statement Example: SELECT select_list INTO temporary_table FROM table_name .... 2) CREATE TABLE statement Example: CREATE TABLE #haro_products ( product_name VARCHAR(MAX), list_price DEC(10,2) ); There are three key things to notice here: 1.If you are running two queries at the same time, which you have to do with temporary tables (only available during that session), then you need a semicolon separating each query. 2.The first query is where you created the temporary table. The second query references this temporary table and its fields. That is why you can access the sum of tickets as total_tickets in the second query. 3.When creating table names, don’t use spaces or the query will return an error. Best practice is to use camelcase capitalization when naming the table you are building. Camelcase capitalization means that you capitalize the start of each word without any spaces in between, just like a camel’s hump. For example, the table TicketsByOccasion uses camelcase capitalization.
173
WHERE vs HAVING clause
The WHERE clause is used to make filters on your table, like a filter for certain date ranges or specific countries. The HAVING clause is used to make filters on your aggregations and has to be paired with a GROUP BY clause.
174
**When to create a permanent table vs a temp table
Here are the three conditions for when you would want to create a permanent table. All three conditions should be met. 1. Complex query containing multiple JOINs 2. Result output is a table 3. You need to run the query frequently or on a regular basis Alternatively, Temporary tables are used to break down complex queries into smaller increments. These complex queries can contain multiple JOINs but don’t necessarily have to. You might want to use temporary tables if one or more of the following conditions apply: ●Slowly running query with multiple JOINs and WHERE statements ●Slowly running query containing GROUP BY and HAVING ●Nested queries (i.e. query within a query) ●If you need to do a calculation on top of a calculation (e.g. take sum per day then average across the day sums) If any of the above conditions are met, then using a temporary table may speed up your query, which will make it easier for you to write the query, as well as make it easier for you to troubleshoot your query if something goes wrong.
175