You are viewing a preview of this course. Sign in to start learning

Lesson 7: Advanced Queries — Window Functions

Master window functions for sophisticated data analysis: ROW_NUMBER, RANK, DENSE_RANK, running totals, moving averages, LAG and LEAD. Level up your SQL analytics.

Lesson 7: Advanced Queries — Window Functions 🪟

You've mastered JOINs, subqueries, and aggregations. Now it's time to unlock one of SQL's most powerful features: window functions. These let you perform calculations across sets of rows that are related to the current row—without collapsing them into a single result like GROUP BY does.

What Are Window Functions? 🤔

Imagine you're analyzing sales data and want to:

  • Rank products by revenue within each category
  • Calculate a running total of orders over time
  • Compare each day's sales to the previous day
  • Find the top 3 customers in each region

Regular aggregations with GROUP BY collapse your rows. Window functions let you keep all your rows while adding analytical calculations. Think of it as looking through a "window" at related rows while staying on your current row.

Regular Aggregation (GROUP BY):          Window Function:
+----------+-------+                      +----------+-------+------+
| category | total |                      | product  | price | rank |
+----------+-------+                      +----------+-------+------+
| Books    | 500   |  ← Lost details     | Book A   | 25    | 1    |
| Toys     | 300   |                      | Book B   | 20    | 2    |
+----------+-------+                      | Toy A    | 15    | 1    |
                                          | Toy B    | 10    | 2    |
                                          +----------+-------+------+
                                          ↑ Keeps all rows!

Core Syntax Structure 📋

Every window function follows this pattern:

function_name() OVER (
  [PARTITION BY column]
  [ORDER BY column]
  [ROWS/RANGE frame_specification]
)

Components:

  • Function: What calculation to perform (ROW_NUMBER, SUM, AVG, etc.)
  • OVER: Signals this is a window function
  • PARTITION BY: Divides data into groups (like GROUP BY, but doesn't collapse)
  • ORDER BY: Defines the order for calculations within each partition
  • Frame: Specifies which rows to include (more on this later)

💡 Tip: Think of PARTITION BY as creating "buckets" and ORDER BY as sorting within each bucket.

Ranking Functions 🏆

Three key functions assign ranks to rows:

1. ROW_NUMBER()

Assigns a unique sequential number to each row, even for ties.

SELECT 
  product_name,
  category,
  price,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) as row_num
FROM products;

Result:

+-------------+----------+-------+---------+
| product_name| category | price | row_num |
+-------------+----------+-------+---------+
| Laptop      | Tech     | 1200  | 1       |
| Tablet      | Tech     | 800   | 2       |
| Mouse       | Tech     | 25    | 3       |
| Novel       | Books    | 30    | 1       |
| Textbook    | Books    | 30    | 2       | ← Same price, different number
| Magazine    | Books    | 10    | 3       |
+-------------+----------+-------+---------+

2. RANK()

Assigns the same rank to ties, but skips numbers after ties.

SELECT 
  student_name,
  score,
  RANK() OVER (ORDER BY score DESC) as rank
FROM exam_results;

Result:

+--------------+-------+------+
| student_name | score | rank |
+--------------+-------+------+
| Alice        | 95    | 1    |
| Bob          | 90    | 2    |
| Carol        | 90    | 2    | ← Tie at rank 2
| David        | 85    | 4    | ← Skips 3!
+--------------+-------+------+

3. DENSE_RANK()

Assigns the same rank to ties, but doesn't skip numbers.

SELECT 
  student_name,
  score,
  DENSE_RANK() OVER (ORDER BY score DESC) as dense_rank
FROM exam_results;

Result:

+--------------+-------+------------+
| student_name | score | dense_rank |
+--------------+-------+------------+
| Alice        | 95    | 1          |
| Bob          | 90    | 2          |
| Carol        | 90    | 2          | ← Tie at rank 2
| David        | 85    | 3          | ← No skip!
+--------------+-------+------------+

🧠 Mnemonic: RANK Really Skips, DENSE_RANK Doesn't Skip.

Analytical Functions: LAG and LEAD 📊

LAG() and LEAD() let you access data from previous or next rows without a self-join!

LAG() - Looking Backward ⏮️

SELECT 
  date,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY date) as previous_day_revenue,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) as change
FROM daily_sales;

Result:

+------------+---------+----------------------+--------+
| date       | revenue | previous_day_revenue | change |
+------------+---------+----------------------+--------+
| 2024-01-01 | 1000    | NULL                 | NULL   |
| 2024-01-02 | 1200    | 1000                 | 200    |
| 2024-01-03 | 950     | 1200                 | -250   |
+------------+---------+----------------------+--------+

Syntax: LAG(column, offset, default) where:

  • offset: How many rows back (default: 1)
  • default: Value if no previous row exists (default: NULL)

LEAD() - Looking Forward ⏭️

SELECT 
  employee_name,
  hire_date,
  LEAD(employee_name, 1) OVER (ORDER BY hire_date) as next_hire
FROM employees;

Works exactly like LAG but looks ahead instead of behind.

Running Totals and Aggregations 📈

You can use standard aggregation functions (SUM, AVG, COUNT) as window functions!

Running Total

SELECT 
  order_date,
  amount,
  SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders;

Result:

+------------+--------+---------------+
| order_date | amount | running_total |
+------------+--------+---------------+
| 2024-01-01 | 100    | 100           |
| 2024-01-02 | 150    | 250           | ← 100+150
| 2024-01-03 | 200    | 450           | ← 100+150+200
+------------+--------+---------------+

Moving Average 📊

Calculate averages over a sliding window of rows:

SELECT 
  date,
  temperature,
  AVG(temperature) OVER (
    ORDER BY date 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) as three_day_avg
FROM weather;

This computes the average of the current row plus 2 preceding rows (3-day moving average).

Frame specification options:

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW       ← Last 3 rows
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  ← All rows up to current
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING       ← Previous, current, next

Real-World Examples 🌍

Example 1: Top 3 Products Per Category 🏆

Scenario: Find the 3 most expensive products in each category.

WITH ranked_products AS (
  SELECT 
    product_name,
    category,
    price,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) as rank
  FROM products
)
SELECT product_name, category, price, rank
FROM ranked_products
WHERE rank <= 3;

💡 Why DENSE_RANK? If two products tie for 2nd place, you still want to see the 3rd place product!

Example 2: Month-Over-Month Growth 📈

Scenario: Calculate percentage change in monthly revenue.

SELECT 
  month,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY month) as previous_month,
  ROUND(
    (revenue - LAG(revenue, 1) OVER (ORDER BY month)) * 100.0 / 
    LAG(revenue, 1) OVER (ORDER BY month), 
    2
  ) as growth_percentage
FROM monthly_revenue;

Result:

+--------+---------+----------------+-------------------+
| month  | revenue | previous_month | growth_percentage |
+--------+---------+----------------+-------------------+
| Jan    | 10000   | NULL           | NULL              |
| Feb    | 12000   | 10000          | 20.00             |
| Mar    | 11500   | 12000          | -4.17             |
+--------+---------+----------------+-------------------+

Example 3: Customer Purchase Patterns 🛒

Scenario: For each customer, show their purchases with running total and average order value.

SELECT 
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as total_spent,
  AVG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as avg_order_value,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number
FROM orders
ORDER BY customer_id, order_date;

Result:

+-------------+------------+--------+-------------+-----------------+--------------+
| customer_id | order_date | amount | total_spent | avg_order_value | order_number |
+-------------+------------+--------+-------------+-----------------+--------------+
| 101         | 2024-01-01 | 50     | 50          | 50.00           | 1            |
| 101         | 2024-01-05 | 75     | 125         | 62.50           | 2            |
| 101         | 2024-01-10 | 100    | 225         | 75.00           | 3            |
| 102         | 2024-01-02 | 200    | 200         | 200.00          | 1            |
+-------------+------------+--------+-------------+-----------------+--------------+

Notice how PARTITION BY customer_id creates separate "windows" for each customer!

Example 4: Gap and Island Problem 🏝️

Scenario: Find consecutive days of activity.

WITH numbered_days AS (
  SELECT 
    activity_date,
    ROW_NUMBER() OVER (ORDER BY activity_date) as rn,
    DATE_SUB(activity_date, INTERVAL ROW_NUMBER() OVER (ORDER BY activity_date) DAY) as island_id
  FROM user_activity
)
SELECT 
  MIN(activity_date) as streak_start,
  MAX(activity_date) as streak_end,
  COUNT(*) as days_in_streak
FROM numbered_days
GROUP BY island_id
ORDER BY streak_start;

This clever technique identifies "islands" of consecutive dates!

Common Mistakes ⚠️

1. Forgetting OVER Clause

Wrong:

SELECT product_name, RANK() FROM products;  -- Error!

Correct:

SELECT product_name, RANK() OVER (ORDER BY price DESC) FROM products;

2. Confusing PARTITION BY with GROUP BY

Wrong thinking: "PARTITION BY collapses rows like GROUP BY"

Truth: PARTITION BY creates separate windows but keeps all rows. GROUP BY reduces rows.

3. Mixing Window Functions with Regular Aggregates

Wrong:

SELECT 
  category,
  COUNT(*),  -- Regular aggregate
  ROW_NUMBER() OVER (ORDER BY price)  -- Window function
FROM products
GROUP BY category;  -- Conflict!

Correct: Use subquery or CTE:

WITH numbered AS (
  SELECT *, ROW_NUMBER() OVER (ORDER BY price) as rn
  FROM products
)
SELECT category, COUNT(*)
FROM numbered
GROUP BY category;

4. Wrong Frame Specification

Wrong:

SUM(amount) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)

This gives you 4 rows (3 preceding + current), not 3!

Correct for 3-row window:

SUM(amount) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

5. Not Handling NULLs with LAG/LEAD

Risky:

revenue - LAG(revenue) OVER (ORDER BY date)  -- NULL on first row!

Safe:

revenue - COALESCE(LAG(revenue) OVER (ORDER BY date), 0)

Performance Considerations 🚀

  1. Window functions can be expensive - they need to sort data
  2. Use indexes on columns in PARTITION BY and ORDER BY
  3. Limit partitions - fewer partitions = better performance
  4. Consider materialized views for frequently-used window calculations

🤔 Did you know? Window functions were added to the SQL standard in 2003, but many databases didn't support them until the 2010s. Now they're essential for modern data analysis!

Key Takeaways 🎯

Window functions perform calculations across related rows without collapsing them

PARTITION BY divides data into groups; ORDER BY sorts within groups

Ranking functions: ROW_NUMBER (unique), RANK (skips), DENSE_RANK (no skip)

LAG/LEAD access previous/next rows without self-joins

Running totals: Use SUM() OVER (ORDER BY ...)

Moving averages: Use frame specifications like ROWS BETWEEN

Always include OVER clause - that's what makes it a window function!

Common pattern: Use CTE with window function, then filter results

📚 Further Study

📋 Quick Reference Card

╔════════════════════════════════════════════════════════════╗
║              WINDOW FUNCTIONS CHEAT SHEET                  ║
╠════════════════════════════════════════════════════════════╣
║ BASIC SYNTAX:                                              ║
║   function() OVER ([PARTITION BY x] [ORDER BY y])         ║
║                                                            ║
║ RANKING FUNCTIONS:                                         ║
║   ROW_NUMBER()  → 1,2,3,4,5     (always unique)           ║
║   RANK()        → 1,2,2,4,5     (skips after ties)        ║
║   DENSE_RANK()  → 1,2,2,3,4     (no skips)                ║
║                                                            ║
║ ACCESS OTHER ROWS:                                         ║
║   LAG(col, n)   → value from n rows before                ║
║   LEAD(col, n)  → value from n rows after                 ║
║                                                            ║
║ AGGREGATIONS:                                              ║
║   SUM/AVG/COUNT/MIN/MAX() OVER (...)                      ║
║                                                            ║
║ FRAME SPECIFICATIONS:                                      ║
║   ROWS BETWEEN n PRECEDING AND CURRENT ROW                ║
║   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW        ║
║   ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING                ║
║                                                            ║
║ COMMON PATTERNS:                                           ║
║   Top N per group:                                         ║
║     WHERE RANK() OVER (...) <= N                          ║
║   Running total:                                           ║
║     SUM(x) OVER (ORDER BY date)                           ║
║   % change:                                                ║
║     (x - LAG(x)) / LAG(x) * 100                           ║
╚════════════════════════════════════════════════════════════╝