Lesson 8: Performance & Best Practices

Master SQL optimization: reading EXPLAIN plans, index strategies, avoiding N+1 queries, and writing maintainable code.

Introduction 🚀

You've learned how to write SQL queries that work, but do they work efficiently? A query that returns correct results in 5 seconds might return the same data in 50 milliseconds with proper optimization. In production systems handling thousands of requests per second, this difference isn't just noticeable: it's the difference between a responsive application and one that crashes under load.

This lesson teaches you how to think like a database engine, identify performance bottlenecks, and write SQL that scales. You'll learn to read EXPLAIN plans (the database's execution blueprint), understand indexes (the secret to fast lookups), avoid the dreaded N+1 query problem, and recognize common anti-patterns that plague real-world applications.

💡 Remember: Premature optimization is the root of all evil, but knowing how to optimize when needed is the mark of a professional developer.


Core Concepts 🎯

1. Understanding EXPLAIN Plans 📊

Every SQL query goes through a query planner that decides how to execute it. The EXPLAIN command shows you this execution plan: the roadmap the database follows to retrieve your data.

EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

A typical EXPLAIN output looks like:

Seq Scan on users  (cost=0.00..35.50 rows=1 width=100)
  Filter: (email = 'john@example.com')

Key metrics to understand:

  • Seq Scan (Sequential Scan): The database reads every row in the table. This is slow for large tables.
  • Index Scan: The database uses an index to jump directly to relevant rows. Much faster!
  • cost: Estimated computational expense (startup cost..total cost, in arbitrary planner units)
  • rows: Estimated number of rows returned
  • width: Average row size in bytes

🧠 Mnemonic: "Seq Scan = Slow Search" (checks every row)

Different scan types:

+------------------+---------------------------+
| Scan Type        | When It Happens           |
+------------------+---------------------------+
| Seq Scan         | No index, small table     |
| Index Scan       | Using index, few rows     |
| Index Only Scan  | All data in index itself  |
| Bitmap Scan      | Multiple index conditions |
| Nested Loop      | JOIN with small dataset   |
| Hash Join        | JOIN with larger dataset  |
| Merge Join       | JOIN on sorted data       |
+------------------+---------------------------+

⚠️ EXPLAIN vs EXPLAIN ANALYZE: EXPLAIN only shows the estimated plan; EXPLAIN ANALYZE actually runs the query and reports real timings. Prefer development/staging for ANALYZE, and be especially careful with data-modifying statements, because the query really executes!
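
For instance, here is a minimal look at the extra detail ANALYZE adds (the output below is illustrative only; your costs, row counts, and timings will differ):

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'john@example.com';

-- Illustrative output (no index on email yet):
-- Seq Scan on users  (cost=0.00..35.50 rows=1 width=100) (actual time=0.520..3.470 rows=1 loops=1)
--   Filter: (email = 'john@example.com')
--   Rows Removed by Filter: 9999
-- Planning Time: 0.080 ms
-- Execution Time: 3.495 ms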

2. Index Strategies 🔍

An index is like a book's table of contents: it lets you jump to the right page instead of reading every page sequentially.

How indexes work conceptually:

Without Index:               With Index on email:
Table scan every row         B-tree lookup
  ↓                              ↓
Row 1: alice@...           'john@...' → Row 4523
Row 2: bob@...
Row 3: charlie@...
...
Row 4523: john@... ✓
...
Row 10000: zoe@...

Time: O(n)                 Time: O(log n)

Creating indexes:

CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
CREATE UNIQUE INDEX idx_users_username ON users(username);

When to index:

✅ DO index:

  • Columns in WHERE clauses
  • Foreign keys used in JOINs
  • Columns in ORDER BY
  • Columns in GROUP BY
  • Frequently searched fields

❌ DON'T index:

  • Small tables (< 1000 rows)
  • Columns with low cardinality (few unique values like boolean)
  • Columns that change frequently
  • Tables with heavy INSERT/UPDATE load

🤔 Did you know? Indexes speed up reads but slow down writes! Each INSERT/UPDATE/DELETE must update all relevant indexes. This is the classic read-vs-write tradeoff.

Composite indexes (multi-column) follow the leftmost prefix rule:

INDEX on (user_id, created_at, status)

✅ Can use for:
   WHERE user_id = 5
   WHERE user_id = 5 AND created_at > '2024-01-01'
   WHERE user_id = 5 AND created_at > '2024-01-01' AND status = 'active'

❌ Cannot use for:
   WHERE created_at > '2024-01-01'  (doesn't start with user_id)
   WHERE status = 'active'          (skips first columns)
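
A quick way to check this yourself is to compare plans (the index below is a hypothetical one matching the column list above, on an assumed orders table):

CREATE INDEX idx_orders_user_date_status ON orders(user_id, created_at, status);

EXPLAIN SELECT * FROM orders WHERE user_id = 5 AND created_at > '2024-01-01';
-- Expected: Index Scan (or Bitmap Index Scan) using idx_orders_user_date_status

EXPLAIN SELECT * FROM orders WHERE status = 'active';
-- Expected: Seq Scan, because the index's leading column (user_id) is not constrained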

3. The N+1 Query Problem 🐌

This is the most common performance killer in applications. Here's the scenario:

-- Query 1: Get all users
SELECT * FROM users LIMIT 10;

-- Then in your application loop:
for each user:
    -- Query 2, 3, 4... 11: Get each user's orders
    SELECT * FROM orders WHERE user_id = ?

The problem: You execute 1 query to get users, then N additional queries (one per user). For 10 users, that's 11 queries. For 1000 users, that's 1001 queries! 😱

The solution: Use a JOIN or IN clause to fetch everything in one query:

-- Solution 1: JOIN
SELECT users.*, orders.* 
FROM users 
LEFT JOIN orders ON users.id = orders.user_id
LIMIT 10;

-- Solution 2: IN clause with subquery
SELECT * FROM users WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
SELECT * FROM orders WHERE user_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

Visual comparison:

N+1 Problem:                Optimized:
  [App]                       [App]
    ↓                            ↓
  Query 1 → [DB]            Query 1 (JOIN) → [DB]
    ↓                            ↓
  Query 2 → [DB]            [All data returned]
    ↓
  Query 3 → [DB]
    ↓
   ...
    ↓
  Query N → [DB]

Time: N * latency          Time: 1 * latency

🧠 Mnemonic: "N+1 = Never do 1 query per row!"

4. Common Anti-Patterns ⚠️

Anti-Pattern #1: SELECT *

-- BAD: Fetches all columns
SELECT * FROM users WHERE id = 5;

-- GOOD: Only fetch what you need
SELECT id, name, email FROM users WHERE id = 5;

Why it matters: SELECT * transfers unnecessary data over the network, prevents covering indexes (indexes that contain all the columns a query needs), and can break application code when the schema changes.
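
To illustrate the covering-index point, here is a sketch using PostgreSQL's INCLUDE syntax (version 11+; the index name is made up for this example):

CREATE INDEX idx_users_email_covering ON users(email) INCLUDE (id, name);

EXPLAIN SELECT id, name, email FROM users WHERE email = 'john@example.com';
-- Expected: Index Only Scan using idx_users_email_covering
-- The same query with SELECT * would have to fetch full rows from the table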

Anti-Pattern #2: Functions on Indexed Columns

-- BAD: Function prevents index usage
SELECT * FROM users WHERE UPPER(email) = 'JOHN@EXAMPLE.COM';

-- GOOD: Store normalized data or use functional index
SELECT * FROM users WHERE email = 'john@example.com';

When you apply a function to an indexed column, the database can't use the index!
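
If you genuinely need case-insensitive matching, an expression index is one option (PostgreSQL syntax shown; the index name is illustrative):

CREATE INDEX idx_users_email_upper ON users (UPPER(email));

-- This predicate can now use the index, because it matches the indexed expression:
SELECT * FROM users WHERE UPPER(email) = 'JOHN@EXAMPLE.COM';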

Anti-Pattern #3: OR in WHERE Clauses

-- BAD: OR often prevents index usage
SELECT * FROM products WHERE category = 'electronics' OR price < 100;

-- GOOD: Use UNION when possible
SELECT * FROM products WHERE category = 'electronics'
UNION
SELECT * FROM products WHERE price < 100;

Anti-Pattern #4: LIKE with Leading Wildcard

-- BAD: Cannot use index
SELECT * FROM users WHERE email LIKE '%@example.com';

-- GOOD: Trailing wildcard CAN use index
SELECT * FROM users WHERE email LIKE 'john%';
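
If a leading wildcard is genuinely required, one workaround in PostgreSQL is a trigram index (requires the pg_trgm extension; the index name is illustrative):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_email_trgm ON users USING gin (email gin_trgm_ops);

-- With the trigram index in place, even a leading wildcard can use an index:
SELECT * FROM users WHERE email LIKE '%@example.com';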

Anti-Pattern #5: NOT IN with NULLs

-- DANGEROUS: NOT IN returns no rows if subquery contains NULL
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned);
-- If banned.user_id has any NULL, this returns ZERO rows!

-- SAFE: Use NOT EXISTS
SELECT * FROM users u 
WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.user_id = u.id);

5. Writing Maintainable SQL 📝

Use meaningful aliases:

-- BAD
SELECT u.*, o.* FROM users u JOIN orders o ON u.id = o.user_id;

-- GOOD
SELECT 
    users.id,
    users.name,
    orders.total,
    orders.created_at
FROM users
JOIN orders ON users.id = orders.user_id;

Format for readability:

SELECT 
    u.id,
    u.name,
    COUNT(o.id) AS order_count,
    SUM(o.total) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at >= '2024-01-01'
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 5
ORDER BY total_spent DESC
LIMIT 10;

Use comments for complex logic:

-- Find users who placed orders in the last 30 days
-- but haven't placed one in the last 7 days (re-engagement target)
SELECT DISTINCT o.user_id
FROM orders o
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'
  AND NOT EXISTS (
      -- NOT EXISTS is NULL-safe, unlike NOT IN (see Anti-Pattern #5)
      SELECT 1 FROM orders recent
      WHERE recent.user_id = o.user_id
        AND recent.created_at >= CURRENT_DATE - INTERVAL '7 days'
  );

Detailed Examples 💻

Example 1: Optimizing a Slow Query

Initial query (slow on large table):

SELECT * FROM orders 
WHERE YEAR(created_at) = 2024 AND status = 'completed';

Step 1: Check EXPLAIN plan

EXPLAIN SELECT * FROM orders 
WHERE YEAR(created_at) = 2024 AND status = 'completed';

-- Result: Seq Scan (bad!)
-- Problem: YEAR() function prevents index usage

Step 2: Rewrite without function

SELECT id, user_id, total, created_at 
FROM orders 
WHERE created_at >= '2024-01-01' 
  AND created_at < '2025-01-01'
  AND status = 'completed';

Step 3: Add composite index

CREATE INDEX idx_orders_created_status ON orders(created_at, status);

Step 4: Verify improvement

EXPLAIN ANALYZE SELECT id, user_id, total, created_at 
FROM orders 
WHERE created_at >= '2024-01-01' 
  AND created_at < '2025-01-01'
  AND status = 'completed';

-- Result: Index Scan using idx_orders_created_status
-- Execution time: 2.3ms (was 450ms)

Optimization summary:

+-------------------------+------------------+
| Change                  | Impact           |
+-------------------------+------------------+
| Removed YEAR()          | Index usable     |
| Removed SELECT *        | Less data xfer   |
| Added composite index   | Fast lookup      |
+-------------------------+------------------+
Result: 195x faster! ⚡

Example 2: Solving N+1 in an E-Commerce App

Scenario: Display 50 products with their category names.

❌ N+1 Problem Code:

-- Query 1: Get products
SELECT id, name, category_id, price FROM products LIMIT 50;

-- Application loop (executes 50 times!):
for each product:
    SELECT name FROM categories WHERE id = product.category_id;

Total: 51 queries (1 + 50)

✅ Optimized with JOIN:

SELECT 
    p.id,
    p.name AS product_name,
    p.price,
    c.name AS category_name
FROM products p
JOIN categories c ON p.category_id = c.id
LIMIT 50;

Total: 1 query!

Performance comparison:

N+1 Approach:
  51 queries × 5ms latency = 255ms

JOIN Approach:
  1 query × 8ms = 8ms

Speedup: 31x faster! 🚀

Example 3: Index Strategy for Search Feature

Requirement: Users search products by name and filter by category and price range.

Typical queries:

-- Search by name
SELECT * FROM products WHERE name LIKE 'laptop%';

-- Filter by category
SELECT * FROM products WHERE category_id = 5;

-- Filter by price range
SELECT * FROM products WHERE price BETWEEN 100 AND 500;

-- Combined search
SELECT * FROM products 
WHERE name LIKE 'laptop%' 
  AND category_id = 5 
  AND price BETWEEN 100 AND 500;

Index strategy:

-- Index 1: For name searches
CREATE INDEX idx_products_name ON products(name);
-- Supports: WHERE name LIKE 'laptop%' (but NOT '%laptop%')

-- Index 2: For category filtering
CREATE INDEX idx_products_category ON products(category_id);

-- Index 3: Composite for common combined search
CREATE INDEX idx_products_cat_price ON products(category_id, price);
-- Supports: WHERE category_id = ? AND price BETWEEN ? AND ?

Why not one huge composite index?

CREATE INDEX idx_products_all ON products(name, category_id, price);

This would only help queries that filter by name first (leftmost prefix rule). Our separate indexes give flexibility for different query patterns.

Example 4: Avoiding Anti-Patterns in Reports

❌ BAD: Multiple scans and functions

SELECT 
    (SELECT COUNT(*) FROM orders WHERE MONTH(created_at) = 1) AS jan_orders,
    (SELECT COUNT(*) FROM orders WHERE MONTH(created_at) = 2) AS feb_orders,
    (SELECT COUNT(*) FROM orders WHERE MONTH(created_at) = 3) AS mar_orders,
    (SELECT COUNT(*) FROM orders WHERE MONTH(created_at) = 4) AS apr_orders
FROM orders
LIMIT 1;

Problems:

  • 4 separate table scans
  • MONTH() prevents index usage
  • Scans entire table for each subquery

✅ GOOD: Single scan with CASE

SELECT 
    COUNT(CASE WHEN created_at >= '2024-01-01' AND created_at < '2024-02-01' THEN 1 END) AS jan_orders,
    COUNT(CASE WHEN created_at >= '2024-02-01' AND created_at < '2024-03-01' THEN 1 END) AS feb_orders,
    COUNT(CASE WHEN created_at >= '2024-03-01' AND created_at < '2024-04-01' THEN 1 END) AS mar_orders,
    COUNT(CASE WHEN created_at >= '2024-04-01' AND created_at < '2024-05-01' THEN 1 END) AS apr_orders
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2024-05-01';

Benefits:

  • Single table scan
  • Index on created_at usable
  • Filters only relevant date range
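
As a side note, PostgreSQL (9.4+) also offers the FILTER clause, which is functionally equivalent to the CASE approach above and can read a little more clearly; shown here as an optional variant, not a replacement:

SELECT 
    COUNT(*) FILTER (WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01') AS jan_orders,
    COUNT(*) FILTER (WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01') AS feb_orders,
    COUNT(*) FILTER (WHERE created_at >= '2024-03-01' AND created_at < '2024-04-01') AS mar_orders,
    COUNT(*) FILTER (WHERE created_at >= '2024-04-01' AND created_at < '2024-05-01') AS apr_orders
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2024-05-01';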

Common Mistakes ⚠️

❌ Mistake 1: Over-Indexing

-- Creating indexes on everything
CREATE INDEX idx1 ON products(name);
CREATE INDEX idx2 ON products(price);
CREATE INDEX idx3 ON products(category_id);
CREATE INDEX idx4 ON products(stock);
CREATE INDEX idx5 ON products(created_at);
-- ... and 10 more!

Why it's bad: Each index slows down INSERT/UPDATE/DELETE operations. Only index what you actually query frequently.

✅ Better approach: Start with only the indexes your constraints require (primary keys and unique constraints are indexed automatically), then add more based on actual slow queries found in production logs or EXPLAIN analysis.
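
One way to ground that decision is to look for indexes that are never used (the statistics view shown here is PostgreSQL-specific):

-- Indexes that have not been scanned since statistics were last reset
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;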

❌ Mistake 2: Using EXPLAIN Without ANALYZE

EXPLAIN SELECT * FROM huge_table WHERE complex_condition = 'value';

Why it's bad: EXPLAIN shows estimates, but doesn't show actual execution time or if statistics are outdated.

✅ Better approach: Use EXPLAIN ANALYZE on non-production data, or constrain the query with LIMIT (keeping in mind that LIMIT can change the plan the optimizer chooses, so treat the timing as a rough check):

EXPLAIN ANALYZE SELECT * FROM huge_table WHERE complex_condition = 'value' LIMIT 100;

❌ Mistake 3: Ignoring Query Context

-- Optimizing a query that runs once per day
SPEND 3 HOURS → Save 200ms

-- Ignoring a query that runs 10,000 times per minute
IGNORE → Losing 500ms × 10,000 per minute = 5,000,000ms ≈ 83 minutes of cumulative query time wasted every minute!

✅ Better approach: Profile your application (a sample profiling query follows this list)! Focus on:

  1. Queries executed most frequently
  2. Queries taking longest individually
  3. Queries on tables growing rapidly
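
For example, PostgreSQL can report its most expensive statements via pg_stat_statements (assuming that extension is installed and enabled; the column names shown are those used in version 13 and later):

SELECT query,
       calls,
       total_exec_time,
       mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;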

❌ Mistake 4: Premature Optimization

-- Adding indexes before having data
CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(100));
CREATE INDEX idx_users_name ON users(name);
-- Table has 10 rows total...

Why it's bad: Small tables don't benefit from indexes. The overhead hurts more than it helps.

✅ Better approach:

  1. Write correct SQL first
  2. Test with realistic data volumes
  3. Measure actual performance
  4. Optimize only proven bottlenecks

❌ Mistake 5: Forgetting About NULL

-- Assuming this returns all non-admin users
SELECT * FROM users WHERE role != 'admin';
-- WRONG! Excludes users where role IS NULL

✅ Better approach:

SELECT * FROM users WHERE role != 'admin' OR role IS NULL;
-- Or use COALESCE:
SELECT * FROM users WHERE COALESCE(role, 'user') != 'admin';

Key Takeaways 🎯

  1. EXPLAIN is your friend: Always check execution plans for slow queries. Look for Seq Scans on large tables.

  2. Indexes are not free: They speed up reads but slow down writes. Index strategically based on query patterns.

  3. N+1 kills performance: Always fetch related data in one query using JOINs or IN clauses, not in application loops.

  4. SELECT only what you need: Avoid SELECT *. Be explicit about columns.

  5. Functions break indexes: WHERE YEAR(date) = 2024 can't use an index. Use WHERE date >= '2024-01-01' AND date < '2025-01-01'.

  6. Leftmost prefix rule: Composite index on (a, b, c) works for queries on (a), (a,b), (a,b,c) but NOT (b), (c), or (b,c).

  7. LIKE patterns: name LIKE 'john%' can use index, name LIKE '%john%' cannot.

  8. NULL behavior: Remember that WHERE column != 'value' excludes NULL rows. Test with NULL data!

  9. Measure, don't guess: Use EXPLAIN ANALYZE and production metrics to find real bottlenecks.

  10. Maintainable > Clever: Write clear, readable SQL. Future you (and your teammates) will thank you.


🔧 Try This: Optimization Exercise

Given this slow query:

SELECT * FROM orders 
WHERE DATE(created_at) = CURRENT_DATE 
  AND (status = 'pending' OR status = 'processing');

Identify three problems and how you'd fix them:

💡 Solution
  1. SELECT *: Only select needed columns
  2. DATE() function: Prevents index usage, use range instead
  3. OR clause: Consider UNION or IN clause

Optimized:

SELECT id, user_id, total, status
FROM orders 
WHERE created_at >= CURRENT_DATE 
  AND created_at < CURRENT_DATE + INTERVAL '1 day'
  AND status IN ('pending', 'processing');

Add index: CREATE INDEX idx_orders_date_status ON orders(created_at, status);


📚 Further Study

  1. Use The Index, Luke - https://use-the-index-luke.com/ - Excellent visual guide to database indexing
  2. PostgreSQL EXPLAIN Documentation - https://www.postgresql.org/docs/current/using-explain.html - Official guide to reading query plans
  3. SQL Performance Explained (Book) - https://sql-performance-explained.com/ - Deep dive into optimization across different databases

📋 Quick Reference Card

+--------------------------+---------------------------------------+
| CONCEPT                  | KEY POINTS                            |
+--------------------------+---------------------------------------+
| EXPLAIN                  | Shows execution plan                  |
|                          | Look for: Seq Scan (bad on big table) |
|                          | Use EXPLAIN ANALYZE for actual times  |
+--------------------------+---------------------------------------+
| Indexes                  | B-tree structure for fast lookups     |
|                          | Speed reads, slow writes              |
|                          | Index WHERE/JOIN/ORDER BY columns     |
|                          | Don't over-index!                     |
+--------------------------+---------------------------------------+
| Composite Index (a,b,c)  | Works for: (a), (a,b), (a,b,c)        |
|                          | Doesn't work for: (b), (c), (b,c)     |
|                          | Leftmost prefix rule!                 |
+--------------------------+---------------------------------------+
| N+1 Problem              | 1 query + N queries in loop           |
|                          | Solution: Use JOIN or IN clause       |
|                          | Fetch related data in one query       |
+--------------------------+---------------------------------------+
| Anti-Patterns to Avoid   | SELECT *                              |
|                          | Functions on indexed columns          |
|                          | OR in WHERE (often)                   |
|                          | LIKE '%prefix' (leading wildcard)     |
|                          | NOT IN with possible NULLs            |
+--------------------------+---------------------------------------+
| Optimization Workflow    | 1. Measure (find slow queries)        |
|                          | 2. EXPLAIN (understand plan)          |
|                          | 3. Fix (rewrite or add index)         |
|                          | 4. Verify (EXPLAIN ANALYZE)           |
+--------------------------+---------------------------------------+
| SQL Formatting           | Meaningful aliases                    |
|                          | One clause per line                   |
|                          | Comments for complex logic            |
|                          | Explicit column names                 |
+--------------------------+---------------------------------------+

Remember: Correct first, fast second. A slow query that returns correct data is better than a fast query that returns wrong data! ⚡