Introduction
As data engineers and analysts, we rely heavily on SQL databases to store and query our data efficiently. To speed up our queries, we often create indexes on frequently filtered columns. However, there’s a common gotcha that can cause our queries to run slower than expected, even with appropriate indexes in place.
In this post, we’ll explore how applying functions to indexed columns in the WHERE clause can prevent SQL optimizers from utilizing those indexes effectively.
πͺ The Power of Indexes
Let’s start with a quick refresher on indexes. An index is a data structure that allows the database to quickly locate and retrieve specific rows based on the indexed column(s). When you create an index on a column, the database builds a separate, sorted data structure that maps the column values to their corresponding row locations. This enables the database to find matching rows much faster than scanning the entire table.
For example, consider a user table with an index on the email column:
|
|
Now, when you query the table with a filter on the email column, the optimizer can use the index to quickly find the matching rows:
|
|
π΅οΈββοΈ The Hidden Pitfall: Functions on Indexed Columns
However, things can go awry when you apply functions to indexed columns in the WHERE clause. Let’s say you want to find all users whose email address is in a specific domain:
|
|
In this case, the SQL optimizer may not be able to use the index on the email column effectively. Why? Because the index is built on the raw email values, not the result of applying the SUBSTRING function to those values.
When you apply a function to an indexed column, the optimizer often has to scan the entire table or index to evaluate the function for each row. This can be much slower than directly looking up the raw column values in the index.
π Real-World Examples
Let’s look at a couple more examples where applying functions to indexed columns can ditch performance:
- π Case-Insensitive Search
Suppose you want to find all users with a specific first name, regardless of case:
|
|
Even if you have an index on the first_name column, the optimizer may not be able to use it efficiently because the index is built on the original case-sensitive values, not the lowercase versions.
- π Date Manipulation
Consider a query that finds all orders placed on a specific day:
|
|
If you have an index on the order_date column, the optimizer may struggle to use it because the index is built on the raw timestamp values, not the truncated date values.
Workarounds and Solutions
So, what can you do when you need to apply functions to indexed columns in your queries? Here are a few strategies:
1. Create Function-Based Indexes
Some databases, like Oracle and PostgreSQL, support function-based indexes. These indexes are built on the result of applying a function to a column, rather than the raw column values. For example:
|
|
With this index, the optimizer can efficiently handle queries that use LOWER(first_name) in the WHERE clause.
However, function-based indexes have limitations. They are not supported by all databases, and they can be less flexible than regular indexes because they are tailored to specific function use cases.
- Rewrite Queries
In some cases, you can rewrite your queries to avoid applying functions to indexed columns. For example, instead of using LOWER(first_name), you could store a separate lowercase version of the first_name column and create a regular index on it:
Then, you can rewrite the query to use the first_name_lower column directly:
|
|
This approach requires additional storage space and maintenance, but it can enable the optimizer to use the index efficiently.
- Use Optimizer Hints
Some databases provide optimizer hints that allow you to force the use of a specific index or join order. While hints should be used sparingly and with caution, they can be helpful in cases where the optimizer struggles to choose the best execution plan.
In PostgreSQL, you can use the pg_hint_plan
extension to provide optimizer hints. First, you need to install the extension:
|
|
Then, you can use hints like SeqScan
, IndexScan
, or IndexOnlyScan
to suggest the optimizer to use a specific scan method. For example:
However, optimizer hints are not a silver bullet. They can make your queries less portable and more brittle to schema changes. It’s generally better to let the optimizer make its own decisions based on statistics and cost estimates.
π Demonstrating the Performance Impact with PostgreSQL and Docker
Let’s see the performance impact in action using a PostgreSQL Docker container. Follow these steps:
- π³ Pull the PostgreSQL Docker image and start a container:
- π Connect to the PostgreSQL container:
|
|
- π§ Execute the SQL statements to create a table, insert sample data, and create an index:
- π Query using the indexed column directly:
We get this:
Let’s conclude that the optimizer efficiently uses the index to find the matching row.
- π Query using a function on the indexed column:
|
|
From the query plan, we can all see that the last query is ~290 times slower. βββ
Notice that applying the SUBSTRING function prevents the optimizer from using the index effectively, resulting in a sequential scan of the entire table.
The performance difference becomes significant as the table size grows. Applying functions to indexed columns can lead to much slower query execution compared to leveraging indexes directly. β°
The query that uses hints will perform the same as the query from the fourth step.
Conclusion
As data engineers and analysts, it’s crucial to understand how SQL optimizers work and to be aware of potential performance gotchas.
By keeping an eye out for function application on indexed columns and choosing appropriate strategies to address them, we can ensure that our queries run efficiently and our databases perform at their best. πͺπ