In this post, we’ll look at how to improve queries using generated columns and ProxySQL instead of implementing a referenced table.
Developers and architects don’t always have the time or complete information to properly analyze and design a database. That is why we see tables with more fields than needed, or with incorrect types. The best solution is implementing a change in the database schema and/or application level. In this post, we’ll look at an example of generated columns (using a char field) instead of creating a referenced table, and how using generated columns and ProxySQL avoids changes at the application level.
For this example, I will be using the film table of the Sakila database (with some changes). The original film table had a language_id as tinyint, which refers to the language table:
1 2 3 4 5 6 7 8 9 | CREATE TABLE film ( film_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT, title VARCHAR(255) NOT NULL, description TEXT DEFAULT NULL, release_year YEAR DEFAULT NULL, language_id TINYINT UNSIGNED NOT NULL, original_language_id TINYINT UNSIGNED DEFAULT NULL, ..... )ENGINE=InnoDB DEFAULT CHARSET=utf8; |
I simplified the design of the table and added the language field as a char(20), as it is in table language:
1 2 3 4 5 6 7 8 9 10 | CREATE TABLE `film` ( `film_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `title` varchar(255) NOT NULL, `description` text, `release_year` year(4) DEFAULT NULL, `language` char(20) NOT NULL, `rating` enum('G','PG','PG-13','R','NC-17') DEFAULT 'G', PRIMARY KEY (`film_id`), KEY `idx_title` (`title`) ) ENGINE=InnoDB AUTO_INCREMENT=1001 DEFAULT CHARSET=utf8 |
We inserted 1M records and executed a query that filters by language:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | mysql> select count(*) from film where language ='Mandarin'; +----------+ | count(*) | +----------+ | 256000 | +----------+ 1 row in set (0.92 sec) mysql> explain select count(*) from film where language ='Mandarin'; +----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+ | 1 | SIMPLE | film | NULL | ALL | NULL | NULL | NULL | NULL | 1014813 | 10.00 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+ 1 row in set, 1 warning (0.02 sec) |
The explain plan shows that it is performing a full table scan. Adding an index over language speeds up the query from 0.92 sec to 0.09 sec:
1 2 3 4 5 6 7 8 9 10 11 12 | mysql> ALTER TABLE film ADD INDEX (language); Query OK, 0 rows affected (4.87 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> select count(*) from film where language ='Mandarin'; +----------+ | count(*) | +----------+ | 256000 | +----------+ 1 row in set (0.09 sec) |
And the explain plan shows that it is using the language index:
1 2 3 4 5 6 7 8 9 10 11 12 13 | *************************** 1. row *************************** id: 1 select_type: SIMPLE table: new_film partitions: NULL type: ref possible_keys: language key: language key_len: 60 ref: const rows: 245136 filtered: 100.00 Extra: Using index |
Now, we are going to add a generated column and an index over this new field:
1 2 3 4 5 6 7 8 9 10 11 | ALTER TABLE film ADD COLUMN language_id tinyint GENERATED ALWAYS AS ( CASE language WHEN 'English' THEN 1 WHEN 'Italian' THEN 2 WHEN 'Japanese' THEN 3 WHEN 'Mandarin' THEN 4 ELSE 0 END ) VIRTUAL ; ALTER TABLE film ADD INDEX (language_id); |
Finally, we are going to add a rule in ProxySQL with the replace pattern:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | replace into mysql_query_rules (rule_id,match_pattern,replace_pattern,apply,active) values (3, "(.*)FROM(.*)film(.*)WHERE(.*)language ='(.*)'(.*)", "1 FROM 2 film 3 WHERE 4 language_id = ( CASE '5' WHEN 'English' THEN 1 WHEN 'Italian' THEN 2 WHEN 'Japanese' THEN 3 WHEN 'Mandarin' THEN 4 ELSE 0 END ) 6",1,1); LOAD MYSQL QUERY RULES TO RUNTIME; |
Now, when we execute the query through ProxySQL, it will replace:
1 | SELECT count(*) FROM film WHERE language ='Mandarin'; |
With:
1 2 3 4 5 6 7 8 9 | SELECT count(*) FROM film WHERE language_id = ( CASE 'Mandarin' WHEN 'English' THEN 1 WHEN 'Italian' THEN 2 WHEN 'Japanese' THEN 3 WHEN 'Mandarin' THEN 4 ELSE 0 END ) |
The explain plan shows it is using the language_id index:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | mysql> explain SELECT count(*) FROM film WHERE language ='Mandarin' G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film partitions: NULL type: ref possible_keys: language_id key: language_id key_len: 2 ref: const rows: 507142 filtered: 100.00 Extra: Using index |
And that the query time decreases to 0.06 sec:
1 2 3 4 5 6 7 | mysql> SELECT count(*) FROM film WHERE language ='Mandarin'; +----------+ | count(*) | +----------+ | 256000 | +----------+ 1 row in set (0.06 sec) |
This speed up (from 0.09 to 0.06) on the select is because it uses a smaller index, as you can see in the key_len of the explain plan. Performing a range scan over entries of 64 bytes is slower than performing a range scan over entries of 8 bytes (the amount of data to read is less). For instance, the index_length with just the language index is 36274176 (34.6MB), while over the language_id is 14172160 (13.5MB).
Conclusion
Generated columns are not a long-term solution. The long-term solution is to add a table, or have the application internally use a referenced table. However, I found it particularly useful or practical in two scenarios:
- When database performance is dropping and change at the application level is not a priority for the business. This effort can be implemented by the DBA team without any other intervention.
- When you must show the performance improvement that the change will make before modifying anything at the application level.
Cons, you need to install ProxySQL (if you are testing, you can install it just on the test server), which adds a new layer (more complex) to your infrastructure (you need to use MySQL 5.7 at least, as well). Another thing to take into account is that the writes on the table will be a bit slower because of the newly generated columns and the new indexes.
I think you have an error here:
SELECT count(*) FROM film WHERE language_id = (
CASE ‘Mandarin’ <– should be language, no?
WHEN 'English' THEN 1
WHEN 'Italian' THEN 2
WHEN 'Japanese' THEN 3
WHEN 'Mandarin' THEN 4
ELSE 0
END
)
Hi Jonathan,
No, the query is correct. ProxySQL will change to the CASE statement, and that case will be simplified to 4, at the end, mysql will execute:
SELECT count(*) FROM film WHERE language_id = 4;
which is what we are looking for.