Mastering MySQL Indexing

Indexes are tricky things. In my experience, indexes are added whenever SQL queries are too slow. This makes sense. However, sometimes these indexes were added without being thought through enough (I am guilty of this). Sometimes they were thought through, but the table itself now has so many indexes that the optimizer doesn't know which one to choose from. 
The overhead of indexes greatly effects tables that need to be written to or altered often (even indexes on NoSQL database greatly effect writes).
Over indexed tables inflate the database size dramatically which adds to people's concerns that the database becomes unwieldy.

I have written in the past about "why are your indexes larger then your data" which contains methods to find unused indexes. This post contains a lot more research and testing since then and I believe offers a more holistic approach to indexing.

Note: the examples in this post are taken from a system running Drupal.


Tools


Lets get started. To find how to index your tables, you need to gather information about how your database server is being used. The tools I use are as follows:

pt-query-digest 

Since the days of mysqlsla, this tool has come far. It is the bread and butter of any performance audit. If you want to know which queries run on your server, you will need this tool to process those queries and order them nicely by worst/highest response time. There are many ways to actually use pt-query-digest and I will show a few of them in this post.
However, pt-query-digest has its issues. It is extremely resource intensive. So much so, that it is not recommended to run it on your main database server, but a redundant slave or an unused machine. I personally had it OOM quite a few times and caused a 'kernal panic' on one occasion. Also, the larger the log file you use with pt-query-digest the more exponentially longer it takes to finish processing it.  I personally, do not recommend taking a log for longer than a 24 hour period or 30Gb of log file size, whichever is smaller.

pt-index-usage 

When this tool came out, I could not wait to get my hands on it. It is a tool that runs the same slow log you use for pt-query-digest. It queries a database server with EXPLAIN statements from what it finds in the log. It then calculates which indexes were not used and prints out an ALTER TABLE statement to drop those indexes.
The problem with this tool is that it is largely unstable and unsafe to use. By its nature, it will flood the database with EXPLAIN calls and force the database to open tables. If your my.cnf variables are not suited for this, you can basically max out the tables that can be open on your server. When an app sends an SQL statement to the server at that point, it will give back an error. If it sends an SQL statement to a table that's already open, it will give a regular result. 
Apart from opening too many tables, this tool can also corrupt MyISAM tables in the same respect that when a client doesn't close a MyISAM table properly, it will be marked as crashed.
One time, I saw it managed to corrupt some performance_schema tables.
The way I use this tool, is that I limit the work it should do DRASTICALLY: 
  • I choose a specific table I want to review
  • I filter the slow log to use only the statements which query that table
  • I limit pt-index-usage to only look for that table. 
If I miss the log filtering part, it causes me problems, so I do all the steps to play it safe.

user_statistics 

This is a google patch for MySQL that can be found in Percona Server and MariaDB. Although you can patch it yourself as well. 
It keeps track of all the tables that were read from or written to as well as the indexes that were read from. It also keeps track of users, but for this post, I will refer to the tables and indexes tracking. 
What is nice about this patch/plugin is that it is very lightweight - completely unlike the two tools mentioned above. You can get information about tables and indexes over the course of a few days to a few weeks which is very difficult with both of the previous tools. You can then for example, safely remove an index after noticing that it hasn't been used for 3 weeks. In a short period of time, you may miss indexes that are used for reports or daily aggregation statements. 
The problem with this patch/plugin is that you only get the number of rows that were read or changed. Those numbers can confuse you when one million queries could have read one row very efficiently and another query read one million rows very inefficiently. All that data is bunched together and you lose the granularity to know how it happened.


Timing is Everything


As I have eluded to before, there is a difference between the tools and the ranges of time each can offer.

A drupal table dp_users. Taken with mycheckpoint.
The above graph is taken with mycheckpoint and monitors the rows that are read on one of the most heavily read table on this system.
mysql> select table_name, rows_read from information_schema.table_statistics order by rows_read desc limit 5;
table_namerows_read
dp_users2302477894
dp_node1231318439
dp_comments1071462211
dp_userpoints1033073070
dp_search_index260154684
5 rows in set (0.00 sec)
After I found that out, I set up mycheckpoint to custom monitor this table, just so I can have a better appreciation for how this table is used.

Note: When I refer to "slow log" in this post, I am referring to a slow log taken on a server that has microslow patch (available in Percona Server and MariaDB) and that the long_query_time was set to 0 for the period the slow log was taken.

You can see that there are two spikes (around 16:00 and 7:00) for that day. 
If you take a slow log between 16:30 and 22:30, you will miss the queries that caused those spikes.
If you take a slow log for a short period of time around 15:00 and 18:00, that peak will skew your data and you will believe that the worst queries in your report have a much greater effect on your server capacity than they might do.
Alternatively, if you take a slow log for a 24 hour period, you may miss entirely the queries that cause the spikes as their impact has been diluted across a long period of time AND those slow queries may be the queries that have the most negative impact on your users.

There is no real ideal time period to measure. It depends on what you are setting out to do. If there is an issue with the system being slow at a certain peak time, then you can take a slow log for that period.
If you want a broad over view, you can try for a 24 hour period. 
But be aware that this information can be imperfect and that you might need to do some more digging and a couple of tries to get a better idea of what is happening on your server.

So what I am saying is, that there is more than one way to view how your queries run on your server.


How I would gather data


Here is a list of data gathering techniques that are order by hierarchy
  1. Ideally, a slow log from a database server with microslow patch and the user_statistics patch.
  2. Slow log with microslow patch (no user_statistics).
  3. TCPdump + 'pt-query-digest --type tcpdump' with additional filtering  (with no microslow patch) 
  4. General log and a slave server to run 'pt-query-digest --execute' on (but you should only execute the SELECT statements).
  5. Regular slow log with 1 sec or above (I would at least get slow queries)
  6. Polling a server 'pt-query-digest --processlist'. 
For my examples, I have information from option 1.


Processing the data


There are in fact, many ways to slice and dice your gathered data. I will demonstrate the methods I use and the reasons for them. The end result should be to:
  1. Find the worst performing queries to improve. Either by changing the SQL or changing the table indexes to suit them better.
  2. Find out which tables need index changes AND get data about which changes that will give the best overall results. 
  3. Verify that the changes you made have actually made an improvement or not.
With that in mind, what I try to find out from my data is:

Worst queries running on the server by response time 

This is really what you are interested in. Response time is measured by calculating the number of times a certain query has run and the time each query took. 
There is distinction between a "call" which is the count of that query versus response time. You can have a query that ran many times and had very low response time and you can have a query that ran a few times but had a very high response time. 
You are interested in the response time, but keep an eye on calls and 'R/call' which is response time divided by calls.

pt-query-digest slow.log > slow.txt
Time range: 2012-03-17  00:00:01 to05:16:20
RankResponse timeCallsR/CallItem
======================================
1480.9273 6.3%6000.8015SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
2322.4220 4.3%1292580.0025ADMIN INIT DB
3314.8719 4.2%302200.0104UPDATE dp_users
4287.7109 3.8%516060.0056SET
5269.3434 3.6%6000.4489SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
6238.8571 3.1%29021410.0001SELECT dp_url_alias
... 


Interpretation - Our queries are quite evenly using the server. What you would like to see is a nice 10%-20% or more type of query to be able to get 'low hanging fruit'-returns when you optimize those queries.
Queries 1 and 5 are many-join'd table queries.
Query 3 is an update statement.
Query 6 is a select statement which is used almost 3 million times during the period the log was taken, but has low response time.
Queries 2 and 3 are possibly some sort of bugs with the tool. To filter those issues out, you can run:

pt-query-digest slow.log --filter '$event->{fingerprint} =~ m/^(!?select|update|delete|create|call|insert|replace)/' > slow.txt


Worst queries running on the server that process a lot of rows

I call these 'bulky' queries. From my research, these kind of queries give me information about many-joins queries and sometimes slow queries. What these queries do is make the server work harder and I usually like to know what those queries are.

pt-query-digest slow.log --filter '($event->{Rows_examined} > 1000)'
Time range: 2012-03-17 18:44:23 to 2012-03-20 05:17:32
RankResponse timeCallsItem
=================================
2461.6309 8.2%1322SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
21788.6931 6.0%1043SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
31620.4995 5.4%4056SELECT dp_node dp_comments dp_node_comment_statistics
4978.7295 3.3%217645SELECT dp_users
5935.3744 3.1%715SELECT dp_node dp_users dp_node_access
6870.0052 2.9%1247SELECT dp_users
7845.3837 2.8%708SELECT dp_node dp_users dp_node_access
....

Interpretation - The command filters by rows_examined > 1000. This is open to change, but for me, it works for now. 
Rows_examined means that the server had to read more then 1000 rows from disk, process them in some way such as filtering or sorting and then display the result. 
Queries 1 and 2 look like they access the same tables. If it is a matter of changing indexes, I would need to find the root table in those queries and make changes to it.
Queries 5 and 7, same as above.
Queries 4 and 6 access the dp_users table. One is faster to respond and one is much slower. There is clearly room for improvement with those queries and that table.
One important thing to notice is that there seems to be more or less the same tables in all of the top 'bulky' queries.

Most Updated/Affected Tables

The tool pt-query-digest is quite a versatile tool and you can use it to get more information from your slow log than just worst queries. What I like to do is use it to get a list of top tables that are the most updated, written to or deleted from. I do this to help me know which tables I should be more 'aggressive' when it comes to removing indexes.

pt-query-digest slow.log  --filter '($event->{fingerprint} =~ m/^(!?insert|replace|update|delete)/) && ($event->{Rows_affected} > 0)'  --group-by >affected_tables.txt
Time range: 2012-03-17 18:44:21 to 2012-03-20 05:16:02
RankResponse timeCallsR/CallItem
'================================================
11880.1428 25.5%5304200.0035dp_users
2543.0294 7.4%5051.0753dp_xmlsitemap_node
3411.2947 5.6%2407080.0017dp_cache_page
4284.5491 3.9%6970.4082dp_node_comment_statistics
5276.1430 3.7%20070.1376dp_node
6270.5004 3.7%2171.2465dp_comments
7256.5459 3.5%5452480.0005dp_sessions
...

Interpretation - It appears that the UPDATE dp_users statement from the first report shows up here as well. It looks like dp_users is a 'hot' table that has many changes applied to it and that optimizing this table would help implement those changes faster.
It is also worth noting that dp_xmlsidemap_node, dp_node and dp_comments may run queries that are quite difficult to apply to the tables or that the tables are quite large and it takes a long time to apply to it. This could actually be a number of reasons, but it is good to get an understanding about those tables. We can start thinking about partitioning, index changes,  changing server variables or changes to the underlying file system. Once we made any changes, we can focus on those tables to see if it had an effect on them.


Most Referenced Tables

This is a report that allows me to see which tables have the most queries run against them that are or could use an index to get their results. This is helpful with indexing, especially if making a change to one table can have a useful benefit to many queries.

Note, while we are keeping an eye on response times, calls are also important. In my experience, the tables with high numbers in both usually show up in this report.

pt-query-digest slow.log  --filter '($event->{fingerprint} =~ m/^(!?select|update|delete)/) && ($event->{Rows_read} > 0)'  --group-by tables >read_tables.txt
Time range: 2012-03-17 18:44:21 to 2012-03-20 05:17:32
RankResponse timeCallsItem
'==========================================
117712.6402 15.6%19311364dp_users
25787.0542 5.1%62323064dp_signup
35154.7679 4.5%83492650dp_event
44712.0806 4.2%42915664dp_users_roles
52139.8459 1.9%42192873dp_url_alias

Interpretation - You may notice that I have duplicated the filters for UPDATE and DELETE. This is because they can potentially use an index to process their statements. It is important to note that badly indexed UPDATE and DELETE statements can greatly effect innodb. So there is some overlap between the affected tables report and this one.
Baring that in mind, dp_users maybe a heavily used table and we may need to investigate the queries that are run against this table. 
You can run this report again with only SELECT chosen if you want to get a different point of view on the data.


What do we have so far?


For the most affected tables, I should avoid from over indexing or reduce indexes if I can.
For the most referenced tables, I should make sure the current indexes are as effective as they can be or add more indexes. 
Please notice per my example, there can be situations when some of those lists may have overlapping tables - which means you have to be especially careful.

We have the worst queries by response time and the worst bulky queries

For good measure, lets do one more thing and get the tables that are connected with the worst bulky tables report:
pt-query-digest slow.log --filter '($event->{Rows_examined} > 1000)'  -group-by tables >bulky_tables.txt
Time range: 2012-03-17 18:44:23 to 2012-03-20 05:17:32
RankResponse timeCallsItem
==========================================
1
13745.2407 18.3%73055dp_node
2
13678.9905 18.2%416139dp_users
3
11212.2637 14.9%39703dp_node_access
4
4308.1558 5.7%2630dp_buddylist
5
4303.9987 5.7%2404dp_usernode
6
3172.2970 4.2%6630dp_comments

Interpretation - You are probably not surprised by this list of table if you had seen the bulky queries report. However, dp_node has now pop'd up on my radar and dp_node_access as well. 
I had seen some of these before on this list:
mysql> select table_name, rows_read from information_schema.table_statistics order by rows_read desc limit 5;
table_namerows_read
dp_users2302477894
dp_node1231318439
dp_comments1071462211
dp_userpoints1033073070
dp_search_index260154684
5 rows in set (0.00 sec)
So it can mean that the bulky queries do read a considerable number of rows from some of these tables 

Improving the queries that use these tables can have a very positive effect on the slowest and bulkiest queries. It might also possibly lighten the load on the server with regards to fetching a lot of rows which will improve IO.


On this system, reducing those bulky queries or optimizing them can help quite a bit.


Let's Focus


We have a lot of information about the system already, although some information me be more accurate then other. 
What we now need to do is focus. In my assessment, I have initially chosen to focus on one table to start of with: dp_users.
mysql> show create table dp_users\G
*************************** 1. row ***************************
       Table: dp_users
Create Table: CREATE TABLE `dp_users` (
  `uid` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(60) NOT NULL default '',
  `pass` varchar(32) NOT NULL default '',
 ...
  PRIMARY KEY  (`uid`),
  UNIQUE KEY `name` (`name`),
  KEY `created` (`created`),
  KEY `access` (`access`),
  KEY `status` (`status`),
) ENGINE=InnoDB CHARSET=utf8
1 row in set (0.00 sec)

What I can do now is get all the queries that reference this table:

pt-query-digest slow.log  --filter '($event->{fingerprint} =~ m/^(!?select|update|delete)/) &&  ($event->{arg} =~ m/dp_users /) ' --no-report --print >dp_users.log
(filters the main log by queries that reference dp_users and use an index)
pt-query-digest dp_users.log --limit 100% >tbl_dp_users.txt

Time range: 2012-03-17 18:47:03 to 23:25:22
Rank Response time  Calls   Item
==== ============== ======= ===========
   1 209.2863 10.7%   88850 UPDATE dp_users SET access = 133******3 WHERE uid = 23****01\G
  -2 201.7256 10.3%      91 SELECT dp_node dp_node_comment_statistics dp_users dp_comments dp_node_access
   3 162.2711  8.3% 1309010 SELECT access FROM dp_users WHERE uid = 21***4\G
   4 139.9009  7.1%     197 SELECT uid, name FROM dp_users WHERE status = 1 ORDER BY created DESC\G
   5 133.8691  6.8%     327 SELECT * FROM dp_users u WHERE LOWER(name) = LOWER('s******s')\G
   6 109.6903  5.6%   29152 SELECT du.name, du.created, du.picture FROM dp_users du WHERE du.picture!='' AND du.status = '1' AND du.created BETWEEN '133*****0' AND '133*****60'\G
  -7  92.9095  4.7%  360642 SELECT dp_node dp_users dp_node_revisions
   8  74.2426  3.8%     106 SELECT * FROM dp_users u WHERE LOWER(name) = LOWER('hoa****rio') AND pass = '3837********5f9b' AND status = 1\G
  -9  70.3827  3.6%     436 SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
-10  67.7836  3.4%     436 SELECT dp_node dp_usernode dp_buddylist dp_users dp_node_access
...

For the sake of clarity, I have marked in bold the queries that I want to focus on and that is because the other queries are join'd queries where the dp_users table is not the main underlying table.
Queries 1 and 3 use uid column and query 3 gets it's results from the access column. There maybe an opportunity for a covering index.
Query 5 and 8 need their SQL to change. If you wrap a column with a function, you cannot make use of the index very well and there is already an index on the name column. Plus, MySQL is case insensitive anyway.
Query 6 will probably bring up quite a few rows back. It uses status column to search on, created column to sort on and uses uid and name for its results.


Time for pt-index-usage.
pt-index-usage dp_users.log --host 127.0.0.1 --tables dp_users >idx_dp_users.txt
ALTER TABLE `dp_users` DROP KEY `access`;

Hmmm... so the index on access isn't being used at all or sometimes being used when another index is used in a higher number of situations. 
After testing I have done personally in the past, I do tend to trust the output that pt-index-usage gives me.


Indexing our Chosen Table


Our indexes are these:

PRIMARY KEY  (`uid`),
  UNIQUE KEY `name` (`name`),
  KEY `created` (`created`),
  KEY `access` (`access`),
  KEY `status` (`status`)


Let us remove the index on access, but I would like there to be a covering index for the majority of the queries (queries 1 and 3 considering calls and response time) on this table. I would like to try and make the primary key (uid,access) - which would make the primary key a covering index.
I also notice that the index on status would benefit from being a composite index of (status,created) - from queries 4 and 6.


So I plan to have: 
 PRIMARY KEY  (`uid`,`access'),
  UNIQUE KEY `name` (`name`),
  KEY `created` (`created`),
  KEY `status` (`status`,`created`)

I cannot remove the index on name because its a unique constraint. I am not sure to remove the index on created as it may be referenced by other queries.



Testing My Theory


I apply the table change to a slave database using ALTER TABLE or pt-online-schema-change (which I like a lot, but hate the one unique index limitation). It is important to test the indexes on a server that has data in the amounts that the master has. The optimizer makes its choices based on which index gives the best results for the query it is running, from the data it has.

You can test your results by running the following tool:
pt-query-digest dp_users.log --filter '($event->{fingerprint} =~ m/^(select)/)' --limit 100% --execute 127.0.0.1 --explain 127.0.0.1 >test_dp_users.txt

But in this case, the tool is running against a slave that it's cache is not as warm as the master's is and the results will not be good and you need to keep that in mind. 
In my report, it showed a lot of results in EXPLAIN as USING INDEX which is good.
You may also notice that I am only running SELECTs. This is because I do not want UPDATE/DELETE/INSERT statements to be applied to my slave server. So I do not have the full view of all the other queries.


I also ran the same pt-index-usage tool against the slave server and got this result:
ALTER TABLE `dp_users` DROP KEY `created`; 

Ok, so the index on created isn't used and I can remove it.

What have I done so far?

  1. I have identified a table that I think is a "hot table" or a table that I think is important to check the queries against.
  2. I have filtered a slow log to show the queries running against that table.
  3. I have made some observations
  4. I made some decisions to change 5 indexes down to 3. It is important to mention that I also enlarged the primary key, which isn't something you should do lightly.
  5. I tested them to some degree
  6. I ran pt-index-usage again for good measure
  7. Applied the changes to the master server and monitored it for a while

Applying and Testing


Since this table has a unique index I can use pt-online-schema-change and I ran this command on the master server:

pt-online-schema-change h=127.0.0.1,D=prod,t=dp_users -P3306 --alter "DROP KEY access, drop primary key, add primary key (uid, access),drop key status, add key idx_status_created(status,created), drop key created;" --drop-old-table --execute


I kept an eye on the process, everything went well.

I kept an eye on the server and noticed if there was no serious performance problem. Now, you would do your best to prevent any serious problems that may happen, but keep your eye on it non the less. 

A couple of days later, I took another slow log and compared the results.

RankQuery IDResponse timeCallsR/CallItem
160x9E5D9252DA7F38C870.3198 0.9%247880.0028SELECT dp_users 
140x9E5D9252DA7F38C8184.6790 1.2%631400.0029SELECT dp_users

totalminmaxavg95%stddevmedian
70s18us671ms3ms6ms5ms4ms
185s16us212ms3ms6ms3ms3ms

Interpretation - The first/top query is from the first slow log and the second/bottom query is from the second slow log. 
This query stayed more or less the same. The maximum went down quite a lot and the standard deviation and median went down by a bit. This means that they are more clumped together around the same time periods. 
All in all, not a massive change apart from the maximum.

Lets look at another query:

RankQuery IDResponse timeCallsR/CallItem
40x6882EF28DAF29509314.8719 4.2%302200.0104UPDATE dp_users
40x6882EF28DAF29509460.8619 3.0%1031260.0045UPDATE dp_users

totalminmaxavg95%stddevmedian
315s69us20s10ms925us306ms287us
461s91us25s4ms1ms220ms384us


This is a bit confusing to me. The response time is much better with the change. R/Call has gone down from 0.0104 to 0.0045 and that is a 231% improvement.
However, the minimum, maximum and median went up. The average and standard deviation went down which means the are much more tightly clumped around certain time periods. 
It appears that while I have reduced the overall time, I have added a little overhead in the process.

To remove the overhead, I can change the primary key back to (uid) and see if improves it. I cannot however, partition the table as there are two unique indexes in that table.


The End


Congratulations on reading the entire (and extremely long) article!

I hope that I have shown you new ways to holistically review a table's indexes and that these methods will benefit you.

Now go forth and experiment!





6 comments:

  1. You should try this tool: https://github.com/box/Anemometer/wiki

    It makes the kind of analysis you're doing a breeze, because you don't have to run pt-query-digest over and over again. Just process it once, and then query the results in different ways.

    ReplyDelete
  2. Thanks, I'll take a look at that.

    ReplyDelete
  3. I'm trying to use this great tool in Windows, with perl 5.10.1.2

    It works fine, but when I try to apply anyone of your "--filter" examples msdos command line says:

    perl ..\..\percona-too
    lkit-2.1.1\bin\pt-query-digest pcclonico-slow.log --filter '($event->{Rows_exami
    ned} > 1000)' > h.txt
    Error compiling --filter code: sub { my ( $args ) = @_; my $event = $args->{even
    t}; ( '($event- ) && return $args; };
    Can't find string terminator "'" anywhere before EOF at (eval 27) line 1.

    Does anybody know what I'm doing worng ?

    Thanks a lot

    ReplyDelete
  4. try:

    lkit-2.1.1\bin\pt-query-digest pcclonico-slow.log --filter "($event->{Rows_examined} > 1000)" > h.txt

    ReplyDelete
  5. Jonathan, thanks a lot.

    I seems to be a bit stupid. (je, je)

    Bye,

    ReplyDelete