My Most Useful MySQL Performance Tools

Here is a list of the tools I find most useful when doing performance audits.
Please note, I am writing this mainly for myself: I sometimes end up hunting for these commands in my other blog post about mastering indexing, so keeping them here may save me time, and it also captures a few changes that have happened over the years.

Regular Slow Log Report
pt-query-digest slow_query.log > slow.txt

All queries (of the kinds that can use indexes: SELECT, UPDATE, DELETE) against a certain table
pt-query-digest slow_query.log --filter '($event->{fingerprint} =~ m/^(?:select|update|delete)/) && ($event->{arg} =~ m/mytable /)' --limit=100% > mytable.txt

Longest-running SELECT queries - the most painful queries, with their response-time percentage right next to them.
pt-query-digest slow_query.log --filter '($event->{fingerprint} =~ m/^(?:select)/)' --order-by=Query_time:max > select.txt

Filter the slow log down to all SELECT queries against a certain table
pt-query-digest slow_query.log --filter '($event->{fingerprint} =~ m/^(?:select)/) && ($event->{arg} =~ m/mytable /)' --no-report --output=slowlog > mytable.log

Find unused indexes from the pre-filtered per-table log
pt-index-usage mytable.log --host 127.0.0.1 --tables mytable > mytable_indexes.txt


Find the top 15 largest tables on a server (use with caution) - from www.mysqlperformanceblog.com
-----------------------------
delimiter $$
create procedure dba.largest()
begin

SELECT CONCAT(TABLE_SCHEMA, '.', TABLE_NAME) AS TABLE_NAME,
CONCAT(ROUND(TABLE_ROWS / 1000000, 2), 'M') `ROWS`,
CONCAT(ROUND(DATA_LENGTH / ( 1024 * 1024 * 1024 ), 2), 'G') DATA,
CONCAT(ROUND(INDEX_LENGTH / ( 1024 * 1024 * 1024 ), 2), 'G') IDX,
CONCAT(ROUND(( DATA_LENGTH + INDEX_LENGTH ) / ( 1024 * 1024 * 1024 ), 2), 'G') TOTAL_SIZE,
ROUND(INDEX_LENGTH / DATA_LENGTH, 2) IDXFRAC
FROM INFORMATION_SCHEMA.TABLES
ORDER BY DATA_LENGTH + INDEX_LENGTH DESC
LIMIT 15;
end $$
delimiter ;
---------------------------------
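To use it later (the procedure is created in a dba schema, so that schema has to exist first; ROWS is backticked above because it became a reserved word in newer MySQL versions):

create database if not exists dba;
-- then, after creating the procedure above:
call dba.largest();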

ToDo:
Replay the filtered log against a test database with the Percona Toolkit log player to test index improvements.
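A rough sketch of how that replay could look with pt-log-player, written from memory - pt-log-player shipped with Percona Toolkit up to 2.1 (it was removed in 2.2), and the host and directories below are placeholders, so double-check the options against the docs of the version you have:

pt-log-player --split Thread_id --base-dir ./sessions mytable.log
pt-log-player --play ./sessions --only-select --base-dir ./results h=127.0.0.1
pt-query-digest ./results/*

The idea: split the filtered slow log into per-connection session files, replay just the SELECTs against a test server, then digest the result files (they are written in slow log format) once before and once after adding the candidate index, and compare the two reports.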


My Job in a Nutshell

(http://www.commitstrip.com/en/2014/08/01/when-i-help-a-rookie-coder-fix-his-queries/)

Lessons Learned through working with Infobright

Infobright is a great database technology for reporting and analytics, but it also comes with its own quirks.
Edit: Every single one of the columnar database technologies, and even the standard ones, has quirks. Infobright does not have an unusually large number of them compared to anyone else. Other columnar-store vendors simply won't tell you what their quirks are, and you are left to discover them on your own. In some cases they can be crippling, or of the "maybe we need something else, even though we've spent 0.X million euros on it" kind.

I'd like to share those quirks with you now in an effort to make transitioning to it smoother:

  1. No user-defined functions - no select my_special_calculations(col1) from table; - unless you have a tiny dataset.
  2. No Views - you can use views, but they will not use the Infobright optimizer; they fall back to the MySQL one, which makes them very slow.
  3. coalesce(col1,'') will cause the query to loop forever - use ifnull(col1,'') instead. coalesce(col1,col2,'') should work OK.
  4. Do not use "funny" dates or dates that don't exist - the query will return no results. For example, the 31st of June doesn't exist, as June only goes up to 30.
  5. UPDATE/DELETE/REPLACE are available in the Enterprise version only - they will be slow and, in the case of UPDATE, will fragment the data on disk.
    1. I try to use DELETE plus INSERT INTO ... SELECT to update data, but I have been stung a lot when my deletes are too large - they take forever - and updating individual rows is slow as well. In short, try to touch as little as possible of the data that is already in Infobright, and do it in small chunks (see the sketch just after this list). Bulk INSERTs or LOAD DATA INFILE are no problem.
  6. Do not use scalar sub-queries: select id,..... (select id from newtable) as new_id from very_large_table where..
  7. Querying a large table (with very few WHERE clauses) and adding many joins can still be slow. Try putting the large table in its own sub-query: select * from (select date, .. from largetable group by...) inner join othertable.. - a fuller version is sketched after this list.
  8. Denormalize the smaller tables that you are joining on as much as possible - similar to star schema/dimensional tables
  9. Do not mix Infobright (Brighthouse) tables with MyISAM tables in your joins - that can really take forever. (Infobright doesn't come with InnoDB... yet.)
  10. For best results, try to use aggregate queries like: select date, sum(col1), avg(col2), max(col3) from yourtable where... For example:
Select sum(units) from our_one_tera_byte_table where id = 4;
+------------------+
| sum(units)       |
+------------------+
|     154964394949 |
+------------------+
1 row in set (19.07 sec)
A row-based database simply cannot give you a result that fast.
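For item 5.1, here is a minimal sketch of what "small chunks" means in practice. The table and column names (mytable, event_date) and the file path are made up, and the right chunk size is something to find by testing:

-- delete stale data one day at a time instead of in one huge range
delete from mytable where event_date = '2014-03-01';
delete from mytable where event_date = '2014-03-02';
-- then bulk-load the replacement rows, which Infobright handles well:
load data infile '/tmp/mytable_2014-03.csv' into table mytable
fields terminated by ',' enclosed by '"';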
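And for item 7, a fleshed-out version of the sub-query rewrite. Every table and column name here is made up for illustration, and it assumes othertable.id is unique:

-- slow: the huge table is joined before it is aggregated
select o.id, o.name, sum(l.units)
from largetable l
inner join othertable o on o.id = l.other_id
group by o.id, o.name;

-- often much faster on Infobright: aggregate the huge table first,
-- then join the much smaller aggregated result
select o.id, o.name, t.total
from (select other_id, sum(units) as total
      from largetable
      group by other_id) t
inner join othertable o on o.id = t.other_id;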


Some more additions:
  11. There was an issue with taking a datetime from a table and doing a DATEDIFF on it - the per-row results come back as garbage. Example:
ib1@360reporting@16:29:20> SELECT c.start_time, DATEDIFF(DATE(c.start_time), DATE(NOW())), DATEDIFF('2014-01-01', DATE(NOW())) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 3;
+---------------------+-------------------------------------------+-------------------------------------+
| start_time          | DATEDIFF(DATE(c.start_time), DATE(NOW())) | DATEDIFF('2014-01-01', DATE(NOW())) |
+---------------------+-------------------------------------------+-------------------------------------+
| 2014-02-17 11:43:45 |                                    682088 |                                -148 |
| 2014-03-26 00:00:00 |                                    682125 |                                -148 |
| 2014-01-03 00:00:00 |                                    682043 |                                -148 |
+---------------------+-------------------------------------------+-------------------------------------+
3 rows in set (0.01 sec)

The solution was to use date_format() on it. Example:
SELECT c.start_time, DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 1;
+---------------------+-------------------------------------------------------+
| start_time          | DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) |
+---------------------+-------------------------------------------------------+
| 2014-02-17 11:43:45 |                                                  -101 |
+---------------------+-------------------------------------------------------+
1 row in set (0.00 sec)

  12. Currently the best method is in fact to have everything (long strings and all) in one large table. The data is stored in a very compressed way, and when generating reports this does not require Infobright to create a temporary table for joins.