Top Slowest Queries and their Associated Tables in MySQL

The following query reads from performance_schema in MySQL and attempts to match each digest text, via REGEXP, against the table names in the same schema.
 SELECT d.*,
        (SELECT GROUP_CONCAT(DISTINCT table_name)
           FROM information_schema.TABLES
          WHERE table_schema = d.schema_name
            AND d.digest_text REGEXP table_name) AS table_names
   FROM performance_schema.events_statements_summary_by_digest d
  WHERE d.digest_text REGEXP '^(SELECT|UPDATE|DELETE|REPLACE|INSERT|CREATE)'
    AND d.last_seen >= CURDATE() - INTERVAL 7 DAY
  ORDER BY d.sum_timer_wait DESC
  LIMIT 10\G

Top 3 Reasons Why SQL is Faster than Java

I had a discussion with a colleague the other day. He was trying to write some SQL against a less-than-optimal data structure and was getting frustrated that it was looking "cumbersome". He wanted some advice, but was keen to simply write it as a mix of a few light SQL statements and some Java.

I would like to explain why this option would be slower than using "ugly looking" and "cumbersome" SQL:

1) Disk I/O


If you were to use Java, you would probably need to fetch a larger dataset from the database, process it in some way and output the results. The database would have to read that larger dataset for you, which means more (sometimes much more) disk I/O.

If you were to use SQL, you leave the fetching to the database's optimiser, which, with the help of indexes, may not need to read nearly as much data.

In short, you are letting the database find the right data, and filter out what does not need to be fetched, for you.
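As a toy illustration (the table, column and class names here are made up for the example), compare pulling every row into Java and filtering there against letting the database do the work with a single aggregate query:

```java
import java.util.List;

public class FetchVsFilter {
    record Order(long id, String status, double amount) {}

    // The Java approach: every row was fetched over the wire, then filtered
    // and summed in the application. The database still had to read and ship
    // all of the rows.
    static double paidTotalInJava(List<Order> allRows) {
        return allRows.stream()
                      .filter(o -> o.status().equals("PAID"))
                      .mapToDouble(Order::amount)
                      .sum();
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(
            new Order(1, "PAID", 10.0),
            new Order(2, "OPEN", 99.0),
            new Order(3, "PAID", 5.0));
        // The SQL approach ships a single row back instead, and an index on
        // status can avoid reading the non-matching rows at all:
        //   SELECT SUM(amount) FROM orders WHERE status = 'PAID';
        System.out.println(paidTotalInJava(rows)); // 15.0
    }
}
```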

2) Network


For the reason above, that large dataset normally has to travel over a network, unless the Java application server runs on the same machine as the database, which is not very common nowadays.

That network overhead can become more pronounced in a virtual or containerised environment, where networking issues can be a headache. (Note: I am not an expert here, just observing from a distance.)

In addition, passing data over the network adds further overhead in parallel systems, where the data needs to travel to each machine before an operation can start.

3) Java's Garbage Collection


It may not be well known, but Java adds quite a memory overhead to objects and some data structures; you can sometimes see a 100x difference. This does not mean that you need 100x the memory available, but it does mean that the GC has to work extra hard, spending more CPU cycles cleaning up the extra memory churn.

Bonus: SQL takes far less code than Java


While SQL can be an ugly string in your code at times, doing the same work in Java can take 30x-100x more lines of code, including tests. You would also need to test that your code does what SQL already does well, such as JOINs and aggregate functions.
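To make the point concrete, here is a hand-rolled sketch (all names invented for the example) of what a two-line SQL join-and-group-by turns into in Java. Note that the hash index, the probe and the aggregation all become code you now own and must test:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ManualJoin {
    record Customer(long id, String country) {}
    record Order(long customerId, double amount) {}

    // A hand-written equivalent of:
    //   SELECT c.country, SUM(o.amount)
    //     FROM orders o JOIN customers c ON c.id = o.customer_id
    //    GROUP BY c.country;
    static Map<String, Double> totalByCountry(List<Customer> customers,
                                              List<Order> orders) {
        // Build a hash index on the join key, much like the database would.
        Map<Long, String> countryById = customers.stream()
            .collect(Collectors.toMap(Customer::id, Customer::country));
        // Probe the index for each order, then aggregate per country.
        return orders.stream()
            .collect(Collectors.groupingBy(
                o -> countryById.get(o.customerId()),
                Collectors.summingDouble(Order::amount)));
    }

    public static void main(String[] args) {
        var customers = List.of(new Customer(1, "MT"), new Customer(2, "UK"));
        var orders = List.of(new Order(1, 10.0), new Order(1, 5.0),
                             new Order(2, 7.0));
        // Totals per country; iteration order of the map may vary.
        System.out.println(totalByCountry(customers, orders));
    }
}
```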


Caveat: when is it a good idea to use Java over SQL?


When processing a lot of data, databases can be thought of in terms of the following concept:
Row vs Chunk vs Too Big

'Row' is the slowest way of getting and processing data, unless you need to guarantee a level of data quality that requires row-by-row handling.
'Chunk', a set of rows, is just right. Usually this means querying a large table by 2 or 3 keys and getting a result set that the database can handle well.
'Too Big' is the case where the database cannot handle the number of rows well, and you would need to split your SQL into 'Chunks' using Java and process it that way.
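A minimal sketch of that 'Too Big' case, with the query stubbed out behind a function (the table and column names in the comment are made up): the Java side walks the table in key ranges, so the database only ever sees 'Chunk'-sized queries.

```java
import java.util.List;
import java.util.function.BiFunction;

public class ChunkedScan {
    // Walk a table that is too big for one query by asking for one key range
    // at a time. fetchChunk stands in for a real keyset query such as:
    //   SELECT id FROM big_table WHERE id > ? ORDER BY id LIMIT ?;
    static long processInChunks(BiFunction<Long, Integer, List<Long>> fetchChunk,
                                int chunkSize) {
        long processed = 0;
        long lastId = 0;               // keyset cursor: highest id seen so far
        while (true) {
            List<Long> ids = fetchChunk.apply(lastId, chunkSize);
            if (ids.isEmpty()) break;  // no rows past the cursor: done
            processed += ids.size();   // ...process the chunk here...
            lastId = ids.get(ids.size() - 1);
        }
        return processed;
    }

    public static void main(String[] args) {
        // Simulate a table holding ids 1..10 with an in-memory "query".
        BiFunction<Long, Integer, List<Long>> fake = (after, limit) ->
            java.util.stream.LongStream.rangeClosed(after + 1, 10)
                .limit(limit).boxed().toList();
        System.out.println(processInChunks(fake, 3)); // 10
    }
}
```

Paging by the last key seen, rather than by OFFSET, keeps each chunk query cheap even deep into the table, since the database can seek straight to the cursor position on the index.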


Please also check out this book that I found useful in this matter:


What is a Good Data Model

This is an excerpt from something I am working on

A well-modelled table should

  1. Allow data to be retrieved quickly
  2. Allow data to be stored quickly
  3. Be clear and easy to work with

A well-modelled table should not

  1. Store unneeded data
  2. Need its rows changed very often **
  3. Need too many JOINs to get to the data that you need

Purposes of a Data Model

A good data model should serve a specific and narrow set of purposes.
The more purposes a table serves:
  1. The more indexes it will need.
  2. The more cumbersome it will be to store and keep in memory.
  3. The more overhead it will add on writes.
  4. The more likely it is to become a single point of failure.
  5. The more likely it is to suffer from locks and deadlocks.
  6. The more likely it is to accumulate unneeded data.
  7. The more difficult it will be to change your application when you need to change the table.
If you notice a pattern here, it is that a high degree of reusability may hinder the performance of a database. There needs to be a balance between reusability and single responsibility for data models to be effective.

** A table whose data changes often and is transient may be better suited to a cache. If the data needs to be persisted and transactional, consider a smaller table that records the state of certain keys or values, combined with a log that stores how it got that way, if that history is needed.