28 Mar 2011

How to Make MySQL Cool Again

I feel that MySQL has lost its coolness factor.
Why do I think that? Well, because most people are interested in other things nowadays...
Other things like NoSQL - It's what all the cool kids are talking about.

NoSQL is as cool as it is partly because MySQL wasn't performing as fast as some people would have liked and eventually they looked for other alternatives.
Meaning that MySQL was the default database when people were installing their PHP/Ruby/Java/Python...etc website, but after some time, it got slow and difficult to maintain.



Difficult to Maintain (or not Web scale)

Lets try to analyse what "difficult to maintain" really means...
It could mean any, some or all of the following situations:
Your website got bigger -
  1. Immediately (flood of users/went viral) and you had no idea what to do (you are a Facebook app)
  2. After a short time, but you are working on new features and have no time (your sales/marketing department are ninjas)
  3. After a long long time (more than likely that you just haven't archived your old data in the last 10 years as well as you have more users now) 
  4. Because you added so many features to the site that everything runs at a crawl (forgot to benchmark on hardware or optimize processes for your hardware).


Who should solve these problems? 

Well, usually (in the companies I've seen) it mainly falls on the developers and maybe the system administrators. Companies of a certain size don't tend to hire someone specific to maintain the databases as their only purpose in the company.

Developers (in most cases) just are not interested in this database stuff. Not all of them, but a lot of them that I ran into would much rather discuss design patterns, agile development... and generally solving interesting programming problems.

System administrators usually have a lot on their plate to be able to mess around with the database beyond the minimum  to keep things running. 

Now, I am going to make an assumption that developers either have to solve all or part of the problems that the database causes or get very frustrated with these problems occurring. Probably both.

Web companies put a lot of pressure on these developers to release new features or fix existing bugs. If the developers say that by switching to a different technology, they will get things done faster, there is a certain possibility that these companies will implement those changes.


"Big Data"


"Big Data" is the new thing. Its interesting and cool. But what is "Big Data"?
Wikipedia defines it as: "Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools."

So the difference between "Big Data" and a large or very large dataset is if using your DBMS is too awkward to handle it. With a definition like that, it is clear to me why more companies are trying to move away from relational databases.

My understanding of the trend is this: companies with, in their mind, very large datasets (which can honestly range from a main table that is 10Gb to 100Gb, from real life examples) and have a very difficult time maintaining or scaling their database, would (of course) consider themselves to have "Big Data" and look for solutions for it. Considering that its now interesting and cool, why not at least take a look at it and even give it a try in the company?





Is NoSQL the Enemy?

NoSQL is not the enemy. Key/Value stores, document-based databases and similar NoSQL databases can be a very good tool for a certain problem.
We've been using Key/Value stores in memcached for years. Document-based databases can be perfect as a replacement for Entity-Attribute-Value data structures or unstructured data.

There are also companies that their main function is only to process large amounts of data and they do it very well with NoSQL databases with Map/Reduce.

I didn't go into this at all, but for a website with a lot of transactions, NoSQL could also be a very good solution, depending on the situation.

The "enemy" is the hype around NoSQL as the one-size fits all solution and the FUD about scaling MySQL.
(in my opinion)



How to Make MySQL Cool Again

So without further ado, here is my completely opinionated list of how to make MySQL cool:


Gearman

I believe this is one of the MAIN things that will make MySQL cool again.
One of the main problems with MySQL is that it doesn’t do things in parallel very well. 

Gearman can work with MySQL to run queries in parallel. Gearman can also work as an addon to branch out work once data is updated in the database. This leaves a space for MySQL to do some very interesting things and doing those things asynchronously. 
With regards to processing “Big Data”, you can always process some parts of the data you need or all the parts at offline times or off peak times. Doing this using Gearman can prove to be very very scalable and efficient. 


Map/Reduce


Map/Reduce isn’t just for NoSQL. It can also be used with relational databases. There are already proprietary databases that use some form of Map/Reduce, such as Greenplum, Teradata and Aster.

There is an argument that Map/Reduce can or should be used more for “offline” work, but I do not know enough about that to make a judgement on it.

There is already uses of Hadoop with MySQL and things could get interesting if this trend increases.



Column-Based Storage Engines


MySQL has a few of these already as storage engines: InfiniDB and Infobright. What these engines do is store the data in a way that allows you to do run queries on large data. These queries are specifically used for reports (these engines are not terribly good at many small insert/deletes). These engines save you from guessing which index to put on where because every column is a sort of index. Lastly, they also make an effort to compress the data.

Now these storage engines are not always that cool, at least not to developers, because they usually require some sort of work on the data so that it can be used efficiently. This work is called ETL (extract, transform and load) and now we’ve stumbled onto the world of Data Warehouses.

Data warehouses have been around for decades and have been a very good solution for querying large data sets for business intelligence.
The problem is that traditionally, companies who have used business intelligence have been quite big and with a lot of money. The tended to hire database administrators, database developers, data analysts and “power” business users analysts (think experts at Excel and Pivot Tables). Those people are already familiar with Data Warehouses and business intelligence and it won’t be too difficult for them to implement it (albeit it will take them a long time and be expensive).

Startups don’t tend to hire these kinds of people, at least not straight away and are not too keen on the “a long time and expensive” part of the deal.
Even though there are open source BI suites (Pentaho and Jaspersoft) and you can use MySQL or Postgres as well, the knowledge and skill set for implementing business intelligence is limited and possibly expensive.

NoSQL advocates claim that NoSQL with Map/Reduce will give small companies the power to compete with the “Big Boys”. However, I do not see a way for NoSQL databases to give tools to business users to query and analyse data on their own the same way BI suites or even SQL as a language does at the moment.


Play Around with Shiny New Hardware

MySQL is getting better and better at scaling on  faster hardware. Why not have fun with it and load up on memory or experiment with new storage like RAID, SSDs, FusionIO/Verident or SANs.

Forget commodity hardware. Adding 32Gb of memory to a server that comes with a default 8Gb nowadays is inexpensive.

It could be quite cool to buy something for a few hundred dollars/pounds and see a dramatic increase in speed on your MySQL server.


Have Out-of-the-box Tailored Configuration for MySQL

No really. Why can’t the regular installation of MySQL ask you 3-5 questions about “how do you plan to use MySQL?” and then change the my.cnf file accordingly.

It would be kind of like those 3-4 different my.ini files that came/comes with MySQL for Windows.
Questions can be like “Do you mainly want the database to be transactional?” and “I notice that you have 16Gb of memory, would you like me to allocate 50% of that memory for the database (innodb buffer pool)?”
You can get a lot of mileage out of MySQL if you just configure it slightly.

It can be quite cool if MySQL “just works” for much longer if configured more suitably to the server it’s on. In addition, you can add to this idea at the end: “MySQL has now been configured to your server. Now you can relax”.


Materialized Views

It is apparently a well kept secret in the Web world, but almost all the proprietary databases have something called Materialized Views (some not with that exact name).
What is it? It’s basically a copy of a select query that’s already been processed and is on disk for you to reference. There is no need to generate the query from the main table over and over again. If the data that you need is in the Materialized View, then querying it directly will be (should be) very fast and non-locking to the main table it’s referencing.
These proprietary databases usually also maintain this view themselves and allow you to specify how often you would like the data to be updated.

So imagine you have a large table where you keep all your orders. This table is inserted to quite often and at the same time, the business users in the company like to run reports on this table. This can cause problems sometimes and usually the way to solve this issue, is to allow the business users to run their reports on a slave database (which is always a good idea).
If you know that the business users are only interested in the grouped by data of this table for each month, you can compile a separate table with this data which would be much much smaller and faster to generate reports from.

MySQL doesn’t have built in Materialized Views, but luckily, there is Flexviews which can do the same thing for you.
In a very short summary, it’s a package that parses your MySQL binary log with PHP (think tail –f on the binlog) and then applies it to the tables which will be written to disk. You can use it to generate the whole table (which would be the same as: CREATE TABLE mv_orders AS SELECT * FROM orders GROUP BY MONTH(`date`),YEAR(`date`);) and you can use it to generate incrementally as I have explained above.
It’s really not intimidating to use and Justin Swanhart at Percona has recently written some blog posts to explain how to use it.

For short, it could be quite cool to generate this kind of view quickly and give it to the business users (or even your own reporting application). Everyone is happy and you can back to what you were doing before.


Sharding

Sharding is all the rage and in particularly useful when you do use commodity hardware. While its quite cool to talk about it, implementing it here can actually be a little more work and a little less cool.

I’ll give an example of how application sharding could work:
Lets say you have your website, everything is running fine apart from this 1 client that really kills your server. So you decide that you will “shard” the data and give this 1 client his own server.
You then go over all your code and add IF statements to say:
IF customer_id = 1 THEN use private_server1
ELSE use regular_server
You then go over all your tables and decide which are “global” tables and should be on both servers, which are primary tables which can be sharded and which are child tables which rely on the primary table.
You also might like fail-over and redundancy, so you create a Slave server for both servers. Now you have 4 servers and 2 backups that you need to maintain.

There is a company that might help you do this, but from what I saw, they are mainly for Java apps.
You can also try using MySQL proxy.
Whatever you choose to do, you might run into some gotchas.

What you will get is smaller databases that run faster which can be cool and you can definitely go to some conference to talk about it.


Tungsten Replicator

Tungsten Replicator made it on my list because it can solve quite a few scaling problems MySQL can have. The important ones to me are:

  • Multi-source replication - You can have 1 to many MySQL servers collecting data into 1 "hub" and then replicate that over to the other servers.
  • Solve replication lag by "sharding" the replications into several streams. I know quite a lot of companies that split up their databases by which application uses them (functional partitioning) to solve the problem that way.
  • The enterprise version allows you to change masters instantly in the event of a crash. Its high availability at a cost but can save you a lot of time and trouble.
  • Hidden ETL functionality - Can replicate data from several MySQL DBs (or Postgres) into 1 DB for reporting or back ups.
I haven't played around with it that much, but it looks to have similar functionality to Flexviews where you basically tail -f the binlogs and then you parse it in some way.
In any case, it's quite cool and there is a lot of blogs and presentations out about it recently.



MySQL Cluster

Also on the topic of using commodity hardware, you have MySQL Cluster.
Not yet as cool to initially set up, but can be very cool for high transactional websites. It has been battle tested by a large number of telecoms companies. MySQL Cluster has also recently received a lot of attention inside Oracle (I believe this is because it’s a bit more difficult to maintain and its easier to sell commercial support for it). New developments to it is to integrate it with memcached directly and easier to work with it using Java.

Personally, I am waiting for MySQL Cluster to get a bit more critical mass. The work on memcached API looks interesting and I also recommended a name change to make it sound cool.


Scaling Up MySQL

If you do feel you have “Big Data” with your MySQL databases, please consider doing the following:
  1. Search the Internet – there is a HUGE amount of information available on the internet to help you with your MySQL problems. There is (still) a very friendly community behind MySQL and there are people falling over themselves to help you on stackoverflow, serverfault as well as MySQL forums and IRC.
  2. Hire someone to take care of your MySQL Databases – either hire a full time/part time MySQL DBA or talk to a MySQL consultancy company. Your main options are: Percona, SkySQL (both only support MySQL) and Pythian, Blue Gecko and PalaminoDB (which do a mixture of MySQL, Oracle and MS SQL). There's also this little known company called Oracle which owns MySQL and supports it.
  3. Send someone in your company on a MySQL course. Percona and SkySQL do those as well.
  4. Buy better hardware. Again, get some memory,  faster disks or buy a new server entirely. In a lot of cases, this can be the cheapest option if you consider how much lost hours of productivity your developers are wasting trying to tweak the code to your existing MySQL database.


Disclaimer:
All that I have written here is my opinion. There is no hard data supporting my claims. These are just trends that I have noticed and am expressing my thoughts in my blog. I reserve the right to completely change my mind if I or someone else finds new data or different trends.



Thank you for reading this blog post.

7 comments:

  1. Hi! Thanks for the mention about Tungsten Replicator. We Tungsten guys are all over the problem of making MySQL cool again. You framed the issue perfectly.

    I would add one more suggestion for ensuring a return to coolness: give it a little time. Many of the things that people don't like about SQL databases (example: defined schema) are there for a reason but sometimes we have to relearn the problems that those features solve to appreciate them fully.

    ReplyDelete
  2. Thank you for that, Robert.

    I guess there is some wisdom in that "relearning the problem" part of your comment. I look forward to it.

    ReplyDelete
  3. Hi,

    Thanks to your links to Flexviews and Shard-Query.

    ReplyDelete
  4. If you need some sharding, you can use ScaleBase - it works with any programming language, and with your existing BI tools, MySQL Workbench, etc.

    ReplyDelete
  5. Fascinating post, thank you.

    ReplyDelete
  6. Hi,
    Didn't read your entire article, but I didn't spot any mention of MariaDB

    See http://kb.askmonty.org/en/mariadb-versus-mysql-features

    and

    http://kb.askmonty.org/en/dynamic-columns which are presumably a means to use 'mysql' in a NoSQL manner...

    David.

    ReplyDelete
  7. I love that you included playing with the hardware.It's one of the most common mistakes people do. Properly updating your server can save you a lot of trouble in the future.

    ReplyDelete