Seismic Shock in the Analytics World

Yandex has released a free, open-source columnar analytical database called ClickHouse.
Its SQL dialect looks a lot like MySQL's at first glance, but it is their own column-oriented storage engine and query executor from the ground up, with plenty of extra math and aggregation functions added on top.

If it is anything like Infobright, you can expect around 50x compression, so for most of us mere mortals that means you can keep everything in one database.

It has (really interesting) replication coordinated through ZooKeeper.

It has materialised views, which is absolutely huge.

The benchmarks are here.

It is a lot faster than Hadoop/Hive. I am not sure how it compares to Cloudera Impala or Spark, but it would potentially be a lot simpler to administer, and it is free, of course.

It also comes with tons of aggregation functions, geospatial functions, math functions... a treat for any data scientist.

Watch this space.





How to Speed Up the Database Behind Your API

I have changed my tune and jumped on the bandwagon for APIs, BaaS, mBaaS, etc.
I like the idea, the flexibility and the potential it has.
As such, I have done some research and will now provide some practical advice on how to speed up the database that powers your API.


1) ORMs vs YeSQL - Speed


I have recently taken a liking to Clojure. Clojure has something very simple and elegant for connecting to databases, and that is YeSQL. YeSQL is essentially an SQL template library that keeps your SQL queries separate from your code. The queries live in an SQL file, each one tagged with a name annotation, and those annotations become the names of the functions you call from your code.
This keeps the code clean and the SQL queries separate.
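
As a minimal sketch (the table, file path and db-spec are all made up, assuming yesql's 0.5-style defqueries), a query file such as sql/users.sql might look like this:

    -- name: users-by-signup-date
    -- Users who signed up on or after a given date.
    SELECT id, email, created_at
    FROM users
    WHERE created_at >= :startdate

and in your Clojure code:

    (require '[yesql.core :refer [defqueries]])

    ;; Creates one Clojure function per annotated query in the file.
    ;; db-spec is whatever JDBC connection map you already use.
    (defqueries "sql/users.sql" {:connection db-spec})

    ;; The annotation name became the function name:
    (users-by-signup-date {:startdate "2016-06-01"})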

There have been plenty of discussions about why plain SQL performs better than an ORM.
I will add that in my professional opinion as a DBA, ORMs will:
  1. Make things slower
  2. Give you issues with the relational/object mismatch
  3. Make it difficult to find where the performance issue is
  4. Make it harder to fix the query after you or I find it in the database (is it easy for you to change an ORM-generated query to include an index hint? Could you change a join to a specific sub-query?)
The truth is that SQL is an extremely powerful DSL for reading and manipulating data and is very mature and thoroughly tested.


2) ORMs vs YeSQL - Teamwork and Monitoring


With YeSQL, you write your SQL query with named parameters, like: WHERE date = :startdate
and your application fills those parameters in at runtime.
If you need to change an SQL query, then as long as you don't change the parameters themselves, you do not need to update your application code.
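
For example, here is a hypothetical query before and after tuning (idx_order_date is a made-up index name); the name and the parameter are unchanged, so the application code is too:

    -- name: daily-revenue
    -- Before tuning:
    SELECT SUM(total)
    FROM orders
    WHERE order_date = :startdate

    -- name: daily-revenue
    -- After tuning (MySQL index hint; same parameter, same result set):
    SELECT SUM(total)
    FROM orders FORCE INDEX (idx_order_date)
    WHERE order_date = :startdate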

String together the idea of easily updating your SQL queries with keeping them in a separate .sql file, and you now have two very substantial features:
  1. You can ask me, a database performance engineer, to make the queries run faster by handing over only your .sql file. I can give you back the file with faster-performing queries, and as long as they take the same parameters and produce the same result sets, you don't need to make any further changes to your application code. This means you can easily bring different team members and outside experts onto your system.
  2. Depending on where you host your API, you may NOT HAVE access to performance information from the database, so having all your SQL queries in one place makes it much easier to work out which ones need tuning.
Some hosting providers will only tell you how long your API takes to answer calls. Some might give you more access to the database, or perhaps to the MySQL performance schema. It varies by hosting provider, and it helps to keep your queries in a nice and clean place, just in case you need to find the worst-performing ones.


3) More Monitoring


I recommend you set thresholds for your database queries and log the slow ones somewhere. For example, log every query and its parameters to a log file or a database table whenever it takes longer than 0.5 seconds. If you have N+1 queries, then consider logging the total time of the whole set.
This will help you a lot later on and will make you more proactive about finding issues; depending on your vendor, you may not be able to get this information any other way. You may also be able to feed that information into your alerting and monitoring software.
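
A rough sketch of the application-side approach in Clojure (the 500 ms threshold is the example above, and println stands in for whatever logging you use):

    (defn with-slow-query-log
      "Runs query-fn with params, logging the query name, parameters and
       elapsed time whenever it crosses the threshold."
      [query-name query-fn params]
      (let [start   (System/nanoTime)
            result  (query-fn params)
            elapsed (/ (- (System/nanoTime) start) 1000000.0)]
        (when (> elapsed 500.0)  ;; threshold in milliseconds
          (println "SLOW QUERY" query-name params (str elapsed "ms")))
        result))

    ;; Usage with a YeSQL-generated function:
    ;; (with-slow-query-log "users-by-signup-date" users-by-signup-date {:startdate "2016-06-01"})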


4) Caching - Memcached


Depending on where you host your API, you may be able to put a caching layer such as Memcached in front of the database. I would recommend it; it prevents your application from hitting (maybe even literally) the database more often than it needs to.
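
A minimal cache-aside sketch in Clojure, where cache-get and cache-set are placeholders for whatever Memcached client functions you use (they are not from a specific library):

    (defn cached-query
      "Look in the cache first; on a miss, run the query and cache the result."
      [cache-get cache-set cache-key query-fn params ttl-seconds]
      (or (cache-get cache-key)
          (let [result (query-fn params)]
            (cache-set cache-key result ttl-seconds)
            result)))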


5) Caching - Varnish, Nginx


Depending on where you host your API, you may be able to cache the API calls themselves. In most of the setups I have researched, this is on by default. If not, then I would recommend adding it, but I am guessing you probably knew that.
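
A minimal Nginx sketch of what this looks like (the paths, zone name and upstream address are made up; check your provider's documentation):

    # In the http block: where to keep cached responses.
    proxy_cache_path /var/cache/nginx/api keys_zone=api_cache:10m max_size=1g;

    server {
        listen 80;

        location /api/ {
            proxy_cache api_cache;
            proxy_cache_valid 200 30s;   # cache successful responses for 30 seconds
            proxy_pass http://127.0.0.1:8080;
        }
    }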


6) Caching - Summary tables, Count tables


One of the strategies you can use as an engineer is to create new tables that hold summarised information, and then query those tables to get faster response times. I have a video about this in the context of big data, but it very much applies here as well, as a form of caching.
Consider generating and maintaining tables that are a summarised version of your existing tables and that are substantially smaller. You can also "enrich" those summary tables with data that saves you from doing further JOINs. Then fill those tables with a batch job when needed (there is a sketch below).

This strategy is particularly useful for improving the response time of reporting queries on inherited projects with highly normalised database tables.
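
A minimal sketch in MySQL (the schema is hypothetical): a daily summary table, "enriched" with the product name to avoid a JOIN at read time, refreshed by a batch job:

    CREATE TABLE daily_sales_summary (
        sales_date   DATE          NOT NULL,
        product_id   INT           NOT NULL,
        product_name VARCHAR(255)  NOT NULL,
        order_count  INT           NOT NULL,
        revenue      DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (sales_date, product_id)
    );

    -- Batch job: rebuild yesterday's rows.
    REPLACE INTO daily_sales_summary
    SELECT o.order_date, p.id, p.name, COUNT(*), SUM(l.total)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.id
    JOIN products p    ON p.id = l.product_id
    WHERE o.order_date = CURRENT_DATE - INTERVAL 1 DAY
    GROUP BY o.order_date, p.id, p.name;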


7) Data Modelling - Do it Properly


Take the data modelling part of the application very seriously. I am a bit biased about this, but consider bringing in an expert at the beginning of the project, or early on, to help design your database tables optimally.
Take your time to think about how you will query those tables in the long term.
What would be the best way to store and query the data for the requirements now and in the possible future?
Don't just stick everything in a JSON/text and hope for the best.
Don't just normalise it to the 3rd, 4th, 5th, 6th normal form.
Plan it carefully.




Database APIs - the rethink

Previous post: How to Speed Up the Database Behind Your API

In a recent thread on Hacker News, I saw this post from Derek Sivers.
There, he talks about simplifying work and how his database has stayed the same throughout the years, while his code and programming languages have changed many times.

He recommends putting all the data-related logic (not business logic) in the database: things like constraints and data quality functions, as well as JSON output.
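
A rough sketch of what that looks like in Postgres (the table, constraint and function names are made up):

    -- A data-quality rule enforced by the database itself.
    ALTER TABLE users
        ADD CONSTRAINT users_email_has_at CHECK (email LIKE '%@%');

    -- A function that returns a user as JSON, straight from the database.
    CREATE FUNCTION user_as_json(uid integer) RETURNS json AS $$
        SELECT row_to_json(u) FROM users u WHERE u.id = uid;
    $$ LANGUAGE sql;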

I feel that this is a similar idea to database APIs, and now even MySQL 5.7 allows you to talk to it directly with JSON, which may well remove the need for a database API in some cases.

In MySQL you cannot really have this sort of data logic the way you can in Postgres, because MySQL triggers are far too slow. The whole premise of MySQL, and indeed of all the NoSQL databases, is to make the database as dumb as possible. In fact, database APIs are usually written to be database agnostic, and therefore as basic as possible, so that you may one day move over to a better database if such a case arises.

I agree with Derek that the database rarely changes and that data-logic, in an ideal world, should live in the database. This does not have to make the database a bottleneck. If, for example, you write a stored procedure that saves a large amount of data from being sent over the network, that is a good scenario, versus sending a ton of data over the network and then processing it in your application, which would actually take more time even if you try to process it in parallel.
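
For example (hypothetical schema again), instead of shipping every order row to the application and summing there, a stored procedure sends back just the aggregate:

    DELIMITER //
    CREATE PROCEDURE daily_revenue(IN p_date DATE)
    BEGIN
        -- Only one small row crosses the network, not the whole orders table.
        SELECT SUM(total) AS revenue
        FROM orders
        WHERE order_date = p_date;
    END //
    DELIMITER ;

    CALL daily_revenue('2016-06-01');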

The thing I am wondering is: how would you keep this data-logic in MySQL anyway?
Bite the bullet and use triggers?
Or would you still use a database API, but either keep it close to the database server or give it some cache?

Please share your opinion in the comments below.