30 Nov 2017

Archiving for a Leaner Database

There is an idea that data is sacred and needs to be stored forever. However, if you keep your data forever, you will, sooner or later, have a very large database.

In order to keep operations running smoothly, it is useful to keep frequently used data in your main tables and move infrequently used data into archive tables.



Some examples


You have a large table that stores transactions and its size is 200 GB. It got that way because your company has been around for 5 years, but in the last year your company has been successful at acquiring new users and your data has doubled.

Congratulations.

In your database, you now have a table with 5 years' worth of data, but your application usually only needs about the last 1-3 months. There may be a use case where someone requires data about a customer from a year ago, and there may also be a reporting request to aggregate data for the last 3 years. Therefore, to play it safe, everything stays in one table.

However, this greatly affects performance. It would be more helpful to separate those 3 concerns into 3 different tables:
  1. A table for a 3 month period for frequently used data
  2. An archive table that keeps all old and infrequently used data
  3. A summary table for reporting
With these, we comply with the Single-Responsibility Principle and greatly improve performance for each purpose.

Having a 'main' table with only the last 3 months' worth of data greatly helps you scale.
For example, even if your data doubles every year for the next 3-5 years, you still only have to manage a subset of that data. So if those 3 months once took 20 GB to store, the year following would take 40 GB and the year after 80 GB: these sizes are still very manageable by today's standards.
In addition, hardware and software improves over time, so there can be a legitimate expectation that simply by upgrading and updating, you can keep humming along.

Taking the effort to identify 'hot' and 'cold' data and allocating it to the right tables can mean that your scalability concerns are addressed for the long term.



How to implement Archiving?




Table _archive


One way to implement archiving is to have a companion table whose name ends with _archive.

To enable this, you need to be able to redirect your queries (mainly in your code, or through a proxy capable of it) to either the main or the archive table, based on particular criteria.

For example, if the row's date is older than today's date minus 30 days, send the query to the archive table; if not, to the main table.

Another example may be: if the status column equals 'inactive', send it to the archive table.

You would largely need to dig through your code for that table's queries and wrap them with an IF statement that sends each one to the right table.
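For the rare query that does need full history, one option (a sketch; the transactions and transactions_archive table names here are hypothetical) is a view that unions both tables, so that code path does not need its own IF:

mysql> CREATE VIEW transactions_all AS
    -> SELECT * FROM transactions
    -> UNION ALL
    -> SELECT * FROM transactions_archive;

Note that a query against such a view reads both tables, so reserve it for the infrequent full-history cases.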

You would also need a data process that migrates rows from the main table over to the archive table as they get old or become cold.
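A minimal sketch of such a process, assuming hypothetical transactions / transactions_archive tables and a 90 day cutoff on a dateCreated column: copy the cold rows over, then delete them from the main table. In production you would run this in small batches to keep locking and replication lag under control.

mysql> INSERT INTO transactions_archive
    -> SELECT * FROM transactions
    -> WHERE dateCreated < NOW() - INTERVAL 90 DAY;

mysql> DELETE FROM transactions
    -> WHERE dateCreated < NOW() - INTERVAL 90 DAY;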



Partitioning by Date


While this is not a different physical data model, it does help split the table into a few smaller tables and achieves the desired purpose without application code changes.

It is very common to partition your table by date, so that old data is allocated to the right partition.

mysql> CREATE TABLE `largetable` (
->   `id` bigint unsigned NOT NULL AUTO_INCREMENT,
->   `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
->   `status` int default 1,
->   `sometext` text,
->   PRIMARY KEY (`id`,`dateCreated`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.03 sec)

mysql> alter table largetable partition by RANGE(YEAR(dateCreated)) (
-> PARTITION p2016 VALUES LESS THAN (2017), 
-> PARTITION p2017 VALUES LESS THAN (2018), 
-> PARTITION p2018 VALUES LESS THAN (2019), 
-> PARTITION p2019 VALUES LESS THAN (2020), 
-> PARTITION p2020 VALUES LESS THAN (2021), 
-> PARTITION pmax VALUES LESS THAN MAXVALUE);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

The above example allocates rows by the year in which they were created. Please note: after 2020, this sort of manual partitioning requires adding new yearly partitions by hand. If you do it in advance, this can be done without disrupting operations.
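For example, a partition for 2021 could later be carved out of pmax without touching the existing partitions:

mysql> ALTER TABLE largetable REORGANIZE PARTITION pmax INTO (
    -> PARTITION p2021 VALUES LESS THAN (2022),
    -> PARTITION pmax VALUES LESS THAN MAXVALUE);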



Partitioning by Status


You can also partition (as mentioned above) on a status column with active/inactive values; simply by using UPDATE to change the value, MySQL moves that row over to the right partition. REPLACE or INSERT + DELETE will work as well.


mysql> CREATE TABLE `largetable` (
->   `id` bigint unsigned NOT NULL AUTO_INCREMENT,
->   `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
->   `status` int default 1, -- default active
->   `sometext` text,
->   PRIMARY KEY (`id`,`status`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)

mysql> alter table largetable partition by list(status) (
-> partition pactive values in (1), -- active 
-> partition pinactive values in (2) -- inactive
-> ); 
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select * from largetable partition (pactive);
Empty set (0.00 sec)

mysql> select * from largetable partition (pinactive);
Empty set (0.00 sec)

mysql> insert into largetable(sometext) values ('hello');
Query OK, 1 row affected (0.01 sec)

mysql> select * from largetable partition (pinactive);
Empty set (0.00 sec)

mysql> select * from largetable partition (pactive);
+----+---------------------+--------+----------+
| id | dateCreated         | status | sometext |
+----+---------------------+--------+----------+
|  1 | 2017-10-30 10:04:03 |      1 | hello    |
+----+---------------------+--------+----------+
1 row in set (0.00 sec)

mysql> update largetable set status = 2 where id =1 ;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from largetable partition (pactive);
Empty set (0.00 sec)

mysql> select * from largetable partition (pinactive);
+----+---------------------+--------+----------+
| id | dateCreated         | status | sometext |
+----+---------------------+--------+----------+
|  1 | 2017-10-30 10:04:03 |      2 | hello    |
+----+---------------------+--------+----------+
1 row in set (0.00 sec)



Partitioning by ID


And lastly, you can partition on the sequence of your auto-incrementing id key.


mysql> CREATE TABLE `largetable` (
->   `id` bigint unsigned NOT NULL AUTO_INCREMENT,
->   `dateCreated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
->   `status` int default 1,
->   `sometext` text,
->   PRIMARY KEY (`id`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)

mysql> alter table largetable partition by RANGE(id) (
-> PARTITION p1 VALUES LESS THAN (500000000), 
-> PARTITION p2 VALUES LESS THAN (1000000000), 
-> PARTITION p3 VALUES LESS THAN (1500000000), 
-> PARTITION p4 VALUES LESS THAN (2000000000), 
-> PARTITION p5 VALUES LESS THAN (2500000000), 
-> PARTITION pmax VALUES LESS THAN MAXVALUE);
Query OK, 0 rows affected (0.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

The above example specifies which partition a row should go to based on the range its id number falls in. This approach is more useful if your system does a lot of primary key lookups. It also helps distribute the partition sizes more evenly than dates do, since recent years can hold much more data.



A word of caution


Partitioning on the right key is absolutely crucial and not easy. You need to analyse the queries that the application sends to that specific table and come up with a partitioning key (or keys) that works well and does not slow the table down: at the very least, it should not hurt the top 80% of the slowest queries.

The partitioning key needs to be part of the PRIMARY KEY, and for the optimiser to prune down to the right partition, that key should ideally be included in the WHERE clause of all SELECT/UPDATE/DELETE queries. Otherwise, your query will scan sequentially through every partition in the table.
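You can check what the optimiser actually does with EXPLAIN (or EXPLAIN PARTITIONS on older MySQL versions): on the id-partitioned table above, its partitions column should name a single partition for a primary key lookup, and every partition for a query that cannot be pruned.

mysql> EXPLAIN SELECT * FROM largetable WHERE id = 42;
-- the partitions column should show only p1

mysql> EXPLAIN SELECT * FROM largetable WHERE sometext = 'hello';
-- the partitions column should list every partition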