4 Jan 2018

A DBA Analyses 'The Phoenix Project'

Last year, I read 'The Phoenix Project'. I liked it and as an IT manager in the past, I did experience high blood pressure during the SEV1 scenarios in the book.



I also liked the way DevOps methodology helped solve issues with IT as well as help the company succeed overall.

As a DBA, however, I did have some things that didn't make sense to me about this story. 
Bare in mind that the two major incidents in the book were database related. So in this post, I would like to jot down some things I have noticed and how they could have been solved looking at them from a different lens.


Caution, Spoiler Alert



Incident No.1 - Tokenisation


In the first incident, a 3rd party supplier ran a script against the database to tokenise some personal data. This was related to an issue that information security highlighted, but had the result of effecting HR and accounting.

In the book, there is a complaint that there was no test environment to see if this script would have any negative effects on other parts of the organisation. 

Now to me, this does make sense and at the same time, makes no sense at all.

If you meant, that back in the day, it was hard to get full environments setup to test changes on your application servers, then you would probably be right. Today, perhaps based on the methodology that this book introduces, you probably do have those environments setup: either virtualised or in a container. 


Testing Database


What doesn't make sense to me is that is not having a test database. Now reading through the book, there are mentions of Oracle database and some MS SQL databases. As a mainly MySQL DBA, I have not always worked on those databases, but I have worked next to people who have. My observation is, if you were to have an Oracle database, you would almost certainly have other dev/test/UAT/staging/pre-prod database servers as well. 
Why do I think this? If you can afford to pay for an Oracle database, you would probably get more testing databases under the same license. License being the most expensive part when using Oracle.  
So a testing database to test things that may effect the precious and expensive database server is almost a certainty.


DBA as a Gatekeeper


Now it seems shocking to me that the DBA had not been involved in the process to validate this 3rd party script. Old school Oracle DBAs are involved in everything that happens on their servers. 
Need a user on the database? goto the DBA. 
Need a database server for a new app? please fill these in triplicates, detailing what would be the projected usage for the next 5 years. 
In most companies, an Oracle DBAs may even setup integration between other products like Oracle HR and finance. 
So how could you have run something that significant against the database without their knowledge is beyond me.

Assuming that a database field had in fact been messed up, then Oracle DBAs have a TON of really enviable backup and restore features. 
They can query a table to view all the backups that are available to restore from and choose the point-in-time that is closest to what they need. A DBA could simply restore the database, fetch the specific table that had its column changed and apply it to the to production database. 
Its more than one table? Restore the database, go over the changes in the logs a point-in-time and skip the parts the conversion script applied.


It seems to me that the authors wrote the book based on their own experiences, but those experiences occurred in companies that had no DBAs. Not having a DBA is a product of start ups, not old school 1500-person car-parts manufacturers.


Incident No.2 - Conversion


There was a crippling database issue to do with a database conversion that was needed along side some new code roll out. The issue caused a 2 day - break out the hand held receipt machine - downtime to the system.


Works on My Laptop


During the initial investigation, a developer said something along the lines of 'it worked fine on my laptop' when describing the performance of the database conversion scripts. The problem was that on production, it was x1000 slower. Now, I have written about how to not be the one that deploys that slow query to production before and this really states that situation. Apparently, they still didn't have a database testing environment to test it against.

However, on the topic above of 'DBA as a gatekeeper':
Why didn't the DBA review the conversion scripts or was involved in the the code review process for SQL statements?
It could be that there wasn't any in the company.

Another point was that they couldn't cancel the conversion after they started and noticed how slow it was. If this was within a transaction or a single alter table statement, why not?
If too many things have changed, could they not restore the database to a point-in-time before the changes were made?
Was the conversion x1000 slow instead of maybe x10 slow, because of a foreign key check that could have been turned off?
A DBA would have given you those options.


Project Unicorn


After the hero turns things around and things begin to pickup, they decide to start a separate project to add predictive features to the main project. In it, they decided to bypass seeking permission for database changes and create a new database where they copied production data into it from several locations. I very much like this approach and it falls in line with the reactive micro services pattern

This would make this book ahead of its time. Instead of managing one main database (although, they did mention in the book that had a couple of dozen database servers) for the website, they can break it up into several database servers, based on functionality. What is required is to use tools - and I would believe in 2012, they meant ETL tools - to migrate the needed data into these new database servers. 

This would still need a DBA though or at the very least, a data engineer with an ops background, as you now need to:
  • Data model new environments based on data from old ones
  • Create and maintain data pipelines
  • Monitor for errors and fix data that didn't make it 
  • Fix data drift and re-sync data across servers

In addition, you now need to backup, monitor the availability and performance of these additional database servers.

So while it adds complexity to the backend and you are now moving from simple database maintenance to a more data architecture role, it is the way forward. Certainly the only way to have proper micro services with their own single-purpose and loosely coupled data stores.