Jonathan Levin's Axioms

Ideas, Rants, Theories vs Getting Things Done

26.11.09

Reply to SQL is dead, Long live SQL

Posted by Jonathan |

I read the article SQL is dead, Long live SQL and I have to say I really liked it. It spoke about the NoSQL movement and how relational database appeal to analysts.

This struck a cord with me. I always believed that what the regular applications play a small part (perhaps a very small part) of what happens to the data through out its lifetime.
After the data is inputted for the first time, it is analyzed, dissected, made sense of and placed in several reports and KPIs that are called for several times a day/week/month.

The business that paid for the application to be developed needs that data. It is not trivial to them at all. In fact, they may possibly make long term strategic decisions based on that data (in an ideal world). Hopefully, they use that data to improve how they react with their customers and the outside world, making them a better and more profitable company.

Which in turn, makes them hire the same developers to write them some more applications. It’s all a big cycle and data helps keep that cycle moving (or at least, that’s how I see it).

So if you are a developer and you are reading this, remember that even though it pains you to write SQL code or use an ORM, it’s all part of an important cycle that makes the world a better place. That, or you might get paid more. Whichever is more appealing to you.

29.9.09

Cycle of Acquiring Knowledge

Posted by Jonathan |

I have been doing some reading into data mining recently. It's really interesting stuff.
To explain it, lets take an example:
Lets say you work in a business that has been working for some years. They have their own way of running themselves and their processes. One day someone comes along and suggests that the business should automate some of their procedures by developing a website that would interact with customers and process their orders.
Some nice PHP/MySQL guy or girl is hired and develops the website and the cycle of acquiring knowledge has unwittingly started.

The diagram below describes how management starts to process its own business knowledge and passes them down the chain so they can be implemented into the application.
Eventually, every transaction is saved in the database. Over time, the database has more and more data and then the management of the company asks to get reports from this data.

Why would they ask that? Because they are interested to get an overview about the operations from this new part of the business.

What they could also benefit from is finding patterns and trends from the data which can help the business increase its own business knowledge. Meaning, data mining.

The diagram below explains the process for that:
Here the data goes upstream and back to the business's management. Assuming they find trends or patterns that help them be a better business and more competitive, they should apply that knowledge into the operations of the business.


Now, I have a question....

Lets say, that the business decided to take the NEW business knowledge and apply it to their website. For example, every person who buys a red hat would also probably buy a green tie (like what 'amazon recommends' does ) or if you come to this landing page, there is an 87.2% chance you are only interested in these lines of products and nothing else.
So they would ask the developers to add that new rule to the website which would in turn make the website better to its users (hopefully).

This means that the cycle went from down to up (like in the first diagram) to up to down (like in the second diagram) and back again. Ergo, a cycle.
And my question is.... how fast on average, do companies repeat the cycle?
How often do they learn from their data lying in their databases and apply it to their businesses?

I'm going to guess around 6-12 months for a whole cycle, but its a wild guess based on no data whatsoever and in the data mining book I'm reading, you shouldn't really guess with no data.



22.9.09

MySQL2GoogleSpreadsheets

Posted by Jonathan |

I've managed to find a way to connect MySQL directly to Google Spreadsheets and although that its not yet perfect, it does show a lot of potential.

You will have a MySQL table directly fed into Google Spreadsheets. From there, you could do some charts, highlights some trends or simply share the data as is with our people in our out of your organization is a secure way.

The end result should look something like this:


What you will need:
A Linux Server with Apache and MySQL
A Google Apps Premier account
Google Secure Data Connector installed on your Linux server


SDC
Installing the SDC is a bit tricky but not too difficult and there is a lot of documentation as to how to do it. A lot of it is giving things the right permissions and configuring 2-3 XML files.

You can read more about how SDC works, but from my impressions, it runs on your Linux server and gets data from apache. It then sends that data in a secure way into Google Spreadsheets. So when you query your data, the URL you use is "localhost" which is as if you are on the server.

Apache
Again, its important to note that everything that SDC gets has to go through Apache, which gives you the option of static data (in the form of CSV or XML files) or dynamic data (from PHP for example). It also means that you should secure Apache to not give the whole world this data. You can do this with OAuth (and there are a lot of instructions as to how to secure Apache with SDC in the SDC documentation) or you could try restricting Apache to know that it will only server 127.0.0.1 since SDC runs on your server. Either way, you need to factor that in.

MySQL
The question you must be asking now is, since SDC connects to Apache, how to I get data out of MySQL directly? (hopefully in the future SDC might be able to query MySQL directly)

In order to achieve that, I have used MySQL CSV Storage Engine, which basically means that MySQL will store all the data from the table into a CSV file on the hard disk (I hope that you are already imagining the possibilities here).
All you would need to do is update that table with an INSERT or UPDATE statement and the CSV file gets updated.

Symlink
Now we need to connect that CSV file to Apache. For that, we can use a Symbolic Link in Linux and connect the CSV into Apache by placing the link into the directory Apache reads files from (where ever your index.html file is, usually /var/www/html).
Now another thing to remember is that you need to give Apache rights to see the MySQL directory and be careful about doing that. An option could be to create a new database which would create a new directory on the hard disk and to let Apache see that directory. Do try not to let someone that might possible hack Apache to also hack MySQL, but I will leave that kind of thinking to the security analysts.

SDC XML configuration file
You need to add this symlink as a new rule in the SDC configuration file. You can see some examples there inside the file. When adding a new rule, you need to remember that SDC accesses Apache from the same machine, meaning your URLs should start with http://localhost/

Google Spreadsheets
Go into your Google Spreadsheets and go the to first cell (A1) and type:
=importData('http://localhost/[Your File Here]')
Hopefully, you should see the data from the MySQL CSV table.
If not, you might need to debug it some more.


Problems with SDC
The installation process wasn't as smooth as we would have liked but we did try something new at the time with MySQL. But apart from that, there are 2 problems we have noticed with SDC:
  1. Refresh rates
  2. High number of rows
SDC caches extremely aggressively and when we wanted to make changes to our CSV file, we still saw the old data which confused us through out the process.
A trick to help solve this is, if you goto Google Spreadsheets and in the importData function you change "http://localhost/sample.csv" to "http://localhost/sample.csv?a"
This will force Google to get the new data and you can change the "a" to whatever you like.

A high number of rows (I tried 31k) will get Google Spreadsheets stuck. And when I mean stuck, I don't mean that their server crashes or anything, just that you would wait a long long time to get any results. So you should just show the summarized data you wanted to show.


Conclusion

I am under the impression that this is a relatively new technology for Google and its not 100% tweaked and finalized (it was launched around the end of April 2009). I would assume that it would be good to invest time in structuring it now (meaning, you would be an early adopter), while Google catches up.