17 Apr 2008

My First Data Warehouse

I finally finished my first data warehouse! and it only took me 3 days!

Well, to be fair, the data warehouse design was already planned and it wasn't really that big anyway, but I am still happy about it.

I was asked on Monday to do a data warehouse for my company's head quarters in Germany. I work in Beijing, so its like.... very slow to connect to there. They gave me the database design, some SQL statements to generate a few dimensions and "rough" business rules for the data.

Now, I haven't done anything like this before, but I really wanted to try. So I did it my way.

My way is to use a lot of Views with long SQL statements instead of cursors or stored procedures. I like it this way, because I feel like I can see the data and catch problems instead of programming blindly to find out later that there is a big problem. So basically, for me, its more comfortable.

I took the SQL statements that loaded the dimensions, cleaned them a bit and put them in Views. The Views in Oracle (we used Oracle for this data warehouse. I told them MySQL is better, but they preferred Oracle)... the views are very restrictive. I thought it was good, because it made me confident that they will work properly. However, Views don't allow for UNION, so for that I used Materialized Views.

Materialized Views was something I learned while doing this data warehouse and I think they are very cool, to be honest. It allowed me to create a sort of temp table that more or less needed no code from me to maintain. I can refresh it with 1 line in a stored procedure. I could also schedule it to refresh, but I didn't learn how to do that. I also used Materialized Views to replace views that needed too much calculations and I got REALLY annoyed waiting for it finish from the server in Germany.

I checked my SQL statements with looking at the Views and using a lot of count()'s on many different columns. Count()'s don't count NULL values and NULL is something that cant go into a fact table of a data warehouse. I also brushed up on my AND/OR conditions when I had missing data from one table and I had to find it in maybe another table. After that, I took this value and INNER JOIN'd it with another table. So an example that worked well for me is:
left join src_dim_team_all dt on ((u2dt.TEAM = dt.TEAM)) or
((u2dt.TEAM is null) and (dt.TEAM = f.TEAM))
In this example, the join takes the data from the first table.. or.. if the first table's value is null, it takes it from the second table. A simple OR didn't work for me and I only found that out when I tested it with some count()'s.


In the end, 99.99% of the work was done using Views and 1 stored procedure to do the:
insert into mydimension from select * from myview
which made it very very easy for me to abstract the complexity into several Views.


At the end, the final extracting, transforming and loading was ridiculous. EVERY tiny thing took 5-10 mins to run on the German server and after that I had to test it if its correct.

However, I enjoyed this experience and it would be one more thing I can add on my CV.


Thanks for reading my blog.