23 Jan 2008

The Journals of a Self Confessed Data Warehousing Noob

Data Warehousing Continued

Part of the reason I was hired by my company was to help the company’s largest customer connect their data warehouse to the company’s website database. I have data basing experience but I have never been involved in such a large project (don’t tell my bosses). Luckily for me, I wont be doing everything, but for starters, I had to get some customer data, mainly email addresses and build a newsletter mailing list. Sounds simple enough..
Unfortunately for me, I had no idea how difficult it is to take even simple data fields from 2-3 sources and merge it into one. Some of the systems were so old, that back when they recorded the email address, there were no validation rules to enforce the input from the users. I saw emails like this http://www.163.com@yingying837.cn (163.com is a Chinese email server). It seemed to me that the people who inputted this email address only recently purchased a computer.

Anyway, after looking at the long list, running some filters and changing by hand some things, I figured I have 95% cleaned and the rest will be done with the email program to check if the email address still exists.

I was amazed how complicated it is to data cleanse old systems and afterwards make it comply with a data warehouse schema. Especially in the case of data that’s in more the one languages and even more so when it’s a language that uses characters like Chinese.


Conclusion

From now on I plan to put a strong emphasis on data quality from any data base I start working on.
I think it would be good to use data quality methods in my simplified business rules engine idea and this whole thing made me realize about possible directions my “new pet project” might go. I have already written a few things down about it on paper.

Thank you for reading my blog.