17 Mar 2008

Open Source ETL tools vs Commercial ETL tools

[Figure 1: Simple schematic for a data warehous... (image via Wikipedia)]
Recently my company asked me to make a case for open-source ETL/data integration tools as an alternative to the commercial data integration tool Informatica PowerCenter.
So I did a lot of research, and I'm going to try my best, considering I have never used either the open-source tools or the commercial one.

I found plenty of comparisons between Pentaho Kettle and Talend, which were two of the open-source tools I was supposed to research.
Now, without getting into a big argument (or Matt Casters posting on my blog), I'd like to attempt to compare the two, very briefly.
And again, this is ONLY from the research I did online and not based on my experience using the tools (since I don't really have any).


Pentaho Kettle vs Talend


Pentaho
Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
The company started around 2001 (2002 was when Kettle was integrated into it).
It has a strong community of 13,500 registered users.
It has a stand-alone Java engine that processes the jobs and tasks for moving data between many different databases and files.
It can schedule tasks (but you need a scheduler such as cron for that).
It can run remote jobs on "slave servers" on other machines.
It has data quality features: from its own GUI, by writing more customised SQL queries, JavaScript and regular expressions.
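
To make the data quality point a bit more concrete, here is a minimal sketch in plain Python (not Kettle or Talend code) of the kind of regular-expression check such a step applies to each row. The field names, the sample rows and the deliberately simple email pattern are my own assumptions, for illustration only.

```python
import re

# Hypothetical, deliberately simple pattern: something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def split_by_quality(rows):
    """Split rows into (valid, rejected) based on the 'email' field."""
    valid, rejected = [], []
    for row in rows:
        # Treat a missing value as an empty string and trim whitespace.
        email = (row.get("email") or "").strip()
        if EMAIL_PATTERN.match(email):
            # Normalise the accepted value before passing the row on.
            valid.append({**row, "email": email.lower()})
        else:
            rejected.append(row)
    return valid, rejected


# Made-up sample rows.
rows = [
    {"id": 1, "email": "Alice@Example.com "},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": None},
]
valid, rejected = split_by_quality(rows)
print(valid)
print(rejected)
```

In a real tool the rejected rows would typically be routed to an error file or table so someone can review them.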


Talend
Talend is an open-source data integration tool (not a full BI suite).
It uses a code-generating approach. It has a GUI, but within the Eclipse RCP.
It started around October 2006.
It has a much smaller community than Pentaho but has two finance companies supporting it.
It generates Java or Perl code, which you later run on your server.
It can schedule tasks (also by using schedulers like cron).
It has data quality features: from its own GUI, by writing more customised SQL queries and Java.


Comparison - (from my understanding)
Pentaho is faster (maybe twice as fast) than Talend.
Pentaho's GUI is easier to use than Talend's GUI and takes less time to learn.


My impression
Pentaho is easier to use because of its GUI.
Talend is more of a tool for people who are already writing a Java program and want to save lots and lots of time with a tool that generates code for them.



Assuming Pentaho made it to the next round....

Pentaho Kettle vs Informatica

Informatica
Informatica is a very good commercial data integration suite.
It was founded in 1993.
It is the market share leader in data integration (Gartner Dataquest).
It has 2,600 customers, including Fortune 100 companies, companies on the Dow Jones and government organizations.
The company's sole focus is data integration.
It has quite a big package for enterprises to integrate their systems and cleanse their data, and it can connect to a vast number of current and legacy systems.
It's very expensive, will require training some of your staff to use it and will probably require hiring consultants as well. (I hear Informatica consultants are well paid.)
It's very fast and can scale for large systems. It has "Pushdown Optimization", an ELT approach that uses the source database to do the transforming, like Oracle Warehouse Builder.


Comparison
Pentaho's JavaScript is very powerful when writing transformation tasks.
Informatica has many more enterprise features, for example, load balancing between database servers.
Pentaho's GUI requires less training than Informatica's.
Pentaho doesn't require huge upfront costs as Informatica does. (that part you saw coming, I'm sure)
(edited) Informatica is faster than Pentaho. Informatica has Pushdown Optimization, but with some tweaking to Pentaho and some knowledge of the source database, you can improve the speed of Pentaho. (also see line below)
(new) You can place Pentaho Kettle on many different servers (as many as you like, it's free) and use it as a cluster.
Informatica has much better monitoring tools than Pentaho.


My Impression
Informatica is a really good enterprise ETL suite, but is very big and expensive.
If the system is small enough, I would rather give Pentaho a try, and there are many use cases where big companies have used Pentaho (an airport, a hospital...).




Conclusion

I think Matt Casters said it best when he said:
The flood of open source software is going to wash away the proprietary ones..

If you want to add to (or correct) the information I wrote here, please do so, as I am still trying to understand these products myself.
Your opinion is valued.



Thank you for reading my blog.

31 comments:

  1. Hi!

    Great Post!

    A few additions:
    - with Kettle you don't have load-balancing, but you do have database partitioning. That is, many steps support distributing the outgoing data over a cluster of databases, which is great for sharding laaaaaarge data sets.

    Also, if your transformations are computationally heavy, you can cluster Kettle itself. So, you can fight increasing load times by throwing more (commodity) hardware at your Kettle cluster.

    Monitoring is indeed a bit limited but many things will be fixed in the upcoming version. See:

    http://www.ibridge.be/?p=92

  2. Thank you, Roland, for the info and, in particular, the link to the graphs. I know people in presentations like graphs very much.

    Also, I'd like to add that I found a price for Informatica. They have a Salesforce integration pack for $2,500 per month.
    http://www.salesforce.com/appexchange/detail_overview.jsp?id=a0330000001GTOYAA4
    There is also a very good demo video about how to use Informatica on the same web page.

  3. OK Jonathan, I promise not to post on your blog! :-D

  4. Why didn't you consider Datastage?

  5. I was only asked to research those particular tools. Sorry :S

    I was also supposed to research Inaport (http://www.inaplex.com/Products/inaport.aspx),
    which imports information into certain CRM software, but I didn't think it was relevant to this topic.

  6. Hi Jonathan,

    One thing you left out of your analysis is the number of connectors/adapters in each of the data integration tools, as in the real world many companies have a wide variety of legacy applications, many different types of databases, file formats, technologies, etc.

    Fernando A. Labastida

  7. Hi,

    thank you for the valuable information - great article!

    I have some experience with tools other than Informatica (DataStage and SAS, for example) and am also interested in the potential uses of Pentaho Kettle.
    I am currently preparing an article with a comparison of those tools.

    You may want to have a look here:
    http://etl-tools.info/en/examples/etl-solutions.htm

    Regards

  8. Hi,

    I don't share your point of view. I have tried PDI several times (each major version), but I always come back to Talend. Talend is ALWAYS faster (sometimes >34x!) and has a much nicer GUI. For example, PDI doesn't have any visual mapping tool like the tMap component in Talend...
    That's just my 2 cents!
    Regards

    Replies
    1. You are right, Nick. I am currently playing with both PDI and Talend Open Studio.
      Talend seems to be much faster.

  9. I completed a project just recently using the Pervasive Software ETL tools. I was very pleased (I would have custom coded otherwise).

    Is there anybody who has compared all the ETL tools available?

  10. As Nick said, the tMap component of Talend Open Studio is really convenient, since it provides a graphical and functional view of integration processes that makes ETL operations clearer for non-technical users.

  11. Very interesting article. However, isn't it true that you are really only looking at the low end of the market here?

    I have been developing ETL solutions as a consultant for about 7 years now. Six of those years have been using the tool Ab Initio. And while I am a long time Open Source fan (and sometimes developer), I can't even remotely compare any of the tools you have mentioned to the power, flexibility and overall capability of Ab Initio.

    Now, I have used Informatica before (shudder, yuck!) and can definitely see how up and coming OS tools such as Pentaho and Talend may be able to take that market by storm.

    Anyhow, just some thoughts to consider...

  12. I'm disappointed that Oracle Warehouse Builder didn't get more of a mention here, except for the comparison to INFA's "pushdown" feature (which really isn't directly comparable to what OWB does with in-database transformation etc.).

    I know, it's primarily (though not exclusively) Oracle-only, but it compares pretty favorably to INFA at a much lower price. And it has data quality options.

  13. I'm disappointed that Oracle Warehouse Builder didn't get more of a mention here, except for the comparison to INFA's "pushdown" feature (which really isn't directly comparable to what OWB does with in-database transformation etc.).

    I know, it's primarily (though not exclusively) Oracle-only, but it compares pretty favorably to INFA at a much lower price. And it has data quality options.

    Disclosure: I'm one of the OWB product managers... so of course I'm a bit fond of my baby... but it's well worth a look vs. any open source option if your primary target database is Oracle.

    Replies
    1. What about HDFS as the target system? How does OWB measure up there?

  14. Good article. I'm looking into open source ETL solutions to augment an Informatica installation. All those Informatica add-ons keep adding up!
    I have been looking at Talend quite a bit, but will now give Kettle a good look.
    James

  15. Hi James L,

    I know how difficult it is to choose an ETL. Especially when it is for your company. Open source is a good choice, proprietary solutions just keep adding up...

    Between both ETL tools, I would definitely go for Talend. The community is very active, many components are available on the Talend website, the program has a professional GUI designed for IT experts as well as business users. There are also different programs in the Talend product suite enabling you to perform data quality, profiling...

  16. We have done a thorough evaluation of Kettle vs Talend. I cannot possibly see how anyone would feel that Kettle can compare to Talend, after using Talend. Talend is commercial grade in stability, performance, transaction handling, extremely robust development environment, ease of use, and scalability. I can do with one tMap component in Talend what takes several components in Kettle. What's more, Talend has over 400 components in my palette to choose from. I can create reusable joblets, deploy secure .exe runtime code with a commercial grade admin console, and get ETL documentation generated automatically. For those looking for an ETL tool, choose Talend, and save yourself hundreds of hours of headaches. Talend should be compared to DataStage, SSIS, or Informatica, not Kettle.

  17. Hi,

    Can anybody tell me which ETL tools other than Kettle can support the following types of databases and file formats?
    1- Shape files
    2- ASCII files
    3- Access
    4- Excel

  18. Great post. A good comparison, which is needed in the current ETL industry.
    I might add a couple more test cases for each of the tools, especially in terms of data processing: the time each tool takes and the complexity involved in processing the data.
    I tried using Talend, but since I'm not from a Java background, it was hard for me. On the other hand, Infa is more GUI-based, so it is easy for anyone (of course, training is required).
    To answer the question "which ETL tool other than Kettle can support the following different types of databases, file formats?",
    to my knowledge:
    1- Shape files - I have never dealt with these, but you can use a custom transformation in Infa written in C, or you can use a Java transformation. With Talend, I'm sure Java can handle this very well.

    2- ASCII files
    Yes. Infa can, Talend can.
    3- Access
    Yes. Infa can, Talend can, and other tools can too, since this is a small DB.
    4- Excel
    Informatica can handle it. Talend can too, with custom Java code.

  19. Hi,

    Thanks for your reply.
    Is there any Java API to handle shape files? Handling Excel, Access and flat files will not be a big issue, but handling shape files is an issue for me at the moment.

    I tried the GeoKettle tool, which handles shape files, but I am unable to use it (or any GeoKettle API) in my project to read a shape file.

    any comments?

    Best Regards
    Tabbasum

  20. There is also a pretty comprehensive comparison of ETL vendors here:
    http://www.adeptia.com/products/etl_vendor_comparison.html

    This includes competitive analysis for 40+ parameters for the following products: Informatica, DataStage, Adeptia ETL Suite, Pervasive, Pentaho and Talend.

  21. Really concise and complete overview of ETL and BI tools. Great work and thanks for sharing.

  22. Try Scriptella ETL. It's an open source (Apache licensed!) fast, powerful and simple ETL written in Java with an easy-to-use one line of code Java integration, also supports Spring.

  23. Thanks, nice post

  24. Very good post. I would like to have your opinion now that Pentaho has been acquired by Hitachi, and some information on the internet shows Talend making great progress in the past three years. Is that true?

  25. Thanks a lot for this nice article.
    I'd like to ask a question, please:
    I'm going to develop a Java web application and I want to integrate data from different sources using an ETL tool. Which of Pentaho (Kettle) or Talend should I use?
    Is there an API or are there JARs that I can add to my application?

    Thanks in advance :)

  26. Thanks for this nice article

    I'd like to ask, please: I'm going to develop a Java web application and I want to integrate data from different sources.
    Is there an API or are there JARs from Pentaho or Talend that I can add to my application to do that?
    Or what should I do?

    Thanks in advance :)

  27. I think you can speak to both Pentaho and Talend about doing that, or even pay them to develop a customer-facing BI suite with your logo on it.

  28. Does anyone here use SAP Data Services?

  29. Does anyone here use SAP Data Services? How does it compare to Pentaho?
