24 May 2014

Lessons Learned through working with Infobright

Infobright is a great database technology for reporting and analytics, but it also comes with its own quirks.
Edit: Every single one of the columnar database technologies, and even the standard database technologies, has quirks. Infobright does not have an unusually large number of them compared to anyone else. Other columnar store vendors simply won't tell you what their quirks are and you are left to discover them on your own. In some cases, they can be crippling or "maybe we need something else even though we've spent 0.X million euros on it" types of quirks.

I'd like to share those quirks with you now in an effort to make transitioning to it smoother:

  1. No user-defined functions - no select my_special_calculations(col1) from table; - unless you have a tiny dataset.
  2. No views - you can use views, but they will not use the Infobright optimizer; they fall back to the MySQL one, which makes them very slow.
  3. coalesce(col1,'') will cause the query to loop indefinitely - use ifnull(col1,'') instead. coalesce(col1,col2,'') should work OK.
  4. Do not use "funny" dates or dates that don't exist - the query will simply return no results. For example, the 31st of June doesn't exist, as June only has 30 days.
  5. UPDATE/DELETE/REPLACE are available in the Enterprise version only - they will be slow and, in the case of UPDATE, will fragment the data on disk.
    1. I try to use DELETE and INSERT INTO ... SELECT to update data, but I have been stung a lot when my deletes are too large - they take forever - and updating individual rows is slow as well. In short, try to touch as little as possible of the data that is already in Infobright, and do it in small chunks (see the first sketch after this list). Bulk INSERTs or LOAD DATA INFILE are no problem.
  6. Do not use scalar sub-queries: select id,..... (select id from newtable) as new_id from very_large_table where..
  7. Querying a large table (with very few WHERE clauses) AND adding many joins can still be slow. Try putting the large table in its own sub-query (see the second sketch after this list): select * from (select date, .. from largetable group by ...) inner join othertable ..
  8. Denormalize the smaller tables that you are joining on as much as possible - similar to star schema/dimensional tables
  9. Do not mix and match Infobright (brighthouse) tables with MyISAM tables in your joins. It can really take forever. (Infobright doesn't come with InnoDB... yet.)
  10. For best results, try to use queries like: Select date, sum(col1), avg(col2), max(col3) from yourtable where... For example:
Select sum(units) from our_one_tera_byte_table where id = 4
+------------------+
|     154964394949 |
+------------------+
1 row in set (19.07 sec)
A row-based database simply cannot give you a result that fast.
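
As a rough illustration of the chunked refresh pattern from point 5.1 - this is only a sketch, and the table and column names (fact_sales, stage_sales, sale_date, units) are made up for the example:

-- Refresh one day of data in a small chunk instead of updating rows in place.
DELETE FROM fact_sales WHERE sale_date = '2014-05-01';

INSERT INTO fact_sales (sale_date, id, units)
SELECT sale_date, id, units
FROM stage_sales
WHERE sale_date = '2014-05-01';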
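
And here is a sketch of the "large table in its own sub-query" trick from point 7, again with hypothetical table names (fact_sales, dim_store):

-- Aggregate the big table first, then join the much smaller result to the dimension table.
SELECT agg.sale_date, agg.total_units, d.region
FROM (
  SELECT sale_date, store_id, SUM(units) AS total_units
  FROM fact_sales
  GROUP BY sale_date, store_id
) agg
INNER JOIN dim_store d ON d.store_id = agg.store_id;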


Some more additions:
11) There was an issue with getting a 'time' value from tables and doing a DATEDIFF on it. Example:
ib1@360reporting@16:29:20> SELECT c.start_time, DATEDIFF(DATE(c.start_time), DATE(NOW())), DATEDIFF('2014-01-01', DATE(NOW())) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 3;
+---------------------+-------------------------------------------+-------------------------------------+
| start_time          | DATEDIFF(DATE(c.start_time), DATE(NOW())) | DATEDIFF('2014-01-01', DATE(NOW())) |
+---------------------+-------------------------------------------+-------------------------------------+
| 2014-02-17 11:43:45 |                                    682088 |                                -148 |
| 2014-03-26 00:00:00 |                                    682125 |                                -148 |
| 2014-01-03 00:00:00 |                                    682043 |                                -148 |
+---------------------+-------------------------------------------+-------------------------------------+
3 rows in set (0.01 sec)

The solution was to use date_format on it. Example:
SELECT c.start_time, DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 1;
+---------------------+-------------------------------------------------------+
| start_time          | DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) |
+---------------------+-------------------------------------------------------+
| 2014-02-17 11:43:45 |                                                  -101 |
+---------------------+-------------------------------------------------------+
1 row in set (0.00 sec)

12) Currently the best method is in fact to have everything (long strings and all) in one large table (see the sketch below). The data is stored in a very compressed way, and when generating results Infobright does not have to create a temp table the way it does for joins.
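
A minimal sketch of what that can look like - one flat brighthouse table populated with a bulk load; the table, columns and file path are placeholders, not from a real system:

CREATE TABLE flat_events (
  event_time DATETIME,
  user_id INT,
  country VARCHAR(64),
  message VARCHAR(1000)
) ENGINE=BRIGHTHOUSE;

-- Bulk loads compress well and avoid slow row-by-row changes.
LOAD DATA INFILE '/tmp/events.csv'
INTO TABLE flat_events
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';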



3 comments:

  1. Don't forget the community edition. You can't even UPDATE or DELETE rows in that case. From my experience with Infobright, they really don't want you to change rows after they are written - it /really/ slows things down. I think this is because UPDATE and DELETE are not in-place; instead they modify the knowledge grid, so a DELETE doesn't really free up any space, it just "delete marks" the row in the KG (knowledge grid). Please correct me if that is untrue.

    The reason your last query is fast is because Infobright can answer questions from the KG instead of having to decompress the columns in the table. SUM, COUNT, etc, can often be answered from the KG itself instead of reading the table. Each "packet" in the KG contains the min/max/avg/etc for the packet so the KG can be used instead of the actual data.

    Replies
    1. One more thing: you mention star schemas are OK, but Infobright really recommends one big flat denormalized table with no joins. They always complain that my benchmarks have joins :)

    2. @Justin

      So DELETEs/UPDATEs and fragmentation is a topic unto itself. What I understood from Infobright is that if you update rows individually, you will cause fragmentation. But if you delete a large or very large chunk of data and then fill it with new data, you do not fragment the table.

      I'm very familiar with the knowledge grid. Technically, my example didn't fully use the KG, because then the reply would have been sub-second. FYI, Infobright EE also has approximate SELECTs or rough SELECTs, which get data purely from the KG in the event you just want some results fast - http://www.businesswire.com/news/home/20140219005483/en#.U4IvSvldV8E

      Infobright and other columnar store vendors (I know Vertica does too) like to have just one very big table. Infobright EE has something called domain expert to shrink strings even further, but it was not practical for my case. If I remember correctly, I had a situation where one id had 10 text fields that were needed in some report, and that was data that was added onto the original log ingest.
      And if you throw in the need to do updates, it helps my use case more to have it in a star schema.
      However, if you have just machine data/logs and only need to analyse that, then by all means, keep it in one large table in infobright.
