Lessons Learned through working with Infobright

Infobright is a great database technology for reporting and analytics, but it also comes with its own quirks.
I'd like to share those quirks with you now in an effort to make transitioning to it smoother:

  1. No user-defined functions - no select my_special_calculations(col1) from table; - unless you have a tiny dataset, it is practically unusable.
  2. No Views - you can use views, but they will not use the infobright optimizer and will use the MySQL one which will make it very slow.
  3. coalesce(col1,'') will cause query to infinite loop - use ifnull(col1,'') instead. coalesce(col1,col2,'') should work ok.
  4. Do not use "funny" dates or dates that don't exist - it will return no results. For example, 31st of June doesn't exist as June only goes up to 30.
  5. Do not use "select * from" with infobright - too slow.
  6. Enterprise version only - UPDATE/DELETE/REPLACE will be slow and in the case of UPDATE, fragment the data on disk.
    1. I try to use DELETE and INSERT INTO SELECT FROM to update data, but I have been stung a lot if my deletes are too large - they take forever - and updating individual rows is slow as well. For short, try to update as little as possible of the data that is already in infobright and try to do it in small chunks. Bulk INSERTS or LOAD DATA INFILE are no problem.
  7. Do not use scalar sub-queries: select id,..... (select id from newtable) as new_id from very_large_table where.. - practically guarantees infinite loops (aka takes forever).
  8. Querying a large table (with very few WHERE clauses) AND adding many joins can still be slow. Try the 'star schema' approach of one large (fact) table with a lot of rows but a few columns and join it to a few smaller (dimension) table with a few rows but many columns. Very probable that the dimension tables are demoralized to reduce joins.
  9. Do not mix and match (infobright)brighthouse tables with myisam tables in your joins. Can really take forever. (Infobright doesn't come with innodb....yet)
  10. Try to use queries like - Select date, sum(col1), avg(col2), max(col3) from yourtable where... for best results. For example
Select sum(units) from our_one_tera_byte_table where id = 4
+------------------+
|     154964394949 |
+------------------+
1 row in set (19.07 sec)
A row-based database cannot give you a result that fast.


Some more additions:
11) There was an issue with getting 'time' from tables and doing a datediff on them. Example:
ib1@360reporting@16:29:20> SELECT c.start_time, DATEDIFF(DATE(c.start_time), DATE(NOW())), DATEDIFF('2014-01-01', DATE(NOW())) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 3;
+---------------------+-------------------------------------------+-------------------------------------+
| start_time          | DATEDIFF(DATE(c.start_time), DATE(NOW())) | DATEDIFF('2014-01-01', DATE(NOW())) |
+---------------------+-------------------------------------------+-------------------------------------+
| 2014-02-17 11:43:45 |                                    682088 |                                -148 |
| 2014-03-26 00:00:00 |                                    682125 |                                -148 |
| 2014-01-03 00:00:00 |                                    682043 |                                -148 |
+---------------------+-------------------------------------------+-------------------------------------+
3 rows in set (0.01 sec)

Solution was to use dateformat on it. Example:
SELECT c.start_time, DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 1;                             +---------------------+-------------------------------------------------------+
| start_time          | DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) |
+---------------------+-------------------------------------------------------+
| 2014-02-17 11:43:45 |                                                  -101 |
+---------------------+-------------------------------------------------------+
1 row in set (0.00 sec)

12) Doing SELECT ... GROUP BY.. WITH ROLLUP (without ORDER BY) crashes your infobright server.. on linux and windows.




3 comments:

  1. Don't forget community edition. You can't even UPDATE or DELETE rows in that case. From my experience with Infobright they really don't want you to change rows after they are written - it /really/ slows things down. I think this is because UPDATE and DELETE are not in-place, and instead modify the knowledge grid so DELETING doesn't really free up any space, just "delete marks" the row in the KG (knowledge grid). Please correct me if that is untrue.

    The reason your last query is fast is because Infobright can answer questions from the KG instead of having to decompress the columns in the table. SUM, COUNT, etc, can often be answered from the KG itself instead of reading the table. Each "packet" in the KG contains the min/max/avg/etc for the packet so the KG can be used instead of the actual data.

    ReplyDelete
    Replies
    1. One more thing: you mention star schema are OK, but Inforbright really recommends one big flat denormalized table with no joins. They always complain that my benchmarks have joins :)

      Delete
    2. @Justin

      So DELETEs/UPDATEs and fragmentation is a topic onto itself. What I understood from infobright that is you update rows individually, then you will cause fragmentation. But if you delete a large or very large chunk of data then fill it with new data, then you do not fragment the table.

      I'm very familiar with the knowledge grid. Technically, my example didn't fully use the KG, because then the reply would have been sub-second. FYI, infobright EE also have approximate SELECTS or rough SELECTs which purely get data from the KG in the event you just want some results fast - http://www.businesswire.com/news/home/20140219005483/en#.U4IvSvldV8E

      Infobright and other columnar store vendors (I know Vertica also) likes to have just a very big table. Infobright EE has something called domain expert to shrink strings even further. But it was not practical for my case. If I remember correctly, I had a situation where 1 id had 10 text fields that were needed in some report and that was data that was added onto the original log ingest.
      And if you throw in the need to do updates, it helps my use case more to have it in a star schema.
      However, if you have just machine data/logs and only need to analyse that, then by all means, keep it in one large table in infobright.

      Delete