Lessons Learned through working with Infobright

Infobright is a great database technology for reporting and analytics, but it also comes with its own quirks.
I'd like to share those quirks with you now in an effort to make transitioning to it smoother:

  1. No user-defined functions - no select my_special_calculations(col1) from table; - unless you have a tiny dataset, it is practically unusable.
  2. No Views - you can use views, but they will not use the infobright optimizer and will use the MySQL one which will make it very slow.
  3. coalesce(col1,'') will cause query to infinite loop - use ifnull(col1,'') instead. coalesce(col1,col2,'') should work ok.
  4. Do not use "funny" dates or dates that don't exist - it will return no results. For example, 31st of June doesn't exist as June only goes up to 30.
  5. Do not use "select * from" with infobright - too slow.
  6. Enterprise version only - UPDATE/DELETE/REPLACE will be slow and in the case of UPDATE, fragment the data on disk.
    1. I try to use DELETE and INSERT INTO SELECT FROM to update data, but I have been stung a lot if my deletes are too large - they take forever - and updating individual rows is slow as well. For short, try to update as little as possible of the data that is already in infobright and try to do it in small chunks. Bulk INSERTS or LOAD DATA INFILE are no problem.
  7. Do not use scalar sub-queries: select id,..... (select id from newtable) as new_id from very_large_table where.. - practically guarantees infinite loops (aka takes forever).
  8. Querying a large table (with very few WHERE clauses) AND adding many joins can still be slow. Try the 'star schema' approach of one large (fact) table with a lot of rows but a few columns and join it to a few smaller (dimension) table with a few rows but many columns. Very probable that the dimension tables are demoralized to reduce joins.
  9. Do not mix and match (infobright)brighthouse tables with myisam tables in your joins. Can really take forever. (Infobright doesn't come with innodb....yet)
  10. Try to use queries like - Select date, sum(col1), avg(col2), max(col3) from yourtable where... for best results. For example
Select sum(units) from our_one_tera_byte_table where id = 4
+------------------+
|     154964394949 |
+------------------+
1 row in set (19.07 sec)
A row-based database cannot give you a result that fast.


Some more additions:
11) There was an issue with getting 'time' from tables and doing a datediff on them. Example:
ib1@360reporting@16:29:20> SELECT c.start_time, DATEDIFF(DATE(c.start_time), DATE(NOW())), DATEDIFF('2014-01-01', DATE(NOW())) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 3;
+---------------------+-------------------------------------------+-------------------------------------+
| start_time          | DATEDIFF(DATE(c.start_time), DATE(NOW())) | DATEDIFF('2014-01-01', DATE(NOW())) |
+---------------------+-------------------------------------------+-------------------------------------+
| 2014-02-17 11:43:45 |                                    682088 |                                -148 |
| 2014-03-26 00:00:00 |                                    682125 |                                -148 |
| 2014-01-03 00:00:00 |                                    682043 |                                -148 |
+---------------------+-------------------------------------------+-------------------------------------+
3 rows in set (0.01 sec)

Solution was to use dateformat on it. Example:
SELECT c.start_time, DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 1;                             +---------------------+-------------------------------------------------------+
| start_time          | DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) |
+---------------------+-------------------------------------------------------+
| 2014-02-17 11:43:45 |                                                  -101 |
+---------------------+-------------------------------------------------------+
1 row in set (0.00 sec)

12) Doing SELECT ... GROUP BY.. WITH ROLLUP (without ORDER BY) crashes your infobright server.. on linux and windows.




Useful Reporting Queries for Your Business

I have rehashed a list of useful reporting queries from my blog which I hope people would find useful for their business or clients. These are not MySQL specific, by any means.

Orders

  • What is the total revenue (by segments)? - You are looking for 'peaks and troughs'. For example, found a peak season? you can plan better for it next year. Found an off-season? perhaps offer some deals for that period.
  • What is the total placed orders, completed orders and abandoned/uncompleted orders?
  • What is the total paid revenue vs outstanding orders? 
  • What is the total number of returned/cancelled orders and lost revenue from them?

Products (Best Sellers)

  • Which are the top 20% best selling products? (businesses love 80/20 comparisons and if you look at reports a lot, you will too)
  • Which are the products that our best customers keep buying from us? - the library example - don't just keep the best selling books, like any supermarket would have. Have a range of books that attract your best customer.
  • Which are the bottom 10-20% worst selling products? - you may want to discontinue these, depending on your top customers
  • Which products generate the most revenue? – (product price * quantity)
  • Which products generate the most profit? – ((product price – product cost) * quantity) 

Customers

  • How many new accounts/customers do we get (each day, week, month, half year or year)?
  • Who are our best customers – that buy the most from us? - The top 20% that make 80% of the profits.
  • What is our attrition rate – the numbers of accounts/customers we lose (each day, week, month, half year or year)? - If your attrition rate is high, the sales team needs to find a lot of new people just to keep things the same.
  • What is the customer satisfaction? – You will probably get this from surveys or customer feedback.
  • What are the top reasons for complaints?
  • How many existing customers referred new customers to your company? – You might need to setup something to collect this data. If you send your customers to do your marketing for you, you can save a lot of money.

Website

  • Which product pages get the most views?
  • Which news/blog articles get the most views?
  • How many minutes does the user spend on the website?
  • What is the percentage of abandoned shopping carts?
  • What is the average amount of time a customer spends in the shopping cart/session?
  • Most important – What is the conversion rate of the visitors coming to the site and the people who purchase a product?
  • Which landing pages are the most viewed?
  • Which landing pages have the highest conversion rate?


Again, I hope these are helpful and if you would like me to add some of your own, please comment below.



Imagine CPUs will not get any faster

http://bit.ly/15fcYLt

I read an article recently, about how mobile apps will probably not get the hardware boost that people are expecting.

This is partially to do with that CPU performance hitting a sort of (heat) wall and cannot improve on their speed. As Linley Gwenapp said “we’ve been falling behind Moore’s Law ever since Intel hit the power wall back in 2005”.


I myself have noticed that on a few occasions when companies decided to buy an expensive machine for their main server, it turned out to perform slower on queries than their previous soon-to-be-updated machine.

In a recent example, a three year old server with CPUs that have 2.66Ghz clock speed was almost twice as fast as a brand new machines with CPUs that have 2.3Ghz clock speed. I'm not exactly sure, but the new machines probably have several times more cache on the CPUs, probably better instruction set and the hosting company swore that it is several times faster than the old machine. However, our results - specifically to MySQL - have been discouraging.


After reading the article, I would like to suggest a thought exercise:
As DBAs, what would happen if CPUs never improve. As in, their clock speed never improves.
They can probably add more cores, fit in more cache, maybe even double the size of the CPU on the motherboard. However, their basic core performance for single threaded applications would not improve.

What would you do?

How would you solve your current company's needs?

How would you solve your future company's needs in the face of issues such as Big Data?


In my opinion, MySQL will need to break up anything that needs to be single-threaded as much as possible. This would probably not be easy. Adding a Map/Reduce layer to MySQL may help this - it works for other commercial database vendors: Infobright, Greenplum, (I think also) Oracle.
(I am not sure if Oracle may be inclined to improve MySQL's processing of large amounts of data as it may hurt profitable parts of their business.)

Sharding can and has helped companies solve this problem. This breaks up the problem by having the single threads process less data per shard. I am not sure about the available and mature solutions there are if you need to group data across several shards.

Regarding hardware, there is certainly room for "SQL" chips (think Kickfire) and other FPGAs.
Hardware compression could help, especially compression that can spread across cores, but the actual processing of the data after decompression would still be single threaded.

Summary tables could very well help for certain workloads as they pre-process large amounts of data for you into more manageable sizes. In addition to using Hadoop and if you have a person that can model data properly, it can be a very long term solution.
Perhaps pre-processing would be a much bigger thing in the future. As in, you speed your queries now by preparing the answers ahead of time and caching them.


I would like to hear more approaches to solve this problem, but I would prefer the solution to lean on the side of 'tried and tested'.