My Job in a Nutshell

(http://www.commitstrip.com/en/2014/08/01/when-i-help-a-rookie-coder-fix-his-queries/)

Lessons Learned through working with Infobright

Infobright is a great database technology for reporting and analytics, but it also comes with its own quirks.
I'd like to share those quirks with you now in an effort to make transitioning to it smoother:

  1. No user-defined functions - no select my_special_calculations(col1) from table; - unless you have a tiny dataset, it is practically unusable.
  2. No Views - you can use views, but they will not use the infobright optimizer and will use the MySQL one which will make it very slow.
  3. coalesce(col1,'') will cause query to infinite loop - use ifnull(col1,'') instead. coalesce(col1,col2,'') should work ok.
  4. Do not use "funny" dates or dates that don't exist - it will return no results. For example, 31st of June doesn't exist as June only goes up to 30.
  5. Do not use "select * from" with infobright - too slow.
  6. Enterprise version only - UPDATE/DELETE/REPLACE will be slow and in the case of UPDATE, fragment the data on disk.
    1. I try to use DELETE and INSERT INTO SELECT FROM to update data, but I have been stung a lot if my deletes are too large - they take forever - and updating individual rows is slow as well. For short, try to update as little as possible of the data that is already in infobright and try to do it in small chunks. Bulk INSERTS or LOAD DATA INFILE are no problem.
  7. Do not use scalar sub-queries: select id,..... (select id from newtable) as new_id from very_large_table where.. - practically guarantees infinite loops (aka takes forever).
  8. Querying a large table (with very few WHERE clauses) AND adding many joins can still be slow. Try the 'star schema' approach of one large (fact) table with a lot of rows but a few columns and join it to a few smaller (dimension) table with a few rows but many columns. Very probable that the dimension tables are demoralized to reduce joins.
  9. Do not mix and match (infobright)brighthouse tables with myisam tables in your joins. Can really take forever. (Infobright doesn't come with innodb....yet)
  10. Try to use queries like - Select date, sum(col1), avg(col2), max(col3) from yourtable where... for best results. For example
Select sum(units) from our_one_tera_byte_table where id = 4
+------------------+
|     154964394949 |
+------------------+
1 row in set (19.07 sec)
A row-based database cannot give you a result that fast.


Some more additions:
11) There was an issue with getting 'time' from tables and doing a datediff on them. Example:
ib1@360reporting@16:29:20> SELECT c.start_time, DATEDIFF(DATE(c.start_time), DATE(NOW())), DATEDIFF('2014-01-01', DATE(NOW())) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 3;
+---------------------+-------------------------------------------+-------------------------------------+
| start_time          | DATEDIFF(DATE(c.start_time), DATE(NOW())) | DATEDIFF('2014-01-01', DATE(NOW())) |
+---------------------+-------------------------------------------+-------------------------------------+
| 2014-02-17 11:43:45 |                                    682088 |                                -148 |
| 2014-03-26 00:00:00 |                                    682125 |                                -148 |
| 2014-01-03 00:00:00 |                                    682043 |                                -148 |
+---------------------+-------------------------------------------+-------------------------------------+
3 rows in set (0.01 sec)

Solution was to use dateformat on it. Example:
SELECT c.start_time, DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) FROM ctable c WHERE c.start_time > '2014-01-01' LIMIT 1;                             +---------------------+-------------------------------------------------------+
| start_time          | DATEDIFF(date_format(c.start_time,'%Y-%m-%d'), now()) |
+---------------------+-------------------------------------------------------+
| 2014-02-17 11:43:45 |                                                  -101 |
+---------------------+-------------------------------------------------------+
1 row in set (0.00 sec)

12) Doing SELECT ... GROUP BY.. WITH ROLLUP (without ORDER BY) crashes your infobright server.. on linux and windows.




Useful Reporting Queries for Your Business

I have rehashed a list of useful reporting queries from my blog which I hope people would find useful for their business or clients. These are not MySQL specific, by any means.

Orders

  • What is the total revenue (by segments)? - You are looking for 'peaks and troughs'. For example, found a peak season? you can plan better for it next year. Found an off-season? perhaps offer some deals for that period.
  • What is the total placed orders, completed orders and abandoned/uncompleted orders?
  • What is the total paid revenue vs outstanding orders? 
  • What is the total number of returned/cancelled orders and lost revenue from them?

Products (Best Sellers)

  • Which are the top 20% best selling products? (businesses love 80/20 comparisons and if you look at reports a lot, you will too)
  • Which are the products that our best customers keep buying from us? - the library example - don't just keep the best selling books, like any supermarket would have. Have a range of books that attract your best customer.
  • Which are the bottom 10-20% worst selling products? - you may want to discontinue these, depending on your top customers
  • Which products generate the most revenue? – (product price * quantity)
  • Which products generate the most profit? – ((product price – product cost) * quantity) 

Customers

  • How many new accounts/customers do we get (each day, week, month, half year or year)?
  • Who are our best customers – that buy the most from us? - The top 20% that make 80% of the profits.
  • What is our attrition rate – the numbers of accounts/customers we lose (each day, week, month, half year or year)? - If your attrition rate is high, the sales team needs to find a lot of new people just to keep things the same.
  • What is the customer satisfaction? – You will probably get this from surveys or customer feedback.
  • What are the top reasons for complaints?
  • How many existing customers referred new customers to your company? – You might need to setup something to collect this data. If you send your customers to do your marketing for you, you can save a lot of money.

Website

  • Which product pages get the most views?
  • Which news/blog articles get the most views?
  • How many minutes does the user spend on the website?
  • What is the percentage of abandoned shopping carts?
  • What is the average amount of time a customer spends in the shopping cart/session?
  • Most important – What is the conversion rate of the visitors coming to the site and the people who purchase a product?
  • Which landing pages are the most viewed?
  • Which landing pages have the highest conversion rate?


Again, I hope these are helpful and if you would like me to add some of your own, please comment below.