Get the 95% for Your Index Prefix

I was playing around with an idea recently...
I wanted to find out, what is the lowest number of characters needed to satisfy 95% of the values in a column? 95% is to rule out outliers.

I plan on using this when I want to get a bit agressive with the indexes on a table that gets inserted to very often. But until now, I haven't had a good query to find it quickly.

So, I thought a bit and came up with the following query:

mysql> show create table filenames\G
*************************** 1. row ***************************
       Table: show_filename
Create Table: CREATE TABLE `filenames` (
  `id` int(11) NOT NULL,
  `filename` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `filename` (`filename`),
1 row in set (0.00 sec)

mysql> select min(cc.l) as ninty_five_percent (
select length(filename) as l, count(distinct filename) as c, @ccount :=  @ccount + count(distinct filename) as t from filenames, (select @ccount := 0) as foo group by length(filename)) as cc where cc.t >= (@ccount*0.95) order by cc.t;
|  ninty_five_percent    |
|                  48         |
1 row in set (34.67 sec)

I now can run:

ALTER TABLE filenames DROP INDEX `filename`, ADD INDEX `filename`(`filename`(48));

95% may not be the ideal percentage to use, but it probably will reduce the size of outlier values.
It can take me a while to get a good percentage that reduces the index size, reduce the INSERT/UPDATE overhead, but still give good response time to SELECT queries.
I'm going to try this query to help me get there.

Thanks to Shlomi Noach for his talk in Percona London that helped inspire this query.

1 comment:

  1. This query answers, "95% of file names are length X or shorter."
    To choose an index prefix, you need, "95% of file names differ in the first X characters."
    One solution is