Comments on DB Performance: SQL is in Fact a Programming Language

i come from a sybase/db2 programming background. w...

2008-07-15T16:32:00.000+01:00

i come from a sybase/db2 programming background. we use the entity attribute value pattern very very extensively. there is a very simple way to get around complex queries. use the pattern

select entity.id,
case when attribute.attrib_id = 1 then attribute.attrib_value end as attrib1,
case when attribute.attrib_id = 2 then attribute.attrib_value end as attrib2
from entity e, attribute a
where e.entity_id = a.entity_id
and (your conditions here)

im not sure if the above is possible in mysql.

btw we use the above in entities and attributes having upwards of 50 million rows (and more than 20gb each) without any problems. if you have proper indexes queries are fairly fast.

we have very fast apps running on top of the above tables.

i wouldn;t say the above is an antipattern. i think the problem here is ignorance of techniques in sql.

thx,
ks

"My short answer is: if you can't describe your da...

2008-04-04T13:04:00.000+01:00

"My short answer is: if you can't describe your data, then SQL isn't the right tool to query it."

Well, in this case, the data can be perfectly described, just not at the same time as the database schema was designed.

Whether it is "highly normalized" or "not normalized at all" is a matter of perspective.

From the point of view of the developers of the original application, that could not see whatever custom attributes the users would be adding, this EAV pattern is a normalized design to solve the problem of custom attributes. The could also have created a bunch of "spare" or "extra" columns in the existing tables which would depending on what they would be used for hardly be better normalized.

On another level, it is more arbitrary than it may seem whether something should be considered to be unnormalized or 1st normal form.

For example, take phone numbers. In some cases, we don't mind the phone number having a structure of it's own and we choose to regard it as an atomic value. In other cases, we want to separately store and search parts like country and area codes. Similar cases are URLs and email addresses.

For multivalued attributes, we can witness the same. Sometimes it might be necessary to store someones initials. Although we know that each letter stands for a proper name, we will usually choose to treat it as an atomic value.

In the end, the requirements of the application will govern what will be considered an atomic value. There is no objective generic criterion that can tell us that.

Yes, adding an extra table (or tables) to duplicat...

2008-04-04T04:29:00.000+01:00

Yes, adding an extra table (or tables) to duplicate the data is an important first step.

The next step is to DROP the EAV table! :-)

You shouldn't describe EAV as "highly normalized." In fact, the opposite is true - EAV is not a normalized design at all.

It's not even in 1st normal form, because it uses the attr_value column for all attributes in the relation. In a normalized design, each logical attribute should be stored in a distinct column.

@bill karwinI saw that you will be speaking in MyS...

2008-04-04T02:22:00.000+01:00

@bill karwin
I saw that you will be speaking in MySQL conference and I have to say that your topic seemed the most interesting to me :)
So please link the audio/video/slide files somewhere so I can see afterwards.

Regarding the design of this database, I just call it "highly normalized". The people who made the database had to factor in many different circumstances, locations in which it could be deployed and languages that it could be used with. So they opted to do it this way.

My main problem is that if you want to produce reports of some kind, its very very slow.

I worked somwhere else where they had very much the same problem and it was my task to make it go faster.

So I suggest just having an extra table (or atleast view) that duplicates the data from many smaller tables and integrates them.

On topic, to do this in java would probably taken 10x more lines to develop.

My short answer is: if you can't describe your da...

2008-04-04T02:18:00.000+01:00

My short answer is: if you can't describe your data, then SQL isn't the right tool to query it.

Sometimes you need semi-structured data, or you need the capability of adding attributes without changing the application code or the database metadata.

In those cases, you are asking for a greater amount of flexibility than SQL was designed to provide. Every programming task makes certain assumptions to achieve simplicity, and SQL makes the assumption that a given entity has a stable set of attributes.

EAV breaks this assumption -- it's designed to accommodate dynamic attributes. But it should be no surprise that the resulting SQL isn't very graceful.

There are emerging standards that treat data and metadata as virtually the same thing, and allow both to be dynamic and queriable. RDF/XML is one example, and there's even a standard query language for it -- SPARQL.

I'd say we're 10+ years years away from having de facto SPARQL interfaces as common as SQL interfaces, and RDF data stores that are as efficient as RDBMS products are today.

In the meantime, EAV can be used for semi-structured data. Though we must recognize it comes at a high cost, and perhaps in the majority of cases where it is used, it'd be simpler to add attributes with ALTER TABLE ADD COLUMN.

Other options include a hybrid between conventional and EAV tables. Store common attributes in the conventional way as columns, but allow extended attributes to be stored in an EAV table referencing the conventional table. At least this allows some queries to avoid the EAV table, if the query only needs to reference the base set of attributes.

Another option is to use "subtables" that extend a parent table, just like an object-oriented derived class adds properties and methods to those inherited from its parent class. It takes a join to reconstitute the full row, but it's better than EAV which requires a join per attribute.

All the solutions naturally add complexity, and therefore you should evaluate them on a case-by-case basis.

Hi!yes - definitely an Enity Attribute (anti)patte...

2008-04-03T21:18:00.000+01:00

Hi!

yes - definitely an Enity Attribute (anti)pattern.

This design should be shunned because of how cumbersome and opaque it is to code against it. All the ID's in the join clause to pick the right attributes are a tell-tale.

But still...what if you have an application like MS Project: it lets you add custom properties to tasks, resources, etc. And, I've done a few projects where we really needed that to calculate custom performance indicators for project progress. Under the hood, MS Project has a entity value module in the database schema to implement this, and indeed - you get these typical queries, which sucks.

But what about the alternatives? Surely, MS Project could just add a new column for a custom property, but the problem is that there are different types of projects and we'd end up with many columns that are basically useless to most types of projects. So that does not seem a good solution. Alternatively, a separate table could be created to hold the custom properties for a single project or group of projects of a certain type.

The disadvantage of modifying the schema to accomodate custom properties would mean that all users that save the project in the db would need to have the privileges to ALTER or CREATE tables. After that, access to those new objects must also be granted to all other users that might need to read those projects.

Not at all impossible, but still, application users with the rights to create database objects....

Another option would be to indeed store all those custom things in some kind of semi-structured format such as XML. But would that really make it better? I seriously doubt it - esp. once we need to combine date stored in custom properties with other data in the db, I think I'd stil rather settle for an entity-attribute approach - rather that than springling XQuery / XPath constructs in my queries to get it done - I mean, it seems a cure that's worse than the disease.

So - what are your thoughts? How to deal with partially extensible datamodels/ structures in relational databases? Should we just give up, or put up with a number of these annoying edges? I really don't know...

LOL! Yes, SQL is a programming language. And lik...

2008-04-03T19:22:00.000+01:00

LOL! Yes, SQL is a programming language. And like most programming languages, there is a graceful way to do design and architecture and then there is the way that I call "blecherous" (it makes you say "blechh!").

As soon as I saw the fourth field in your select-list I knew what was up:

lastname.CONATTR_VALUE

I feel sure this is an instance of a common SQL design blunder called Entity-Attribute-Value. It's not your fault they designed the database this way, but now you can see how difficult it makes your job using that design.

People who design their database with Entity-Attribute-Value practically don't need a SQL database at all -- they need an XML file.

I'm currently writing a book called "SQL Antipatterns" and EAV is high on the list.

I'm also presenting "SQL Antipatterns" as a tutorial at the MySQL Conference on April 14.
http://en.oreilly.com/mysql2008/public/schedule/detail/1639

More details here:
http://karwin.blogspot.com/2008/03/speaking-at-mysql-conference.html