26 Mar 2008

Yet Another Idea for GSoC - New Datatypes

Well, as most of you may know, Google Summer of Code is nearing and I have seen many
MySQL blogs urging people to contribute ideas or mentor projects.


My Idea
So I'm going to suggest an idea and as the title already mentions, I think it would be a good idea to add some new Datatypes. Specifically, a new Email datatype.

Now I already hear you saying "Email is a string, why would you need a datatype for it?", but hear me out.

So basically, I personally, have been dealing a lot with trying to clean email addresses and customer data so that my company can help their biggest client use this data in their new CRM system.

I thought it would have been a good idea to implement an email datatype inside MySQL that already has the standard regex for emails and has a defined varchar for the standard of emails.
Later, I became aware that there can be different types of emails addresses.
For example, you can have the standard latin a-z 0-9 with . and _ in them.
You can have an email with the european a u o with the 2 dots above them.
You can also have emails with characters like / in them for example 24/7@domain.com (someone that posted on my blog mentioned that one).

To solve this, you could have an Email datatype like this:
Email(latin)
Email(german)
Email(utf-8)


The Benefits
So what would be the benefits of using this kind of datatype?

  • Its out of the box solution for a common problem.
    • MySQL is used by many many websites and most of them record email addresses, so its relevant to them.
    • You wouldn't need to develop something yourself, go look at MySQL forge for snippets, download and use any libarys, even bother to read about it. Its just there provided for you in the database.
  • It helps improve data quality. After you have many tens of thousands of emails, it would have been nice to have more data quality. Thats not to say that you can develop some solution on the front end with Javascript for example, but it would still be nice to just have it in the database.
  • You wouldn't need to write webinars telling people how to optimize or shrink their email field in the database. Maybe even you can store it better on the harddisk for queries if you know it will be used for email address, who knows.

Downside
The downside would be, that you will get a database error if you didn't implement any checks when the user is entering their email address on their website. For this you might want to make your email field a string and worry about it later. But I assure you, later will be for more annoying.


More Options
Now, you can also say "Well, you can also have a datatype for phone". Well, you can.
Maybe you can have a phone datatype like this:
phone(US)
phone(world)

Depends what you want, but I'd like to start with email addresses. Maybe you can throw in there ip(raw), ip(full)...
I saw a webiner from Jay Pipes writing about that to improve query lookup speed specifically for ip addresses.


Conclusion
I think it will be useful for many MySQL users to have an out-of-the-box solution for a very generic problem in the form of new datatypes that come with their own validation rules.

Please write your thoughts in the comments. Your opinion is valued.


Thank you for reading my blog.

3 comments:

  1. Personally I'm all for the a data type GSoc project. But I'd rather see something like "pluggable domains"

    (domain vs data type: a data type has at least in mysql a ring of a particular storage requirement to it. You know - a tinyint is 1 byte, an in is 4 bytes etc. On the other hand data types have some semantics, such as the date/time types.)

    What you seem to want to achieve is not so much another type of storage for email addresses, you just want a nice way to say, "this column holds business specific type x and should be validated as such"
    ANSI SQL has a so-called concept of a domain, which is basically a named data base object that refers to a data type (or another domain) and adds semantics to it (like the name and/or constraints)

    So I'd rather see that then yet another primitive data type.

    As for your email checking woes: well, I think you should've looked at the specs. The syntax for an email address is defined in one of the RFCs. A quick google lead me to believe 822 is the actual RFC but there maybe newer RFC's that obsolete it.
    The "standard regex" you refer to looks nice, but according to
    http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
    it doesn't quite cut it. So I guess this is a case of youre mileage may vary.

    For the phone numbers, that's a similar case. I would not fancy a hard-wired data type implementation in the server code base only to find out that the phone system is due to change in a couple of years.

    just my 2ct

    ReplyDelete
  2. I agree with Roland, that's something I'd like to see pluggable. There can and should also be some way to define allowable operations -- for instance, an "age" field would never be incremented by more than 1 at a time, but a "salary" field could be. It's a small (perhaps bad) example, but different *domain* uses of the same data type lend themselves to different operations.

    ReplyDelete
  3. I usually tend to sanitize the input before it gets into the database. Your "email" data type also seems better suited for a check constraint (which could be in turn made complex enough to allow for very complex checks).

    ReplyDelete