Humane email validation

Saturday 22 August 2009

My recent work on a consumer-facing product brought up the old problem: how do you validate an email address before using it? There's a classic tension here between developers who want to keep typos from floundering around in the system, giving users feedback as soon as it looks like they've made a mistake, and developers who want to be sure that any valid email address can be used.

The usual advice on this matter is to not bother with validation, because it's a fool's errand. Instead, simply send an email to the address with a confirmation link. If the user clicks the link, then the address must have been valid.

I don't like this advice because the vast, vast majority of email addresses do validate with a simple regex, and the vast, vast majority of failures against the regex represent real mistakes, not obscure but valid email addresses. Catching user mistakes early is a good thing. Having the user wait for an email that will never come, then go back to enter their email address again is a pain.

This is the regex I used:

/^[^@ ]+@[^@ ]+\.[^@ ]+$/

In other words, an email address has to have stuff, at-sign, stuff, dot, stuff. The stuff can have dots in it, but can't have at-signs or spaces. And by the way, before matching against the regex, trim whitespace from the ends of the address.
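Here's a minimal Python sketch of that check, assuming the regex is used unchanged. The names EMAIL_RE and looks_like_email are made up for illustration.

```python
import re

# The regex from the post: stuff, at-sign, stuff, dot, stuff.
EMAIL_RE = re.compile(r"^[^@ ]+@[^@ ]+\.[^@ ]+$")


def looks_like_email(raw: str) -> bool:
    """Return True if the trimmed input passes the simple check."""
    # Trim whitespace from the ends before matching, as described above.
    return EMAIL_RE.match(raw.strip()) is not None


print(looks_like_email("  ned@example.com  "))
print(looks_like_email("ned @example.com"))
```

The first call should pass because the surrounding spaces are trimmed first; the second should fail because of the space before the at-sign.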

As a gesture of reconciliation with the purists, I propose this: check the user-entered email address against this regex. If it matches, it's valid. If it doesn't match, show the user an "invalid email address" error box that has two buttons: "Fix mistake" which lets the user re-enter an email address, and "Use it anyway" which takes the email address as-is even though it failed the match.

This is the best of both worlds, since the common case of a catchable typo in an email address will force the user to double-check their entry, but any address can be used if the user knows what they are doing. Most users will never see the error box, since they'll enter their address correctly.
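The decision logic behind the two buttons could be sketched like this in Python. It's only an illustration: Outcome, check_email, and the use_anyway flag are hypothetical names, and the actual error box is left to whatever UI you have.

```python
import re
from enum import Enum

EMAIL_RE = re.compile(r"^[^@ ]+@[^@ ]+\.[^@ ]+$")


class Outcome(Enum):
    ACCEPT = "accept"      # go ahead and use the address
    ASK_USER = "ask_user"  # show "Fix mistake" / "Use it anyway"


def check_email(raw: str, use_anyway: bool = False) -> tuple[Outcome, str]:
    """Decide what to do with a user-entered address.

    use_anyway is True only when the user has already seen the error
    box and clicked "Use it anyway".
    """
    address = raw.strip()
    if EMAIL_RE.match(address) or use_anyway:
        return Outcome.ACCEPT, address
    return Outcome.ASK_USER, address
```

On the first submission the caller passes only the raw text. If the result is ASK_USER and the user picks "Use it anyway", the caller submits the same text again with use_anyway=True, and the address is taken as-is (apart from trimming).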

I've never seen a workflow like this, but it seems like a really simple solution to the problem. Is there something I'm overlooking? Is it too geeky?

Comments

Jordan Sherer 8:39 AM on 22 Aug 2009

I think your proposed solution is an elegant one. Your simple regex solves two problems:

1. Thwarts user error when entering email
2. Reduces (re)development when email address syntax changes (if it ever does)

I know I've seen a number of email validation regexes which limit the TLD to between 2 and 4 characters. While most TLDs are in this range (.uk, .com, .name), there are some that are not (.museum or .travel). It is naive of us to think that these "limits" are constant (ICANN says TLDs can be anything with 2 or more chars). Your solution doesn't care about limits. Like you said:

stuff = anything but at and space
stuff + at + stuff + dot + stuff

sil 9:28 AM on 22 Aug 2009

When I become king of the world I'm going to run a top level MX server and be sil@com just to annoy everyone who does email validation :-)

Braden 10:03 AM on 22 Aug 2009

Amazon uses that workflow for mailing addresses.

Edward Grefenstette 10:46 AM on 22 Aug 2009

I like the simplicity and elegance of your solution. However, you may also want to catch all whitespace in the address candidate, since your regex would recognise "john.doe @gmail.com" where " " is a tab character. Your regex can be modified simply by substituting the POSIX character class "[:space:]" for " " in each of your bracket expressions, like so:
/^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$/
Hope this helps!

Peter Bengtsson 11:39 AM on 22 Aug 2009

@Edward: Does :space: include \r and \n? Isn't \s available in most regex engines?

Edward Grefenstette 5:56 PM on 22 Aug 2009

@ Peter: '[:space:]' is the POSIX character class, and it does include \r and \n. '\s' is the Perl shorthand for it, which found its way into Python and several other engines. They match the same ASCII whitespace characters, but not every engine takes both forms (Python's re module, for one, accepts only '\s').
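For anyone who wants to try the whitespace-aware variant, here is a small Python sketch. It uses \s because Python's re module does not understand the POSIX bracket class. The names EMAIL_WS_RE and matches are invented for the example.

```python
import re

# Same shape as the post's regex, but excluding all whitespace
# (tabs, \r, \n, ...) rather than only the space character.
EMAIL_WS_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def matches(address: str) -> bool:
    # fullmatch is used because "$" alone would still accept a single
    # trailing newline in Python.
    return EMAIL_WS_RE.fullmatch(address) is not None
```

With this version a tab before the at-sign is rejected, as are stray \r or \n characters anywhere in the address.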

Jess Robinson 1:40 AM on 24 Aug 2009

Any particular reason you disallow spaces before the @? That's actually valid.

Matt Jones 5:06 AM on 24 Aug 2009

@Jess:
Spaces in the local-part don't appear to be allowed unless the local-part is quoted. (See RFC 5322: the local-part is either a dot-atom or a quoted-string.)

Ned Batchelder 6:35 AM on 24 Aug 2009

@Jess: Remember, I'm not trying to strictly separate email addresses into valid and invalid. I'm aiming for a more common-man idea of what email addresses actually look like.

For example, on Facebook, where a user enters their email address, how often does a space appear before the @ as a valid address, compared to how often a space appears as a typo? I'm certain the typo wins out at least 100-to-1, probably more like 1000-to-1.

jlecour 7:02 AM on 24 Aug 2009

Hi,

I once worked on a website where a large number of very "basic" users could create an account.
I used a quite complicated validator AND the receive-an-email-and-click-the-link process.

I've lost count of how often people typed well-formed e-mail addresses that had typos in them. They never received the e-mail and complained that it wasn't working. If they hadn't been told to expect a confirmation e-mail, they wouldn't even have known there was a problem, and the whole user-website relationship would have been nil.

Anyway, I'll try this solution in a real website to see it in a real case.

Mark Dodwell 8:17 AM on 24 Aug 2009

I would say that it's likely to be closer to 1 in 1,000,000 for people to have a space in their email address. Most email clients would probably choke on it anyway, so it'd be fairly useless! Also, part of me thinks that if people have a space in their email address they just don't deserve to sign up anyway :)

David Björkevik 10:34 AM on 24 Aug 2009

I like your solution, except for the "error box" part. I like the GUI trend of using fewer dialog popups and instead displaying information to the user in a non-invasive way.

In this case: just display a text next to the email entry saying that the email address looks invalid.

Walter Smith 2:02 PM on 24 Aug 2009

I would add a [^@ .] before the @. (So something like /^[^@ ]*[^@ .]@ ...) I don't know why, but on our sites we see a lot of periods at the end of user names, which is guaranteed invalid. It's instantly rejected by Postfix, but it's nicer to catch it client-side. Also, trim whitespace from beginning and end before validating.
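A rough Python sketch of this tweak follows. The comment above elides the rest of the pattern, so this assumes the remainder is simply the post's original regex. NO_TRAILING_DOT_RE and ok are made-up names.

```python
import re

# Assumption: everything after the "@" is taken from the post's regex.
# The character just before the "@" may not be a dot, a space, or "@".
NO_TRAILING_DOT_RE = re.compile(r"^[^@ ]*[^@ .]@[^@ ]+\.[^@ ]+$")


def ok(raw: str) -> bool:
    # Trim whitespace from both ends before validating.
    return NO_TRAILING_DOT_RE.match(raw.strip()) is not None
```

Dots elsewhere in the user name are still fine; only a dot immediately before the at-sign is caught.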

Josh 8:02 PM on 24 Aug 2009

In a form with multiple fields, I like David's solution of flagging the error in context as you advance to the next field. But if the email address is the only text field, Ned's popup is nicer than the "it's wrong until you type it right" validation I see in some contexts.

Justin 6:26 PM on 29 Aug 2009

Here's a better regex. Grabbed from the php.net docs:
^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$

Jordan Sherer 7:05 PM on 29 Aug 2009

@Justin: Sadly that isn't "better". It fails to validate for any top level domain over three characters, so these are automatically out:

me@jordan.name
or
you@eu.travel

Building on that, it's a simple fix. Just remove the three from the last limit:
^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,})$

But of course, this would fail to validate if ICANN started to allow numbers, underscores (_), or any other special non-word character in the TLD.
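To see the difference concretely, here is a short Python comparison of the two patterns quoted in this thread against the post's simple regex. The variable names are invented, and the "first+tag" address is just an illustrative input.

```python
import re

# The php.net-style pattern quoted above: TLD limited to 2 or 3 letters.
STRICT_RE = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$"
)
# The same pattern with the upper limit on the TLD removed.
LOOSER_RE = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,})$"
)
# The simple regex from the post.
SIMPLE_RE = re.compile(r"^[^@ ]+@[^@ ]+\.[^@ ]+$")

for address in ["me@example.com", "me@jordan.name", "first+tag@example.com"]:
    print(
        address,
        bool(STRICT_RE.match(address)),
        bool(LOOSER_RE.match(address)),
        bool(SIMPLE_RE.match(address)),
    )
```

The longer TLD trips only the strict pattern, while a plus sign in the user name trips both of the detailed patterns. The simple regex accepts all three.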

Ned Batchelder 7:29 PM on 29 Aug 2009

@Justin and @Jordan: thanks, but I'm afraid you're both missing the point. You can make the regex arbitrarily more complex to get asymptotically closer to the RFC-defined syntax for an email, but it will never be perfect, and it doesn't have to be.

The whole point of this post is that we can neatly side-step the problem by using a simple regex that accounts for 99.9% of the real emails in use, and give the remaining 0.1% a button that says, "I know better than your stupid regex does, just let me use the email address".

Jordan Sherer 8:20 PM on 29 Aug 2009

Agreed.

Chris Babcock 12:44 PM on 1 Sep 2009

I use a simple regex for validation, but on the server side I do a DNS lookup on the domain part. I am using the TRE regex library and have no interest in testing some of the pathological Posix and PCRE regexps for use in my environment.
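A server-side domain check along these lines might look like the following Python sketch. It is a simplification, not the setup described above: it uses the standard library's socket.gethostbyname, which looks up an address record rather than MX records, so a mail-only domain could be wrongly flagged. The function name and the resolve parameter are hypothetical. The parameter is there so the lookup can be swapped out or faked in tests.

```python
import socket
from typing import Callable


def domain_resolves(
    address: str,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> bool:
    """Return True if the part after the last "@" resolves in DNS."""
    _local, at, domain = address.strip().rpartition("@")
    if not at or not domain:
        return False  # no at-sign, or nothing after it
    try:
        resolve(domain)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError covers
        # domains that cannot be encoded for lookup.
        return False
    return True
```

A failed lookup is probably best treated like a failed regex match: a reason to ask the user to double-check, not proof that the address is wrong.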
