Email validation again

Saturday 7 January 2006

This is nearly 19 years old. Be careful.

OK, so I’m a liar.

On Thursday, I updated my email validation code in response to a problem a reader was having with it. In the comments, as I expected, I was chastised for excluding some valid but unlikely addresses. I explained why I was not going to support those addresses. Then I went ahead and supported them.

Here’s what I said to defend not supporting addresses with spaces in them:

I know the value of following a standard to the letter, and how it improves interoperability, and so on. I also know the value of spending time on the things that will truly make a difference. When people have complained here about actual email addresses that didn’t work, I fixed the validation. I’m a little interested in supporting esoteric forms, but not much.

As it turns out, I am interested enough in supporting esoteric forms that I went ahead and did it. I guess once the idea of doing a better job was planted by Rik and Ben, I couldn’t resist. It was like an Everest to climb because it was there. So now quoted email addresses and escaped characters are accepted. Here’s the current code:

/**
 * Match an email address.  Return 1 if it's a syntactically valid
 * address, 0 otherwise. The $matches output parameter is an array
 * with 'local' and 'domain' entries if the address was valid.
 */
function MatchEmail($email, &$matches)
{
    $atomchar_re = "[a-z0-9!#$%&'*+\\/=?^_`{|}~-]";
    $escape_re = "(\\\\.)";
    $word_re = "(" . $atomchar_re . "|" . $escape_re . ")+";
    $local_re = $word_re . "(\\." . $word_re . ")*";
    $email_re =
        "/^" .
        "(?P<local>" .
            // Local part is atoms or escaped chars, separated by dots
            $local_re . "|" .
            // or a quoted string
            "\"[^\"]*\"" .
        ")" .
        // Need an at-sign!
        "@" .
        "(?P<domain>" .
            // Domain is anything ending with a dot and 2-4 letters.
            "([a-z0-9.-]+)(\\.[a-z]{2,4})" .
        ")" .
        "$/i";

    // Try to match the email address
    return preg_match($email_re, $email, $matches);
}

/**
 * Determine (as best as possible) whether the email address is valid.
 */
function IsValidEmail($email)
{
    // Presume that the email is invalid
    $valid = 0;

    // Validate the syntax
    if (MatchEmail($email, $matches)) {
        if (function_exists("getmxrr")) {
            $domaintld = $matches['domain'];
            while (substr_count($domaintld, ".") > 0) {
                // Validate the domain
                if (getmxrr($domaintld, $mxrecords)) {
                    $valid = 1;
                    break;
                }

                // Didn't find an MX record.
                // If we have a subdomain, move up the hierarchy.
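                // explode() splits on a literal dot; split() would treat
                // "." as a regular expression matching any character.
                list($dummy, $domaintld) = explode(".", $domaintld, 2);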
            }
        }
        else {
            // Couldn't check the domain with getmxrr, assume the best.
            $valid = 1;
        }
    }
    else {
        $valid = 0;
    }

    return $valid;
}
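
The MX lookup loop is easier to follow when the domain walk is separated from the DNS call. Here is a minimal sketch of just the walk, with no network access. CandidateDomains is a hypothetical helper, not part of the code above. It uses explode() to drop the leftmost label each time, and it stops once no dot is left, the same condition the loop in IsValidEmail uses:

```php
<?php
/**
 * Hypothetical helper (not in the original code): list the domains an
 * MX lookup would try, from the full domain up to the last two labels.
 */
function CandidateDomains($domain)
{
    $candidates = array();
    while (substr_count($domain, ".") > 0) {
        $candidates[] = $domain;
        // Split on a literal dot and keep everything after the first label.
        $parts = explode(".", $domain, 2);
        $domain = $parts[1];
    }
    return $candidates;
}

// Prints mail.sub.example.com, sub.example.com, example.com, one per line.
foreach (CandidateDomains("mail.sub.example.com") as $candidate) {
    echo $candidate . "\n";
}
```

In IsValidEmail, each of those candidates would be handed to getmxrr in turn, stopping at the first one that has an MX record.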

One of the difficulties in writing code like this is just wading through the dense RFCs that define the syntax. A document pointed to in the comments by Ben Finney was very helpful: RFC 3696 summarizes the rules in English.

Invaluable while making changes to impenetrable regular expressions are unit tests which both prove that the code works properly, and prove that the code still works properly. That is, they serve both as functional tests and regression tests. I wrote some of those too, so I really think this code works:

OK: joe@example.com matches: local is joe, domain is example.com
OK: joe@sub.example.com matches: local is joe, domain is sub.example.com
OK: joe.shmoe@example.com matches: local is joe.shmoe, domain is example.com
OK: joe+shmoe@example.com matches: local is joe+shmoe, domain is example.com
OK: joe.shmoe.hello_there@example.com matches: local is joe.shmoe.hello_there, domain is example.com
OK: joe.@example.com doesn't match.
OK: joe..shmoe@example.com doesn't match.
OK: .joe@example.com doesn't match.
OK: joe doesn't match.
OK: joe@joe@example.com doesn't match.
OK: joe shmoe@example.com doesn't match.
OK: joe\shmoe@example.com matches: local is joe\shmoe, domain is example.com
OK: doesn't match.
OK: @@ doesn't match.
OK: @example.com doesn't match.
OK: joe@ doesn't match.
OK: joe@127.0.0.1 doesn't match.
OK: joe'shmoe@example.com matches: local is joe'shmoe, domain is example.com
OK: joe\ shmoe@example.com matches: local is joe\ shmoe, domain is example.com
OK: joe\@shmoe@example.com matches: local is joe\@shmoe, domain is example.com
OK: "joe shmoe"@example.com matches: local is "joe shmoe", domain is example.com
OK: "joe@shmoe"@example.com matches: local is "joe@shmoe", domain is example.com
OK: ""@example.com matches: local is "", domain is example.com
OK: joe@[72.9.232.138] doesn't match.
OK: joe@joe\@com doesn't match.

25 tests, 0 failures
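
The test script itself isn't shown here. As a rough idea of how output in that shape could be produced, here is a minimal table-driven sketch. The $cases table and the RunEmailTests function are hypothetical names of mine, only a handful of the addresses are included, and MatchEmail is copied from the listing above so the sketch runs on its own:

```php
<?php
// Copy of MatchEmail() from the post, so this sketch is self-contained.
function MatchEmail($email, &$matches)
{
    $atomchar_re = "[a-z0-9!#$%&'*+\\/=?^_`{|}~-]";
    $escape_re = "(\\\\.)";
    $word_re = "(" . $atomchar_re . "|" . $escape_re . ")+";
    $local_re = $word_re . "(\\." . $word_re . ")*";
    $email_re =
        "/^" .
        "(?P<local>" . $local_re . "|" . "\"[^\"]*\"" . ")" .
        "@" .
        "(?P<domain>" . "([a-z0-9.-]+)(\\.[a-z]{2,4})" . ")" .
        "$/i";
    return preg_match($email_re, $email, $matches);
}

// Hypothetical test table: address, expected local part, expected domain.
// A null local part means the address should not match at all.
$cases = array(
    array('joe@example.com', 'joe', 'example.com'),
    array('joe..shmoe@example.com', null, null),
    array('joe shmoe@example.com', null, null),
    array('joe\@shmoe@example.com', 'joe\@shmoe', 'example.com'),
    array('"joe shmoe"@example.com', '"joe shmoe"', 'example.com'),
    array('joe@127.0.0.1', null, null),
    array('', null, null),
);

// Run every case, print one line per case, and return the failure count.
function RunEmailTests($cases)
{
    $failures = 0;
    foreach ($cases as $case) {
        list($email, $local, $domain) = $case;
        $matched = MatchEmail($email, $m);
        if ($local === null) {
            $ok = !$matched;
            $expected = "doesn't match.";
        } else {
            $ok = $matched && $m['local'] === $local && $m['domain'] === $domain;
            $expected = "matches: local is $local, domain is $domain";
        }
        if (!$ok) {
            $failures++;
        }
        echo ($ok ? "OK" : "FAIL") . ": $email $expected\n";
    }
    echo "\n" . count($cases) . " tests, $failures failures\n";
    return $failures;
}

RunEmailTests($cases);
```

Keeping the cases in a table means a newly reported address becomes a one-line addition, and the whole table doubles as the regression suite the next time the regular expression changes.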

By the way: I am not validating these email addresses so that I can be sure mail will be delivered. Unless you ask for email notifications, I never send email to these addresses. I validate them to prevent spam and discourage anonymous comments. And yes, I know lame-o validation is a weak defense.

Any more complaints?

Comments

[gravatar]
No complaint. Just an observation. Email address validation is not just a weak defense. True, it causes spammers to use real-looking email addresses. Not their own addresses, of course, but quite possibly somebody else's real address.

Spammers don't differentiate between sites that just do validation, versus those that actually send an email for confirmation. They just deposit their spam and run. Their goal isn't to get every one of their spams posted. Their goal is to hit enough sites so that if a small percentage of their spams get posted it will get them a few hits. So even though your blog doesn't actually send email for confirmation, requiring a valid address here does influence what spammers will do on blogs that actually do send email for confirmation. If someone's server gets hit with a bogus confirmation email, they don't care. And if some actual user gets hit with a bogus confirmation email, they don't care.

I'm probably a bit more aware of the misuse of real email addresses because mine looks like it could be the address of any of a thousand or more high schools. A lot of kids use my personal address as their "fake" address when posting to web sites. This is my problem, not yours, but my observation is that you really aren't discouraging spammers or anonymous commenters. You're just making them change behavior a bit in a way that, even if it makes them a bit less bothersome to you, has the potential side-effect of making them a little more bothersome to someone else.

-rich
[gravatar]
Rich, that's an interesting point, but frankly I doubt that there are many spammers adjusting their crap to get into my one-off blog. They're spraying their fire-hose regardless. I'm merely deflecting those that don't bother to use real email addresses.
[gravatar]
I should have emphasized that it is a collective "you", not a personal "you". I.e., of course there isn't a gaggle of spammers out there who will change their behavior just because Ned Batchelder validates email addresses, but if the majority of blog comment systems do validate addresses and only a small minority actually implement confirmations, then it is a very good bet that more spammers will use real-looking addresses, some of which will be addresses of real people.
[gravatar]
Personally I like the spam defeating measures here
[gravatar]
Nathan: that style of prevention is clever. Maybe I'll do something like that...
