Redirects

Sunday 20 January 2008

This is close to 17 years old. Be careful.

When I recently re-built this site, some pages moved. For example, long-ish blog posts used to get a separate page with a name based on a time stamp, but now all blog posts have their own page with a slugged URL, so those old numeric URLs are obsolete.

Rather than abandon those URLs, though, I wanted to make sure that inbound links to my site weren’t broken. To do this, I created a bunch of redirects. As I’m sure I don’t need to explain, a redirect is when a web server returns not an HTML page, but a pointer to another URL that the browser should automatically load instead. The HTTP status code returned with the redirect indicates more about why the redirect is happening. In my case, I want a 301: Moved Permanently.

To implement the redirects, I used two technologies: Apache’s mod_rewrite module, and PHP.

Mod_rewrite is a classic open-source tool: immensely powerful, and powerfully cryptic. I don’t understand most of it, but there are tutorials out there that helped a lot (one, two, three).

To program mod_rewrite, you add directives to an .htaccess file in your site’s directory tree (they can also go in the main Apache configuration).

The simplest change I made is that there is no longer a blog/rssfull.xml feed, so I wanted clients requesting it to be redirected to the blog/rss.xml feed instead:

RewriteEngine on 
RewriteRule ^blog/rssfull\.xml /blog/rss.xml [R=301,L]

The RewriteRule line does the work. It has three arguments: a regular expression for the URL pattern to match, a result to rewrite it to, and flags. In this case, the regex is simple: we need only ^ to anchor it at the beginning and a backslash to escape the dot, but other than that, it’s just the URL we’re talking about. (In an .htaccess file the leading slash is stripped before matching, which is why the pattern starts with blog/ rather than /blog/.) The result is the URL we want to rewrite to. The flags are what make this a redirect rather than an internal rewrite. R=301 means don’t serve the result URL, but redirect to it, and use a 301 status code. The L flag means this is the last rewrite to apply, so don’t bother reading the rest of the rewrite rules.
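To make the role of the flags concrete, here is a small sketch contrasting the rule with two variations. The commented-out lines are illustrations only, not rules from my configuration.

```apache
# Redirect: the client receives a 301 and requests /blog/rss.xml itself.
RewriteRule ^blog/rssfull\.xml /blog/rss.xml [R=301,L]

# Illustration only: a bare R flag redirects with the default 302 status.
# RewriteRule ^blog/rssfull\.xml /blog/rss.xml [R,L]

# Illustration only: with no R flag this is an internal rewrite. Apache
# serves /blog/rss.xml under the old URL and the client never learns
# that the feed has moved.
# RewriteRule ^blog/rssfull\.xml /blog/rss.xml [L]
```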

I also moved where my tag pages go. What used to be /blog/tag_books.html is now /blog/tag/books.html. Another rule handles these rewrites:

RewriteRule ^blog/tag_([-a-z0-9]+)\.html$ /blog/tag/$1.html [R=301,L]

Here we’re using more regex power to match a set of URLs, and the result includes $1 to pull in part of the match from the pattern. The result is the same, though: a 301 redirect so that the correct URL is served.
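Patterns can capture more than one group, and the groups are numbered left to right as $1, $2, and so on. A sketch with a made-up URL layout (these archive paths are hypothetical, not ones this site used):

```apache
# Hypothetical: /blog/archive_2008_01.html -> /blog/archive/2008/01.html
# $1 is the four-digit year, $2 is the two-digit month.
RewriteRule ^blog/archive_([0-9]{4})_([0-9]{2})\.html$ /blog/archive/$1/$2.html [R=301,L]
```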

One other change I made recently was to canonicalize the host name for my site. Rather than have some references be www.nedbatchelder.com and others simply nedbatchelder.com, I want everyone to use the simpler version. I’m using a mod_rewrite rule to redirect any request with a subdomain (even www) to the plain host name:

# canonicalize the host name (no prefix)
RewriteCond %{HTTP_HOST} \.nedbatchelder\.com$ [NC]
RewriteRule .? http://nedbatchelder.com%{REQUEST_URI} [R=301,L]

Here I use another mod_rewrite directive: RewriteCond. This creates a condition which must be true for the next RewriteRule to match. Here the condition is that the host name must end with “.nedbatchelder.com” (NC means no case, a case-insensitive match). So any dotted prefix will trigger the rewrite rule, which uses .? as a pattern to match all URLs. The result uses %{REQUEST_URI} to redirect the browser to the same URI, but at the simpler hostname.
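The condition can be tightened if only the www prefix should be folded into the plain host name. The following is a sketch of that variation, under the assumption that other subdomains should be left alone:

```apache
# Variation (assumption: redirect only www, leave other subdomains untouched)
RewriteCond %{HTTP_HOST} ^www\.nedbatchelder\.com$ [NC]
RewriteRule .? http://nedbatchelder.com%{REQUEST_URI} [R=301,L]
```

Anchoring the condition at both ends means a host such as static.nedbatchelder.com would no longer match, so requests to it would pass through unchanged.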

The last redirects I needed were for those longish blog pages. Here I couldn’t use a single mod_rewrite rule to do all the work, because there’s no pattern to the rewrites. /blog/20030616T212329.html had to become /blog/200306/paul_rands_geometry_books.html. Since there are about 130 of these posts, I could have added 130 lines to .htaccess, but that seemed excessive, and in my setup, it’s a hand-edited file, not generated, so it would have been a pain.
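For comparison, the one-line-per-post approach would probably have looked something like this sketch, which reuses the example URLs above:

```apache
# One explicit rule per moved post (a sketch of the approach not taken)
RewriteRule ^blog/20030616T212329\.html$ /blog/200306/paul_rands_geometry_books.html [R=301,L]
# ... and one more line like it for each of the other moved posts ...
```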

Instead, I used mod_rewrite to rewrite the incoming URL to a generated PHP file that had a table of redirects:

RewriteRule ^blog/200[0-9]{5}T[0-9]{6}\.html /blog/moved.php

This was possible because the old URLs fit a pattern: the URL was created from an ISO 8601 timestamp, so I could write a regex to match it, in this case a specific number of digits with a T in the middle. The rewrite this time is to /blog/moved.php, and because there is no R flag, it isn’t a redirect: Apache executes that file directly to come up with a response to the browser.

The moved.php file looks like this:

<?php
// Redirector for old extended blog entries to their new locations.

$oldnew = array(
    "/blog/20050908T142058.html" => "/blog/200509/amazon_sales_stats_grabber.html",
    "/blog/20071216T110049.html" => "/blog/200712/ancient_history_the_digital_logo.html",
    // ... omitting about 125 entries ...
    "/blog/20040716T065847.html" => "/blog/200407/windows_themes.html",
    "/blog/20041117T084310.html" => "/blog/200411/xml_schema_for_nonxml_data.html"
);

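// Look up the requested path, ignoring any query string.
$path = parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH);
// Old URLs missing from the table fall back to the blog index.
$redir = isset($oldnew[$path]) ? $oldnew[$path] : "/blog/index.html";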

header("Location: $redir",TRUE,301);
?>

An associative array is used to map the incoming URL to the new URL. If nothing is found, default to the blog index page. Finally, use the PHP header() function to redirect the browser with a 301 status code.

After all of this, what have we got?

  • I’ve avoided breaking inbound links. As I was working on this redesign, I went back through some old notes of ideas to try, and about half the interesting URLs I tried were broken. I was sad.
  • My URLs are more canonical. This helps with search engines and link aggregators, to make sure I don’t get two entries when one will do.

Comments

[gravatar]
I don't know what particular dialect of regexes is supported by mod_rewrite, but couldn't
RewriteRule ^blog/200[0-9][0-9][0-9][0-9][0-9]T[0-9][0-9][0-9][0-9][0-9][0-9]\.html /blog/moved.php
be simplified to something like
RewriteRule ^blog/200\d{5}T\d{6}\.html /blog/moved.php
(Add backslashes to taste)
[gravatar]
George, it's possible it will work. The mod_rewrite docs don't mention the curly bracket syntax, but they say they are POSIX regexes. I've found a few pages claiming to document POSIX regex syntax. Some mention the curly braces, some don't. I went with a lowest-common-denominator approach.
[gravatar]
Both \d (equivalent to [0-9]) and {m,n} (at least 'm' repetitions, at most 'n') are available in most regex engines now. They tend to be more efficient, and I find them more legible.

http://www.ilovejackdaniels.com/cheat-sheets/mod_rewrite-cheat-sheet/ indicates that mod_rewrite supports the {m,n} syntax.
[gravatar]
No need to use PHP at all since Apache can handle redirects on its own:

http://httpd.apache.org/docs/2.2/mod/mod_alias.html#redirect

With some clever searching and replacing you could probably convert your huge associative array in PHP to a .htaccess file with Redirects.

Not sure it's worth the trouble though.
[gravatar]
George: that is a great cheat sheet, and I have changed the line to use the {5} syntax (\d still isn't mentioned, but [0-9] is no hardship, and is more familiar to me anyway).
[gravatar]
Michael: you are right, Apache could do this itself, I alluded to as much in the post. Maybe there's no need, but it just didn't feel right to have 130 entries of specific redirects in the .htaccess file.
[gravatar]
When I had to do this, I used RewriteMap. It takes a tab-delimited file containing all of the redirects. Much simpler than writing PHP.
[gravatar]
Roger: good point. I'd never noticed RewriteMap before. It's tailor-made for the problem I had to solve!
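
For anyone curious, here is a rough sketch of how the RewriteMap approach might look. The map name oldblog and the file path are made up for illustration. Note that the RewriteMap directive itself has to live in the server or virtual host configuration, not in .htaccess, although a map defined there can be used by rules elsewhere.

```apache
# Server or virtual host configuration (RewriteMap is not allowed in .htaccess).
# The map name "oldblog" and the file path are hypothetical.
RewriteMap oldblog txt:/path/to/blog_redirects.txt

# The text map holds one "key value" pair per line, separated by whitespace:
#   20030616T212329 /blog/200306/paul_rands_geometry_books.html

# Look up the captured timestamp; keys missing from the map fall back
# to the blog index, like the default in moved.php.
RewriteRule ^/?blog/(200[0-9]{5}T[0-9]{6})\.html$ ${oldblog:$1|/blog/index.html} [R=301,L]
```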
[gravatar]
Just playing devil's advocate, bear with me here...

Other than being "correct" with the 301, what's the point of a permanent redirect? If you truly want your links to never be broken, you have to leave your redirect there. If the redirect is always there, it wouldn't have to give a 301.

I agree that it's the right thing to do, and I would do the same. Maybe I'm just thinking out loud here :) Can anyone chime in? Is there something I'm missing?
[gravatar]
The 301 doesn't help the browser: it will do exactly the same thing with any kind of redirect. But other consumers of the site will appreciate it: news readers, search engines, and the like. For example, if Google has been indexing one of the old pages, when it gets the 301 redirect, it can replace the old URL in its database with the new one. And Bloglines can update people who had subscribed to /blog/rssfull.xml to change their subscriptions to point to /blog/rss.xml.
[gravatar]
Ned, excellent article!

Although I'm trying to do something a little different; is there a way to combine a redirect with a rewrite?

1. Link comes in that's of the old website address: http://www.oldwebaddress.com/pages/products/buffers/pages/2-2.html

2. Match against pattern: pages/products/buffers/pages/[0-9]{2}-[0-9]{2}.html

3. Redirect to page: /range/product-index/gmt-product-index.php#bearings-bushes-1

Is that possible?
