Sunday 20 January 2008 — This is close to 17 years old. Be careful.
When I recently re-built this site, some pages moved. For example, long-ish blog posts used to get a separate page with a name based on a time stamp, but now all blog posts have their own page with a slugged URL, so those old numeric URLs are obsolete.
Rather than abandon those URLs, though, I wanted to make sure that inbound links to my site weren’t broken. To do this, I created a bunch of redirects. As I’m sure I don’t need to explain, a redirect is when a web server returns not an HTML page, but a pointer to another URL that the browser should automatically load instead. The HTTP status code returned with the redirect indicates more about why the redirect is happening. In my case, I want a 301: Moved Permanently.
To implement the redirects, I used two technologies: Apache’s mod_rewrite module, and PHP.
Mod_rewrite is a classic open-source tool: immensely powerful, and powerfully cryptic. I don’t understand most of it, but there are tutorials out there that helped a lot (one, two, three).
To program mod_rewrite, you add lines to the .htaccess file in your site’s directory on the Apache server.
The simplest change I made is that there is no longer a blog/rssfull.xml feed, so I wanted clients requesting it to be redirected to the blog/rss.xml feed instead:
RewriteEngine on
RewriteRule ^blog/rssfull\.xml /blog/rss.xml [R=301,L]
The RewriteRule line does the work. It has three arguments: a regular expression for the URL pattern to match, a result to rewrite it to, and flags. In this case, the regex is simple: we need only ^ to anchor it at the beginning and a backslash to escape the dot, but other than that, it’s just the URL we’re talking about. The result is the URL we want to rewrite to. The flags are what make this a redirect rather than an internal rewrite. R=301 means don’t serve the result URL, but redirect to it, and use a 301 status code. The L flag means this is the last rewrite to apply, so don’t bother reading the rest of the rewrite rules.
I also moved where my tag pages go. What used to be /blog/tag_books.html is now /blog/tag/books.html. Another rule handles these rewrites:
RewriteRule ^blog/tag_([-a-z0-9]+)\.html$ /blog/tag/$1.html [R=301,L]
Here we’re using more regex power to match a set of URLs, and the result includes $1 to pull in part of the match from the pattern. The result is the same, though: a 301 redirect so that the correct URL is served.
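To watch the capture group work outside Apache, here is a minimal PHP sketch. It assumes PHP's preg functions treat this pattern the same way mod_rewrite does, and the helper name rewrite_tag_url() is hypothetical; it only imitates the rule and plays no part in the actual redirect setup.

```php
<?php
// Sketch only: rewrite_tag_url() is a hypothetical helper that imitates
// the tag RewriteRule. Requires PHP 7.1+ for the nullable return type.
function rewrite_tag_url(string $path): ?string
{
    // In .htaccess, mod_rewrite sees the path without its leading slash.
    $pattern = '#^blog/tag_([-a-z0-9]+)\.html$#';
    if (preg_match($pattern, $path) !== 1) {
        return null; // the rule would not fire for this path
    }
    // $1 in the replacement is the text captured by the parentheses.
    return preg_replace($pattern, '/blog/tag/$1.html', $path);
}

var_dump(rewrite_tag_url('blog/tag_books.html')); // the new URL, /blog/tag/books.html
var_dump(rewrite_tag_url('blog/tag/books.html')); // NULL: already in the new form
```

The second call returns NULL because the new-style URL has a slash where the pattern expects an underscore, so a redirected request does not trigger the rule again.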
One other change I made recently was to canonicalize the host name for my site. Rather than have some references be www.nedbatchelder.com and others simply nedbatchelder.com, I want everyone to use the simpler version. I’m using a mod_rewrite rule to redirect any request with a subdomain (even www) to the plain host name:
# canonicalize the host name (no prefix)
RewriteCond %{HTTP_HOST} \.nedbatchelder\.com$ [NC]
RewriteRule .? http://nedbatchelder.com%{REQUEST_URI} [R=301,L]
Here I use another mod_rewrite directive: RewriteCond. This creates a condition which must be true for the next RewriteRule to match. Here the condition is that the host name must end with “.nedbatchelder.com” (NC means no case, a case-insensitive match). So any dotted prefix will trigger the rewrite rule, which uses .? as a pattern to match all URLs. The result uses %{REQUEST_URI} to redirect the browser to the same URI, but at the simpler hostname.
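A quick way to check which host names the condition catches is to try the same pattern in PHP. This is only a sketch: needs_canonical_redirect() is a hypothetical name, and it assumes the Host header carries no port number.

```php
<?php
// Sketch only: needs_canonical_redirect() is a hypothetical helper that
// applies the RewriteCond pattern to a host name.
function needs_canonical_redirect(string $host): bool
{
    // Same pattern as the RewriteCond; the i modifier plays the role of [NC].
    return preg_match('/\.nedbatchelder\.com$/i', $host) === 1;
}

var_dump(needs_canonical_redirect('www.nedbatchelder.com')); // bool(true)
var_dump(needs_canonical_redirect('WWW.NedBatchelder.com')); // bool(true)
var_dump(needs_canonical_redirect('nedbatchelder.com'));     // bool(false)
```

The bare host name fails the test because the pattern requires a dot in front of it, which is what keeps the redirect from looping once the browser arrives at the canonical host.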
The last redirects I needed were for those longish blog pages. Here I couldn’t use a single mod_rewrite rule to do all the work, because there’s no pattern to the rewrites. /blog/20030616T212329.html had to become /blog/200306/paul_rands_geometry_books.html. Since there are about 130 of these posts, I could have added 130 lines to .htaccess, but that seemed excessive, and in my setup, it’s a hand-edited file, not generated, so it would have been a pain.
Instead, I used mod_rewrite to rewrite the incoming URL to a generated PHP file that had a table of redirects:
RewriteRule ^blog/200[0-9]{5}T[0-9]{6}\.html /blog/moved.php
This was possible because the old URLs fit a pattern: the URL was created from an ISO8601 timestamp, so I could write a regex to match it, in this case a specific number of digits with a T in the middle. The rewrite this time is to /blog/moved.php, and it isn’t a redirect: Apache executes that file directly to come up with a response to the browser.
The moved.php file looks like this:
<?php
// Redirector for old extended blog entries to their new locations.
$oldnew = array(
    "/blog/20050908T142058.html" => "/blog/200509/amazon_sales_stats_grabber.html",
    "/blog/20071216T110049.html" => "/blog/200712/ancient_history_the_digital_logo.html",
    // ... omitting about 125 entries ...
    "/blog/20040716T065847.html" => "/blog/200407/windows_themes.html",
    "/blog/20041117T084310.html" => "/blog/200411/xml_schema_for_nonxml_data.html"
);
$path = $_SERVER["REQUEST_URI"];
$redir = $oldnew[$path];
if ($redir == "") {
    $redir = "/blog/index.html";
}
header("Location: $redir", TRUE, 301);
?>
An associative array is used to map the incoming URL to the new URL. If nothing is found, default to the blog index page. Finally, use the PHP header() function to redirect the browser with a 301 status code.
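One caveat: REQUEST_URI includes any query string, so an old URL with ?something appended would miss the table and land on the index page. A slightly more defensive lookup might look like the sketch below. resolve_redirect() is a hypothetical helper, not part of the original moved.php, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Sketch only: resolve_redirect() is a hypothetical helper, not part of
// the original moved.php.
function resolve_redirect(array $oldnew, string $requestUri): string
{
    // REQUEST_URI includes any query string, so keep only the path part.
    $path = parse_url($requestUri, PHP_URL_PATH);

    // isset() avoids an undefined-index notice for URLs not in the table.
    if (is_string($path) && isset($oldnew[$path])) {
        return $oldnew[$path];
    }
    return "/blog/index.html";
}

$oldnew = array(
    "/blog/20050908T142058.html" => "/blog/200509/amazon_sales_stats_grabber.html",
);

// A known old URL with a query string still finds its new home.
echo resolve_redirect($oldnew, "/blog/20050908T142058.html?utm=x"), "\n";
// An unknown URL falls back to the blog index.
echo resolve_redirect($oldnew, "/blog/20991231T000000.html"), "\n";
```

In moved.php the return value would feed the existing call, for example header("Location: " . resolve_redirect($oldnew, $_SERVER["REQUEST_URI"]), TRUE, 301).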
After all of this, what have we got?
- I’ve avoided breaking inbound links. As I was working on this redesign, I went back through some old notes of ideas to try, and about half the interesting URLs I tried were broken. I was sad.
- My URLs are more canonical. This helps with search engines and link aggregators, to make sure I don’t get two entries when one will do.
Comments
RewriteRule ^blog/200[0-9][0-9][0-9][0-9][0-9]T[0-9][0-9][0-9][0-9][0-9][0-9]\.html /blog/moved.php
be simplified to something like
RewriteRule ^blog/200\d{5}T\d{6}\.html /blog/moved.php
(Add backslashes to taste)
http://www.ilovejackdaniels.com/cheat-sheets/mod_rewrite-cheat-sheet/ indicates that mod_rewrite supports the {m,n} syntax.
http://httpd.apache.org/docs/2.2/mod/mod_alias.html#redirect
With some clever searching and replacing you could probably convert your huge associative array in PHP to a .htaccess file with Redirects.
Not sure it's worth the trouble though.
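For anyone who wants to try that conversion, a small script can do it instead of search and replace. This is a sketch under assumptions: redirect_lines() is a hypothetical helper, only two entries from the post's table are shown, and the host is prepended because older Apache versions expect an absolute URL as the Redirect target.

```php
<?php
// Sketch only: redirect_lines() is a hypothetical helper that turns the
// old => new table into mod_alias Redirect directives.
function redirect_lines(array $oldnew, string $host): array
{
    $lines = array();
    foreach ($oldnew as $old => $new) {
        // Syntax: Redirect <status> <old URL-path> <new URL>
        $lines[] = "Redirect 301 " . $old . " " . $host . $new;
    }
    return $lines;
}

$oldnew = array(
    "/blog/20050908T142058.html" => "/blog/200509/amazon_sales_stats_grabber.html",
    "/blog/20040716T065847.html" => "/blog/200407/windows_themes.html",
);

// Paste the printed lines into .htaccess.
echo implode("\n", redirect_lines($oldnew, "http://nedbatchelder.com")), "\n";
```

Note that Redirect matches on a URL-path prefix rather than a regex, so these directives behave a little differently from the RewriteRule and PHP table combination.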
Other than being "correct" with the 301, what's the point of a permanent redirect? If you truly want your links to never be broken, you have to leave your redirect there. If the redirect is always there, it wouldn't have to give a 301.
I agree that it's the right thing to do, and I would do the same. Maybe I'm just thinking out loud here :) Can anyone chime in? Is there something I'm missing?
http://html-lesson.blogspot.com/2008/06/redirect-to-web-addres.html
Although I'm trying to do something a little different; is there a way to combine a redirect with a rewrite?
1. Link comes in that's of the old website address: http://www.oldwebaddress.com/pages/products/buffers/pages/2-2.html
2. Match against pattern: pages/products/buffers/pages/[0-9]{2}-[0-9]{2}.html
3. Redirect to page: /range/product-index/gmt-product-index.php#bearings-bushes-1
Is that possible?