
Clean URLs for a better search engine ranking

by Pascal Opitz on February 28 2006, 04:22

Search engines are often key to the successful promotion and running of your website, with high traffic making or breaking your online business. To maximise the visibility of your site in the organic listings of the biggest search engines it is important to strategically work out how keywords are used.

While link building (placing links to the site or from the site) and, most importantly, writing useful content form the foundation of search engine rankings, paying careful attention to how your site treats URLs can influence its ranking massively.

URLs

The messy ones

Most big websites are rendered out of a database, and it is rare to find systems that generate the pages statically onto a webserver to save processing power. Most small to mid-range CMSes render on the fly, and the same applies to most of the tailor-made dynamic sites I've seen so far.
The most common ways of passing information between these dynamic pages are:

  • In a cookie
  • In a session
  • In the request body (POST)
  • In the URL as a querystring (GET)

The last one mentioned is by far the most common. It's also the only way that variables passed to an application can be bookmarked and sent by email to other people, since cookies and sessions are bound to a specific computer and browser. But let's have a look at how a URL works:

  protocol://myserver/folder/file.ext?queryvariable=value#anchorname
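
If you want to see those parts programmatically, PHP's parse_url() splits a URL into exactly these components. A minimal sketch, using http as the protocol in the made-up example above:

<?php
// parse_url() breaks a URL into scheme, host, path, query and fragment
$parts = parse_url('http://myserver/folder/file.ext?queryvariable=value#anchorname');

print_r($parts);
/*
Array
(
    [scheme] => http
    [host] => myserver
    [path] => /folder/file.ext
    [query] => queryvariable=value
    [fragment] => anchorname
)
*/
?>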

Historically, search engines were not able to spider links with querystring parameters because of page rendering speeds and so-called spider traps. Today, most of the big search engine spiders will follow these untidy links, doing their best to strip out the portions that can cause them trouble. Forcing them to do this, however, means that one of the most common and convenient techniques, the GET method and the use of the GET array in various scripting languages, is the worst choice when it comes to search engine rankings.
A URL like this is not ideal for most search engines:

  http://myserver/folder/file.php?pageid=230
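
For context, the script behind such a URL typically just reads the id straight out of the GET array. A minimal sketch of the traditional querystring approach (the database lookup is left out):

<?php
// file.php - the classic querystring approach: the id arrives in $_GET
$page_id = isset($_GET['pageid']) ? (int) $_GET['pageid'] : 0;

// ... look up the page identified by $page_id in the database and render it ...
?>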

The clean ones

Therefore the first step towards improving your URLs is to move the information that triggers the page rendering into another part of the URL, into something similar to these:

  http://myserver/folder/230/file.php

  http://myserver/folder/230.php

  http://myserver/GUID_whatever_230.php

The meaningful ones

But this still is not the ideal URL. Not for people who have to type it in, nor for search engine rankings, since it doesn't contain any meaningful keywords. An ideal example would look more like this:

  http://myserver/this/url/is/stuffed/with/keywords/index.htm

As you can see, this is more legible than any kind of cryptic ID. It is far easier for human visitors to remember and it is keyword-rich for search engines as well. Google pays especially close attention to the keywords within the URL, and they can, if they match what can be found in the content, drastically improve the ranking. I suggest you try to think of a system that logically makes sense and that represents the path to your page, similar to a breadcrumb navigation maybe.
A nice article about dirty and clean URLs can be found on the website of Port80 Software.

Technique

How to rewrite URLs

Now that we have worked out what a good URL should look like, we can rethink the way our web application renders pages. Obviously we need to point the new URLs, which now carry the information, at the same script that used to deal with the query string.
There are a couple of ways to do this: Apache's ForceType directive for example, and there are equivalents for ASP and .NET, but with PHP and Apache the most common technique for rewriting URLs is the Apache module mod_rewrite.
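
Just to sketch the ForceType idea before moving on: an extensionless file can be forced to be parsed as PHP, so that a URL like http://myserver/page/230/keywords is served by that file, with the rest of the path available in $_SERVER['PATH_INFO']. A rough sketch, assuming a handler file simply named "page" and a standard mod_php setup (the MIME type may differ depending on how PHP is installed):

# .htaccess - treat the extensionless file "page" as a PHP script
<Files page>
  ForceType application/x-httpd-php
</Files>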

What is mod_rewrite?

Basically, mod_rewrite is a module for Apache that provides an engine that is able to rewrite URLs to other locations using regular expressions. It is not activated in Apache by default though, and if you run your website on a shared hosting server you might have to ask your hosting provider to get it up and running for you.
To get yourself right into the syntax for URL rewriting I recommend reading A Beginner's Guide to URL Rewriting on sitepoint.com and the URL Rewriting Guide written by Ralf S. Engelschall, the guy who wrote the module.

Rewrite rules and htaccess files

Usually you would define a rewrite rule in an .htaccess file placed in the root folder of your site. I'm just giving a little example here rather than going into too much detail. Please check the comments to see what each line does (note that Apache doesn't allow a comment on the same line as a directive, so each comment sits on its own line).

# activate mod_rewrite
RewriteEngine On

# if the request is inside the admin folder, points to a real file
# or to an existing directory, leave the URL as it is
RewriteCond %{REQUEST_URI} ^/admin.* [OR]
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]

# everything else gets rewritten to myscript.php
RewriteRule ^(.*) myscript.php

A more detailed introduction to rewrite rules can be found on the manual pages of mod_rewrite. Even a quick look will show you that mod_rewrite offers a sophisticated toolkit for rewriting URLs, including searching for files in multiple locations and even time-dependent rewriting. Clean URLs are only one of many reasons to get familiar with mod_rewrite.
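
Just as a taster of what those rules can do: instead of fixing things up in PHP (as in the example below), the numeric id could be captured by the rule itself and handed back to the script as a querystring parameter, so the application code needs no changes at all. A sketch, assuming the /page/100/keywords scheme used in the next section:

# sketch: capture the id from /page/100/whatever-keywords
# and pass it to index.php as the familiar page_id parameter
RewriteEngine On
RewriteRule ^page/([0-9]+)/ index.php?page_id=$1 [L,QSA]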

Fancy an example now?

Enough of the theory. Now that we've found out how to rewrite URLs to a specific file, I want to give a quick and very simple example of how I tweaked old code quickly and efficiently using mod_rewrite and a bit of code. Afterwards my PHP application was capable of handling clean URLs instead of GET parameters... and the whole thing took me just half an hour.

The old URL

In the existing application the rendering output was triggered by the GET parameter "page_id":

http://server/index.php?page_id=100

The new URL

The pattern I worked out for a quick tweak uses the fixed prefix "page", then the page_id (which before was found in the GET parameter) and finally a modified title slug to improve the indexing:

http://server/page/100/here+are+my+keywords
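
To generate such links in the first place, the page title needs to be reduced to a URL-safe slug. A rough sketch of how that could look (the helper function is made up for illustration):

<?php
// hypothetical helper: turn "Here are my keywords!" into "here+are+my+keywords"
function make_slug($title)
{
    $slug = strtolower($title);
    $slug = preg_replace('/[^a-z0-9]+/', '+', $slug); // collapse anything unsafe into "+"
    return trim($slug, '+');
}

echo '/page/100/' . make_slug('Here are my keywords!');
// prints: /page/100/here+are+my+keywords
?>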

Three lines of code

All I needed to do was read the page_id from the URL and assign it to the GET variable. In this case I used a simple regular expression, but you could use explode or any other technique.

<?php
// read the page_id out of the rewritten URL, e.g. /page/100/here+are+my+keywords
preg_match("#^/page/(\d+)/#", $_SERVER['REQUEST_URI'], $match);

if (isset($match[1]))
  $_GET['page_id'] = $match[1];
?>
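
As mentioned, a regular expression is only one option; the same thing done with explode() could look like this (a sketch, assuming the rewritten path always starts with /page/ and ignoring any trailing querystring for brevity):

<?php
// alternative using explode(): /page/100/here+are+my+keywords
$segments = explode('/', trim($_SERVER['REQUEST_URI'], '/'));

if (isset($segments[0], $segments[1]) && $segments[0] == 'page')
  $_GET['page_id'] = $segments[1];
?>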

Security

Always bear in mind that you should never trust the URL. As with all form inputs and GET parameters, you need to escape variables taken out of the REQUEST_URI before you use them in your script, otherwise you're inviting script kiddies to hack your application. This is particularly important for scripts that use eval() or write values into databases, store files or do anything else that could cause serious damage.
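
In the example above that simply means forcing the captured value into the shape you expect before it is used anywhere. A minimal sketch:

<?php
// never trust the URL: force the captured value to an integer before using it
$page_id = isset($match[1]) ? (int) $match[1] : 0;

if ($page_id > 0)
  $_GET['page_id'] = $page_id;

// string values should be restricted to a whitelist pattern and escaped
// with the appropriate database escaping function before they hit a query
?>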

Conclusion

Using clean URLs improves your site and its search engine rankings. It's more likely that people will be able to remember certain locations within the site. Your PageRank in Google is likely to go up and you stand a better chance of turning up in the organic listings.
With mod_rewrite and a couple of small tweaks existing applications can usually be coaxed into using clean URLs.

Comments

  • Nice article! The effect of nice URLs in search engines is indeed absolutely amazing. It’s one of the best instant-result SEO actions one can possibly perform on a site.

    I actually demonstrated mod_rewrite during one of the Linux courses I’m delivering and the students were amazed with how easy it is to use mod_rewrite.

    by Marco on October 6 2005, 10:15 - #

  • Hi

    Do you have that code for ASP.Net too?

    by Henrik on October 6 2005, 10:31 - #

  • On the ASP.NET front we (work) have just started using aspRedirector from www.intesoft.co.uk/aspRedirector/ – it allows for basically try and catch statements to be added to the web.config file.

    I’ve personally had quite a bit of luck using a different, PHP based, method too. Another Sitepoint plug infact – using the Pathvars class from The PHP Anthology book.

    I had a rambling stab (www.morethanseven.net/weblog/40) at writing some of it down but mine was, let’s say, less clear and concise, and more, err, error prone than this article.

    by Gareth on October 6 2005, 12:59 - #

  • Henrik: Sadly I have no real knowledge of .Net, but from a quick read this article sounds like an easy possibility. The configuration of the rewrite engine is XML, very neat me thinks!

    by Pascal Opitz on October 6 2005, 17:04 - #

  • I remember trying to figure out the .htaccess files after enabling clean URLs in a Wordpress install and being absolutely baffled. Nicely explained Pascal!

    Keywords in URLs are given a LOT of weight by the current Google algorithm so this is a great technique for improving organic search engine listings… You just need to look at the URLs of the top 10 sites next time you try to buy some consumer electronics online to see proof of this.

    by Mike Stenhouse on October 7 2005, 06:01 - #

  • Nice article. Always nice to find a new way to construct your URLs.

    A while back I wrote something on the same subject, but with a different approach.

    by Valentin Agachi on October 10 2005, 13:39 - #

  • Thanks Valentin, I was looking into Force-Type myself. But then I prefer mod_rewrite for its sophisticated possibilities to exclude certain folders, or just folders, or physically existing files.

    Just to give you a quick example: Imagine a CMS that, in order to save server load, could export the file output. You actually could run both, static files and dynamic output, seamlessly working together by using the condition

    RewriteCond %{REQUEST_FILENAME} -f 

    All requests that don’t point to an existing file get rewritten to the CMS instead of throwing a 404.

    This obviously goes far beyond the possibilities of Force-Type and opens completely new approaches on how to structure an application.

    by Pascal Opitz on October 11 2005, 04:31 - #

  • Pascal, good article and full of useful links, thanks.

    Just one point: does the `id` in the URL add some value? I think not, so if the slug is a field in the database, why use it to select the row (article)?

    By the way, this very page is accessible through
    http://www.contentwithstyle.co.uk/Articles/64/clean-urls-for-a-better-search-engine-ranking/
    or
    http://www.contentwithstyle.co.uk/Articles/64/
    or
    http://www.contentwithstyle.co.uk/Articles/64/any-random-string/

    by choan on October 11 2005, 11:07 - #

  • choan: Well spotted, and indeed this (old beta) version of textpattern wasn’t able to deal with clean URLs and got tweaked by me in a quick way. Same goes for the example. This is by no means an ideal URL. But it still is way better than query string data.
    The keywords in there are actually only for SEO, nothing else. For the application logic they don’t matter at all.
    Actually the article will be reachable with a GET parameter as well, in order to keep the old links to the site working. So try

    http://www.contentwithstyle.co.uk/?id=64
    

    And you’ll get here as well, which isn’t a bad thing, is it?

    by Pascal Opitz on October 11 2005, 16:05 - #

  • Does this play nice with search engine duplicate content penalties? Surely having the same page available via 4 different URLs counts as duplicates?

    All it takes is someone to link to you with the “wrong” URL… unless I’m missing something? Using robots.txt to exclude old URL patterns perhaps?

    by Mark on October 14 2005, 03:28 - #

  • Mark, this is a good point, and a good idea for dealing with it. However, I can only repeat that the script example is meant as an example, not as the ultimate solution. Same goes for this site. I think you should keep in mind that CwS works with a heavily tweaked textpattern beta version. The site is, unlike a well planned project, “grown organically” and there’s a hell of a lot to be improved.

    by Pascal Opitz on October 14 2005, 08:03 - #

  • Regarding the “duplicate content penalties”:

    I’m not sure that they are black-and-white strict in that sense, and in the case of URL rewriting, where’s the benefit for yourself in setting up this duplicate? Your clicks and links get divided between 2 identical pages, and so would your search engine ranking. That’s penalty enough, no?
    In this extensive search engine ranking doc duplication is mentioned a couple of times, in our case with moderate importance, but think about what else dilutes your “unique” page: leaving out the “www.” is a whole duplicate set of your website (Pascal just fixed that in the RSS), and session keys offer millions of duplicates.
    This very obvious problem makes me think that a penalty is only given when it’s a more severe problem.
    I believe that when it comes to a decision like this, the user should come before the search engine. Let them have 2 URLs, or however many they’d like to have. What’s left to do on our side is to redirect them to the desired URL.

    by Matthias on October 17 2005, 09:03 - #

  • Thanks for your responses guys, I tend to agree – users have to come first, but if you can avoid penalties for dupe content then…

    The www.domain.com vs domain.com issue is also a good point, but a) it’s easy to redirect one to the other and b) search engines would have an easier time catering for this example of duplication than, say a URL like /events/id/32 vs /events.asp?id=32 vs /events/my_event_title.html.

    I guess the upshot is: tough, unless you’re going to robots.txt the URL formats you don’t want spiders to pick up on – which as I understand it means listing every URL (including parameters) in robots.txt.

    by Mark on October 17 2005, 11:23 - #

  • Watch this space, I’m going to investigate and gather a couple of thoughts on how to rewrite unwanted URLs into wanted URLs … promised!

    by Pascal Opitz on October 18 2005, 09:57 - #

  • Even though it might be good for some search engines at present, personally I don’t think this is a good solution for the long term, because it’s non-standard and when you move server you have to set up all the customizations again. Search engines should be able to (or soon will be able to) parse pages regardless of URL format.

    by Son Nguyen on October 31 2005, 22:30 - #

  • Son, I don’t agree here … there is actually a w3c recommendation that suggests how an ideal URL should look.
    A module as common as mod_rewrite should be available at most hosting companies, and if you move to servers where you can do the configuration yourself it’s even easier to set up.
    The effort to support accessible and legible URL schemes rather than cryptic querystrings is a minor price to pay, I reckon.
    Another thing to say is that “should be able” implies that they aren’t. Just like “all browsers should be standards compliant”, it has nothing to do with solving the problem.

    by Pascal Opitz on November 1 2005, 05:06 - #

  • Keywords in URLs do actually give you a LOT of weight by the current Google algorithm so this is a great technique for improving organic search engine listings…

    by Keyword Rankings on November 14 2005, 15:08 - #

  • ok

    by kkaa on November 28 2005, 14:28 - #

  • Nice tutorial on mod rewrite, it’s easier than I thought.

    by Marko on December 27 2005, 18:59 - #

  • Marko, don’t forget that mod_rewrite is more complex than just what I was showing up till this point. In fact it can manage fairly complex rewrite rules using regular expressions. Maybe that’s worth another article though, I’ll look into it in January.

    by Pascal Opitz on December 29 2005, 18:49 - #

  • I wonder if Google just changed the way they deal with duplicate urls. My site just had most of the correct links dropped and the duplicate and often incorrect/outdated links remained.

    In fact, look at this page that shows blogs and their number of google pages indexed. http://blognetworklist.com/bgooglepages.php

    Why so many zeros? Did blog network blogs just get penalized?

    by Dave on January 16 2006, 01:11 - #

  • Awesome article thanks

    by Doug on November 28 2006, 07:45 - #

  • On a scale of 1 to 10, your article “Clean URLS for Search Engine Ranking”, rates a 10, M. Opitz.

    Mod Rewrites on Apache servers are CPU intensive because you are pattern matching against every single file and directory request including .gifs, .jpg, .html, .php etc… and if you have LARGE volumes of traffic, you will eventually run into performance problems because of all this parsing.

    Word to the wise. KISS. Keep your .htaccess files AS SIMPLE and short. Less is better!

    Know that if your website is running on a shared hosting server, your service provider may eventually tell you that you have to get off the shared server and go dedicated because you are taking up too much CPU and slowing down the HTML publishing of 100 or more shared-hosting websites also hosted on your shared server.

    But in terms of “Bang for the Buck”, in the spirit of better SEO and Google Pagerank – Apache Mod-Rewrites and CLEAN, HUMAN-readable, keyword-rich URLs are worth it!

    If you need a good host with easy to use tools that allows you complete control over your shared hosting environment so you can play with the .htaccess file and use Mod-Rewrites:

    Mike Filsaime, Mark Joyner, Armondo Montelongo (Flip this House on A&E), myself and http://ebiz-iq.com/recommends/kiosk

    Here are some more good HTACCESS tips:
    http://www.IsPopularOnline.com/search/htaccess+rewrite

    Looking forward to your next article!

    JTMcNaught
    http://www.IsPopularOnline
    Where Little-Guys get Noticed Online
    and 1000’s promote your URLs

    by JTMcNaught on May 17 2007, 11:28 - #

  • Does this play nice with search engine duplicate content penalties? Surely having the same page available via 4 different URLs counts as duplicates...

    by Tercüme bürosu on January 16 2008, 04:11 - #

  • Probably not, but then again the example was meant to be a quick fix for querystring driven blogs, not a proper rewrite strategy. I recommend setting up 301 redirects ... in an .htaccess file for example.

    by Pascal Opitz on January 16 2008, 12:40 - #

