Content with Style

Web Technique

A mass validation shell script

by Pascal Opitz on December 14 2008, 19:45

I was looking for a CLI script that would validate a whole site for me, but I couldn't find one that worked without installation issues. So I hacked together a shell script that does the job for me by downloading the whole site and then running the files through the validator.

Prerequisites

The shell script uses cURL and wget (Wget for OS X, in my case), plus the "SOAP API" of the W3C validator.

I am putting "SOAP API" in quotation marks because it does not really support SOAP calls; it only wraps the response in a SOAP envelope. That's why I am using cURL to POST the files.
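
To give an idea of what comes back, here is a trimmed response, reconstructed from memory, so treat it as illustrative; the exact nesting may differ between validator versions, but the element names are the ones the grep and the XSLT further down rely on:


<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse
     xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:validity>false</m:validity>
      <m:errors>
        <m:errorcount>1</m:errorcount>
        <m:errorlist>
          <m:error>
            <m:line>12</m:line>
            <m:col>3</m:col>
            <m:message>end tag for "div" omitted</m:message>
          </m:error>
        </m:errorlist>
      </m:errors>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>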

For this example I installed Validator S.A.C. and followed the instructions to get it running as a local service. Of course, if you are on Linux, you can install the validator from source or as a package. Alternatively, you can change the script to use validator.w3.org/check instead of localhost/w3c-validator/check, but that might run pretty slowly and create a lot of traffic.
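
For reference, pointing the check at the public service is a one-line change to the cURL call; checking a single file would look like this (index.html being a stand-in for any local file):


curl -s -F uploaded_file=@index.html -F output=soap12 http://validator.w3.org/check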

The script

A word of warning: the script deletes any existing temp directory and log.txt file before recreating them. I am in no way responsible for any of your stuff getting deleted by running this.

But hey: feel free to alter this to fit your needs (and maybe post improvements in the comments, for example for my sloppy way of detecting whether a file is HTML; one possible tightening is sketched after the script).


#!/bin/sh
#
# Script to validate files in directory
#
is_html() {
  # crude check: treat any file containing an opening <html tag as HTML
  htmlstart=`grep '<html' "$1"`

  if [ "$htmlstart" != "" ]; 
  then 
    echo "1";
  fi
}

validate_file() {
  # POST the file to the local validator and ask for SOAP 1.2 output
  curl -s -F uploaded_file=@"$1" -F output=soap12 localhost/w3c-validator/check
}

download_site() {
  cd temp
  echo 'downloading files ...'
  # recursive, quiet, convert links, force directory structure,
  # add .html extensions, no depth limit
  wget -r -q -k -x -E -l 0 "$1"
  echo 'done downloading files'
  cd ..
}

setup() {
  # start from a clean slate: remove any previous log and temp directory
  rm -f log.txt
  rm -Rf temp
  mkdir temp
  touch log.txt
}

run_validation() {
  # -type f skips directories, which grep would otherwise choke on
  for file in `find $1 -type f`;
    do 
      htmltrue=`is_html "$file"`

      if [ "$htmltrue" = "1" ];
      then
        echo "request validation: $file"
        rpc=`validate_file "$file"`

        echo "checking response: $file"
        noerror=`echo $rpc | grep '<m:errorcount>0</m:errorcount>'`

        if [ "$noerror" = "" ];
        then
          echo "Error in file $file"
          echo "----------------" >> log.txt
          echo "Error in file $file" >> log.txt
          echo $rpc >> log.txt
          echo "" >> log.txt
          echo "----------------" >> log.txt
        fi
      fi
    done;

  has_errors=`cat ./log.txt | grep Error`

  if [ "$has_errors" = "" ];
  then
    echo "no errors found" >> log.txt
  fi
}

setup
download_site "$1"
run_validation ./temp/
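
To run it, save the script (validate.sh is simply what I call it here), make sure the validator answers on localhost, and pass the site URL as the only argument:


sh validate.sh http://www.example.com/
cat log.txt


And since I asked for improvements to the HTML detection: one possible tightening, just a sketch, would be to only scan files that wget saved with an HTML extension, which the -E flag takes care of for HTML content:


is_html() {
  # only consider files with an HTML-ish extension, and only scan
  # the first few lines for an opening <html tag
  case "$1" in
    *.html|*.htm) head -n 20 "$1" | grep -qi '<html' && echo "1" ;;
  esac
}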

Update

I have slightly modified the script to produce better error messages, using xsltproc to parse the SOAP envelope returned by the validator. Here is the updated script:


#!/bin/sh
#
# Script to validate files in directory
#
is_html() {
  # crude check: treat any file containing an opening <html tag as HTML
  htmlstart=`grep '<html' "$1"`

  if [ "$htmlstart" != "" ]; 
  then 
    echo "1";
  fi
}

validate_file() {
  # POST the file to the local validator and ask for SOAP 1.2 output
  curl -s -F uploaded_file=@"$1" -F output=soap12 localhost/w3c-validator/check
}

download_site() {
  cd temp
  echo 'downloading files ...'
  # recursive, quiet, convert links, force directory structure,
  # add .html extensions, no depth limit
  wget -r -q -k -x -E -l 0 "$1"
  echo 'done downloading files'
  cd ..
}

setup() {
  # start from a clean slate: remove any previous log and temp directory
  rm -f log.txt
  rm -Rf temp
  mkdir temp
  touch log.txt
}

run_validation() {
  # -type f skips directories, which grep would otherwise choke on
  for file in `find $1 -type f`;
    do 
      htmltrue=`is_html "$file"`

      if [ "$htmltrue" = "1" ];
      then
        echo "request validation: $file"
        rpc=`validate_file "$file"`

        echo "checking response: $file"
        noerror=`echo $rpc | grep '<m:errorcount>0</m:errorcount>'`

        if [ "$noerror" = "" ];
        then
          # collapse double slashes so the reported path looks clean
          filelocation=`echo "$file" | sed "s/\/\//\//g"`
          echo $rpc > temp_error.xml
          xsltproc --stringparam location "$filelocation" error_template.xsl temp_error.xml >> log.txt
          rm temp_error.xml

          echo "Error in file $file"
        fi
      fi
    done;

  has_errors=`cat ./log.txt | grep Error`

  if [ "$has_errors" = "" ];
  then
    echo "no errors found" >> log.txt
  fi
}

setup
download_site "$1"
run_validation ./temp/

As you can see, we also need a file called error_template.xsl; here is an example:


<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns="http://www.w3.org/TR/xhtml1/strict"
 xmlns:m="http://www.w3.org/2005/10/markup-validator"
 xmlns:env="http://www.w3.org/2003/05/soap-envelope"
>
  <xsl:output 
    method="text"
    omit-xml-declaration="yes"    
  />
  
  <xsl:param name="location" />

  <xsl:template match="/">
  	<xsl:call-template name="divider" />
  	<xsl:value-of select="//m:errorcount" />
  	<xsl:text> Errors in </xsl:text>
  	<xsl:value-of select="$location" />
  	<xsl:call-template name="lb" />
    <xsl:apply-templates select="//m:error" />
  </xsl:template>


  <xsl:template match="m:error">
  	<xsl:text> Line </xsl:text>
  	<xsl:value-of select="m:line" />
  	<xsl:text>, Col </xsl:text>
  	<xsl:value-of select="m:col" />
  	<xsl:text>:</xsl:text>
  	<xsl:call-template name="lb" />
  	<xsl:value-of select="m:message" />
  	<xsl:call-template name="lb" />
  </xsl:template>
  
  <xsl:template name="lb"><xsl:text>
</xsl:text></xsl:template>

  <xsl:template name="divider">
    <xsl:text>--------------</xsl:text><xsl:call-template name="lb" />
  </xsl:template>
</xsl:stylesheet>
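
To test the transform in isolation, it can be invoked by hand with the same call the script makes, against a saved response (the location value and the temp_error.xml file holding one SOAP envelope are stand-ins here):


xsltproc --stringparam location ./temp/www.example.com/index.html error_template.xsl temp_error.xml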

I think this would be easily adaptable to produce XML or HTML files. I'd like to figure out where wget downloaded each file from, so that I could insert it into the generated output, as a hyperlink for example. But apart from that, I think it performs pretty neatly.
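
For what it is worth, a rough sketch of getting the URL back: wget -x mirrors each URL into a host/path directory tree under temp, so stripping that local prefix and re-attaching the scheme should come close. Untested, and the -E flag may have appended an .html extension that the original URL did not have:


# hypothetical mapping from a mirrored file path back to its URL
url=`echo "$file" | sed 's|^\./temp/||'`
url="http://$url"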

Comments

  • With Validator-SAC, you can also run the validator directly from the command line without setting up a server. The simplest form is just to call it with an http or file URL:

    /Applications/Validator-SAC.app/Contents/Resources/weblet http://habilis.net/

    A full query string can also be used with added parameters:

    /Applications/Validator-SAC.app/Contents/Resources/weblet 'uri=http://apple.com/&output=soap12'

    The weblet script outputs a CGI response, so the first few lines are CGI headers:

    
    Content-Type: application/soap+xml; charset=UTF-8
    X-W3C-Validator-Recursion: 1
    X-W3C-Validator-Status: Valid
    X-W3C-Validator-Errors: 0
    
    <?xml version="1.0" encoding="UTF-8"?>
    ... rest of SOAP response ...
    

    by Chuck Houpt on December 15 2008, 01:35 #

  • Thanks Chuck, very helpful insight. Also thanks for creating Validator-SAC in the first place. A great tool to have for us Mac dummies.

    by Pascal Opitz on December 15 2008, 10:51 #

  • By the way, various people on MSN told me I should have avoided downloading the whole site with wget and instead used the sitemap to pass the URLs directly. Good idea, and maybe worth implementing in the future, maybe with a sitemap as an optional parameter?
    Also, my experience with wget is limited, but there is a spider mode. Maybe it's worth just taking the URLs that the spider finds instead of downloading everything to a temp folder?
    Comments welcome!

    by Pascal Opitz on December 15 2008, 11:01 #

  • Another method would be to use a link-checking program (like Linklint) to crawl the site and produce a list of URLs to validate. Using a link-checker would have the added benefit of checking for broken internal and off-site links.

    by Chuck Houpt on December 20 2008, 14:32 #

  • Useful script, Pascal. :) I usually use -A.html in the wget command to retrieve just the HTML files.

    by iñigo medina on August 10 2009, 09:47 #

  • I've written a Python script for the same purpose: http://maestric.com/doc/python/recursive_w3c_html_validator

    by Jérôme Jaglale on October 28 2009, 21:06 #

  • Thanks Pascal for this awesome mass validation script. I seem to be having problems running it for files with the extension ".htm". Would you have any idea why? Any feedback would be much appreciated. Thanks!

    by Connie Chung on September 4 2009, 02:46 #