kuujinbo_dot_info

Mashup or Screen Scraping?

Posted 2009-08.

Lately one of the big Web 2.0 buzzwords is "mashups". If you ask me, it's just screen scraping renamed. Whatever you prefer to call it, one technique for adding third-party content to your site is to hack RSS feeds (the technical community's definition of hack, NOT the media's definition - explained here).

RSS has gone through a few specifications, the latest being RSS 2.0. But all of the formats are XML, so they're easy to hack. For example, the Yahoo! Weather RSS feed has a simple, well-defined public interface that makes it easy to add weather to your site.

Getting Local Weather

Is simple: issue an HTTP GET request from your application, specifying a location code (or US zip code) and the desired temperature unit as querystring parameters, and you get an XML document back. In other words, it's a two-step process to get the data you want:

  1. Use the WebClient class to issue the HTTP GET.
  2. Use the XmlReader class to parse the XML document and get the data you want.

Something like this:

private string _get_weather() {
  // API documentation: http://developer.yahoo.com/weather/
  string url = @"http://weather.yahooapis.com/forecastrss?p=" + _location;
  if (_celcius) url += @"&u=c";
  string xml = "";
  // get Yahoo! Weather RSS feed for specified location in 
  // specified temperature unit
  using (WebClient wc  = new WebClient()) {
    wc.Encoding = System.Text.Encoding.UTF8;
    xml         = wc.DownloadString(url);
  }

  string results = "";
  // 'item' XML node contains 'title' element we want
  // boolean flags used to ignore all but wanted nodes
  bool seen_item  = false;
  bool seen_title = false;
  bool seen_cdata = false;

  // SAX-style, forward-only (serial access) reader
  using (XmlReader xr  = XmlReader.Create(new StringReader(xml))) {
    xr.MoveToContent();
    while (xr.Read()) {
      if (seen_cdata) break;

      if (xr.NodeType == XmlNodeType.Element && !seen_title) {
        if (xr.Name == "item") seen_item = true;
        if (seen_item && xr.Name == "title") {  // location/time
          results += "<h1>" + xr.ReadElementContentAsString() + @"</h1>";
          seen_title = true;  // stop looking at further elements
        }
      }

      // CDATA stores local weather in a HTML string
      if (xr.NodeType == XmlNodeType.CDATA) {
        results += xr.ReadContentAsString();
        seen_cdata = true;
      }
    }
  }
  return seen_cdata
    ? results
    : String.Format( // no CDATA section, no weather
      "Sorry, your location '{0}' was not found. Please try again.",
      _location
    );
}
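
To try the flag-based parse without hitting the network, here is a minimal sketch that runs the same kind of loop over two inline documents. The sample XML is hand-written and hypothetical: it only imitates the item/title/CDATA shape the method above relies on and is NOT real feed output. The class and member names (FeedParseSketch, Parse, Found, NotFound) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class FeedParseSketch {
  // Hypothetical stand-in for a successful response (not real feed output).
  public const string Found = @"<rss version=""2.0""><channel>
<title>channel title (ignored)</title>
<item>
<title>Conditions for Sample City</title>
<description><![CDATA[<b>Current Conditions:</b> Fair]]></description>
</item>
</channel></rss>";

  // Hypothetical stand-in for an unknown location: no CDATA section.
  public const string NotFound = @"<rss version=""2.0""><channel>
<title>channel title (ignored)</title>
<item>
<title>City not found</title>
<description>plain text, no CDATA</description>
</item>
</channel></rss>";

  // Returns the HTML fragment, or null when no CDATA section was seen.
  public static string Parse(string xml) {
    string results = "";
    bool seen_item = false, seen_title = false, seen_cdata = false;
    using (XmlReader xr = XmlReader.Create(new StringReader(xml))) {
      xr.MoveToContent();
      while (!seen_cdata && xr.Read()) {
        if (xr.NodeType == XmlNodeType.Element && !seen_title) {
          if (xr.Name == "item") seen_item = true;
          // only the first 'title' inside 'item' is wanted
          if (seen_item && xr.Name == "title") {
            results += "<h1>" + xr.ReadElementContentAsString() + "</h1>";
            seen_title = true;
          }
        }
        // Value holds the text of the CDATA node
        if (xr.NodeType == XmlNodeType.CDATA) {
          results += xr.Value;
          seen_cdata = true;
        }
      }
    }
    return seen_cdata ? results : null;
  }

  static void Main() {
    Console.WriteLine(Parse(Found));
    Console.WriteLine(Parse(NotFound) ?? "location not found");
  }
}
```

Returning null instead of a ready-made error message is just a choice for this sketch; it keeps the "no CDATA, no weather" check separate from whatever text the page ends up showing.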

Notes

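  • It should be evident from the comments that we're using a SAX-style serial-access parser. Strictly speaking, XmlReader is a forward-only pull reader rather than a push-model SAX parser, but the trade-offs are the same. (see SAX vs DOM parsers)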
  • It should also be evident that if the XML document returned by the feed does not contain a CDATA (character data) section, the requested location could not be found.
  • I took the easy way out and used jQuery to implement the demo (no ASP.NET server controls).
  • I searched for, but couldn't find, a list of world-wide location codes. So I saved the Japan location codes page and parsed the file with a simple Perl script:
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# local saved file. maybe if i get more motivated
# i'll add a LWP::UserAgent to screen scrape
# http://weather.yahoo.com/ and save all the location
# codes in a database...
my $t     = HTML::TreeBuilder->new_from_file('yj.htm');
my $body  = $t->find('body'); 
my $l     = $body->extract_links('a');

my $wanted = '/forecast/JAXX';
my %ids;

for my $href (@$l) {
  next if !$href->[0]
    || $href->[0] !~ m#$wanted#io
  ;

  # if match, save location code (list assignment captures $1)
  my ($id) = $href->[0] =~ m#(jaxx\d+)#i;
  my $k = $href->[1]->as_text();
  # and text for hyperlink
  $ids{$k} = $id  if $id && $k; 
}

# dump HTML fragment for Ajax-enabled UI
print "<select>\n";
print "<option value='$ids{$_}'>$_</option>\n" for sort keys %ids;
print '</select>';
