Making sense of GOA's WAR news format

The news content of GOA's war-europe.com is easy to get, e.g. from here (German version). Replace the "de" in the URL with the required language code. It is an XML file with a rather simple top-level structure:

Why the hell GOA wouldn't stick with good old-fashioned HTML in the first place, as even the US site does, is a mystery. Instead they decided to construct a horrid mess of barely working Flash that would have been embarrassing in 2001. The only good thing I can say about it is that the login process is encrypted and talks to an HTTPS server (authid.goa.com).

But for now, here's the XML structure as far as I could figure it out.

Element structure

WAR

Attributes:

news

Attributes:

ti, hl, ct

Those elements are title, headline, and content. Title and headline are straightforward text in CDATA. The content element is more fun, since it encapsulates some kind of pseudo-HTML markup as CDATA.
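As a rough sketch of reading those three elements, something like the following should work with REXML. The sample feed is made up for illustration: only the WAR > news > ti/hl/ct nesting is taken from the notes above, the attributes are left out, and `extract_news` is a hypothetical helper name.

```ruby
require 'rexml/document'

# Hypothetical sample feed: only the WAR > news > ti/hl/ct nesting comes from
# the article; the texts are invented and all attributes are omitted.
SAMPLE_FEED = <<~XML
  <WAR>
    <news>
      <ti><![CDATA[Example title]]></ti>
      <hl><![CDATA[Example headline]]></hl>
      <ct><![CDATA[Some <b>pseudo-HTML</b> body]]></ct>
    </news>
  </WAR>
XML

# Returns one hash per news element holding the raw CDATA text of ti, hl and ct.
# Missing elements are mapped to an empty string.
def extract_news(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('WAR/news').map do |news|
    %w[ti hl ct].each_with_object({}) do |name, item|
      element = news.elements[name]
      item[name] = element ? element.text.to_s : ''
    end
  end
end

extract_news(SAMPLE_FEED).each do |item|
  puts "#{item['ti']} / #{item['hl']}"
end
```

The ct text comes back with its markup intact, since CDATA is not parsed as XML; that is exactly what the postprocessing has to deal with.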

Markup Structure

An HTML conversion could look like:

  <h1>(contents of ti)</h1>
  <h2>(contents of hl)</h2>
  (postprocessed contents of ct)

The postprocessing is probably the trickiest part, since it involves transforming semantically loose structures to more rigid XHTML. Steps would include paragraph separation and list item separation.
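A minimal sketch of those two steps, under assumptions that the feed may not always honour: an li ... /li pair wraps a whole list with one item per line, and every other non-empty line is a paragraph of its own. The helper names (`escape_html`, `postprocess`, `news_to_html`) are made up for this example, and any other inline markup in the content is passed through untouched.

```ruby
# Escapes the characters that matter inside element content.
def escape_html(text)
  text.gsub(/[&<>]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;')
end

# Assumption: <li>...</li> wraps a whole list (one item per line) and all
# remaining non-empty lines are separate paragraphs.
def postprocess(ct)
  # The capture group keeps the list chunks in the split result.
  ct.split(/(<li>.*?<\/li>)/mi).map { |chunk|
    is_list = chunk =~ /\A<li>/i
    body = chunk.sub(/\A<li>/i, '').sub(/<\/li>\z/i, '')
    lines = body.split(/[\r\n]+/).map(&:strip).reject(&:empty?)
    if is_list
      "<ul>\n" + lines.map { |line| "  <li>#{line}</li>\n" }.join + "</ul>\n"
    else
      lines.map { |line| "<p>#{line}</p>\n" }.join
    end
  }.join
end

# Builds the h1/h2/body layout from the three raw element texts.
def news_to_html(ti, hl, ct)
  "<h1>#{escape_html(ti)}</h1>\n<h2>#{escape_html(hl)}</h2>\n#{postprocess(ct)}"
end

puts news_to_html('Title', 'Headline', "Intro line\n<li>one\ntwo</lI>\nOutro")
```

The list regex is case insensitive on purpose, so a mismatched closing tag such as /lI still terminates the list.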

Update 2009-04-23: One of today's news items brought along a list started with 'li' and terminated by '/lI'. This seems to suggest that whatever they are using for the Flash site is case insensitive, while XML is not. So during processing, it would be advisable to generally lowercase all tag names.
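Lowercasing the tag names could be done up front with a single substitution. This is only a sketch; `lowercase_tags` is a hypothetical helper, and it assumes tag names consist of ASCII letters and digits.

```ruby
# Lowercases the name of every opening or closing tag and leaves everything
# else (text, attribute values, stray "<" characters) untouched.
def lowercase_tags(markup)
  markup.gsub(/<(\/?)([A-Za-z][A-Za-z0-9]*)/) { "<#{$1}#{$2.downcase}" }
end

puts lowercase_tags("<li>first\nsecond</lI>")
```

Running this before any other step means the later patterns only ever have to match lowercase names.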

Postprocessing in Ruby

Given a message body, a single level of lists can be easily converted to MediaWiki markup by the following construct:

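# let input be the raw CDATA
# the i flag also catches mixed-case tags such as </lI>
input.gsub!(/<li>(.*?)<\/li>/mi) {
  # one bullet per non-empty line; join so the block returns a String
  # (without it, Ruby 1.9+ would insert the Array's inspect output)
  $1.split(/[\r\n]+/).map { |i| i.strip }.reject { |i| i.empty? }.map { |i| "* #{i}\n" }.join
}

The output is suitable for processing by MediaCloth.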

By Shadowdancer, 2009-04-03, 12:27; permalink;
Last updated at 2009-04-23, 17:09 by Shadowdancer