Making sense of GOA's WAR news format

You can easily get the news content of GOA's war-europe.com from e.g. here (German version). Replace the "de" in the URL with the required language code. It is an XML file with a rather simple top-level structure.

Why the hell GOA wouldn't stick with good old-fashioned HTML in the first place, as even the US site does, is a mystery. Instead they decided to construct a horrid mess of badly working Flash that would have been embarrassing in 2001. The only good thing I can say about it is that the login process is encrypted, contacting an HTTPS server (authid.goa.com).

But for now, here's the XML structure as far as I could figure it out.

Element structure

WAR

Attributes:

news

Attributes:

ti, hl, ct

Those elements are title, headline, and content. Title and headline are straightforward text in CDATA. The content element is more fun, since it encapsulates some kind of pseudo-HTML markup as CDATA.
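To illustrate, here is a minimal sketch of pulling the three elements out of a feed with REXML. The sample document is made up for this example (the real feed carries attributes and more items that are not reproduced here), and the helper name extract_news is hypothetical.

```ruby
require 'rexml/document'

# Made-up sample that only mirrors the WAR > news > ti/hl/ct nesting
# described above; it is not a copy of the real feed.
SAMPLE_FEED = <<~XML
  <WAR>
    <news>
      <ti><![CDATA[Patch notes]]></ti>
      <hl><![CDATA[Version 1.2 is live]]></hl>
      <ct><![CDATA[Hello <b>world</b>]]></ct>
    </news>
  </WAR>
XML

# Concatenate all text/CDATA children of a child element ('' if missing).
def cdata_of(parent, name)
  el = parent.elements[name]
  el ? el.texts.map(&:value).join : ''
end

# Hypothetical helper: returns one hash per news element.
def extract_news(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//news').map do |news|
    {
      title:    cdata_of(news, 'ti'),
      headline: cdata_of(news, 'hl'),
      content:  cdata_of(news, 'ct')
    }
  end
end
```

The content value comes back as a raw string with the pseudo-HTML still in it, so it needs the postprocessing described further down before it can be displayed.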

Markup Structure

An HTML conversion could look like:

  <h1>(contents of ti)</h1>
  <h2>(contents of hl)</h2>
  (postprocessed contents of ct)

The postprocessing is probably the trickiest part, since it involves transforming semantically loose structures to more rigid XHTML. Steps would include paragraph separation and list item separation.
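The two steps can be sketched roughly as follows. This assumes that paragraphs are separated by line breaks and that list items are the lines between one 'li' and its closing tag; both are guesses from the items I have seen, not documented behaviour. All method names are made up for this example, and inline markup such as bold is passed through untouched.

```ruby
require 'cgi'

# Split text on line breaks, dropping blank lines and surrounding whitespace.
def lines_of(text)
  text.split(/[\r\n]+/).map(&:strip).reject(&:empty?)
end

# Turn the raw ct markup into paragraphs and unordered lists.
def postprocess_ct(ct)
  # The capture group makes split keep the <li>...</li> blocks in the result.
  ct.split(/(<li>.*?<\/li>)/mi).map { |part|
    if (m = part.match(/\A<li>(.*)<\/li>\z/mi))
      "<ul>\n" + lines_of(m[1]).map { |item| "  <li>#{item}</li>\n" }.join + "</ul>\n"
    else
      lines_of(part).map { |para| "<p>#{para}</p>\n" }.join
    end
  }.join
end

# Title and headline are plain text, so they are escaped; ct is postprocessed.
def news_to_html(ti, hl, ct)
  ["<h1>#{CGI.escapeHTML(ti)}</h1>\n",
   "<h2>#{CGI.escapeHTML(hl)}</h2>\n",
   postprocess_ct(ct)].join
end
```

Nested lists and text-level formatting that crosses block boundaries are not handled by this sketch.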

Update 2009-04-23: One of today's news items brought along a list started with 'li' and terminated by '/lI'. This seems to suggest that whatever they are using for the Flash site is case insensitive, while XML is not. So during processing, it would be advisable to generally lowercase all tag names.
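A minimal sketch of such a normalisation step, run before any other processing. The method name downcase_tags is made up, and the regex only assumes that a tag name directly follows '<' or '</'.

```ruby
# Lowercase tag names only; text and anything after the tag name stay as they are.
def downcase_tags(markup)
  markup.gsub(/<(\/?)([A-Za-z][A-Za-z0-9]*)/) { "<#{$1}#{$2.downcase}" }
end
```

With this, a list opened by 'li' and closed by '/lI' ends up with matching tags.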

Update 2010-05-21: The markup allows some things that are tricky to convert to HTML, e.g., runs of bold text spanning paragraphs, lists, and other "block" elements. When parsing the markup, you technically have to make sure you split the text-level formatting, closing it at the end of a generated element, and re-opening it in the next. Seeing how ugly the handling of that is in the MediaWiki parser, I'm not inclined to touch it.
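For what it's worth, the close-and-reopen idea could be sketched like this, working on text that has already been split into blocks. The set of inline tags ('b', 'i', 'u') is an assumption, since I only know for sure that bold exists, and the method name is made up. It only carries open tags across block boundaries; it does not repair tags that are mis-nested within a single block.

```ruby
# Assumed set of text-level tags; adjust to whatever the feed really uses.
INLINE_TAGS = %w[b i u].freeze

# blocks: array of strings, one per generated block element.
# Returns new strings in which every block is balanced on its own.
def rebalance_inline(blocks)
  open_tags = [] # inline tags still open at the end of the previous block
  blocks.map do |block|
    # Re-open whatever was left open by earlier blocks.
    prefix = open_tags.map { |t| "<#{t}>" }.join
    block.scan(/<(\/?)([a-z]+)>/i) do |slash, name|
      name = name.downcase
      next unless INLINE_TAGS.include?(name)
      if slash.empty?
        open_tags.push(name)
      else
        idx = open_tags.rindex(name)
        open_tags.delete_at(idx) if idx
      end
    end
    # Close what is still open, innermost tag first.
    suffix = open_tags.reverse.map { |t| "</#{t}>" }.join
    prefix + block + suffix
  end
end
```

For example, a bold run that starts in one paragraph and ends in the next gets closed at the end of the first and re-opened at the start of the second.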

Postprocessing in Ruby

Given a message body, a single level of lists can be easily converted to MediaWiki markup by the following construct:

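# Let input be the raw CDATA of the ct element.
# The i flag also catches mixed-case tags such as '/lI'.
input.gsub!(/<li>(.*?)<\/li>/mi) {
  # One "* item" line per non-empty input line; join, because the block
  # must return a String (an Array would be inserted in its inspect form).
  $1.split(/[\r\n]+/).map(&:strip).reject(&:empty?).map { |i| "* #{i}\n" }.join
}

The output is suitable for processing by MediaCloth.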

By Shadowdancer, 2009-04-03, 12:27; permalink;
Last updated at 2010-05-21, 10:37 by Shadowdancer
