Making sense of GOA's WAR news format
You can easily get the news content of GOA's war-europe.com as a plain file, e.g. from here (German version). Replace the "de" in the URL with the required language code. It is an XML file with a rather simple top-level structure:
- WAR
  - news
    - ti
    - hl
    - ct
  - news
Why the hell GOA wouldn't stick with good old-fashioned HTML in the first place, as even the US site does, is a mystery. Instead they decided to construct a horrid mess of badly working Flash that would have been embarrassing in 2001. The only good thing I can say about it is that the login process is encrypted, contacting an HTTPS server (authid.goa.com).
But for now, here's the XML structure as far as I could figure it out.
Element structure
WAR
Attributes:
- srv: ? ("homenews")
- lg: Language
- pop: ? ("logged")
- pb: ? (ISO datetime, perhaps last update?)
news
Attributes:
- d: ISO date
- cat: Category
- order: ? (unused? used to push items to the top?)
- id: Looks like language code plus a numerical ID.
ti, hl, ct
Those elements are title, headline, and content. Title and headline are straightforward text in CDATA. The content element is more fun, since it encapsulates some kind of pseudo-HTML markup as CDATA.
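To make the element structure above a bit more concrete, here is a minimal sketch of reading it with REXML (bundled with Ruby). The sample document, its attribute values, and the names `SAMPLE`, `cdata` and `parse_news` are made up for illustration; only the element and attribute names come from the description above.

```ruby
require 'rexml/document'

# Hypothetical sample; all attribute values are invented for illustration.
SAMPLE = <<~XML
  <WAR srv="homenews" lg="de" pop="logged" pb="2009-04-03T12:00:00">
    <news d="2009-04-03" cat="example" order="0" id="de123">
      <ti><![CDATA[Title]]></ti>
      <hl><![CDATA[Headline]]></hl>
      <ct><![CDATA[Body text]]></ct>
    </news>
  </WAR>
XML

# Concatenate all text/CDATA children of an element ("" if it is missing).
def cdata(element)
  element ? element.texts.map(&:value).join.strip : ""
end

# Turn the feed into plain hashes; the meaning of srv, pop, pb and order is
# still unclear, so only pb is passed through, and only as a raw string.
def parse_news(xml)
  root = REXML::Document.new(xml).root
  items = root.get_elements('news').map do |n|
    {
      id:       n.attributes['id'],
      date:     n.attributes['d'],
      category: n.attributes['cat'],
      title:    cdata(n.elements['ti']),
      headline: cdata(n.elements['hl']),
      content:  cdata(n.elements['ct'])
    }
  end
  { language: root.attributes['lg'], pb: root.attributes['pb'], items: items }
end

feed = parse_news(SAMPLE)
puts feed[:items].first[:title]
```

The content field is left untouched here, since the pseudo-HTML inside it needs separate treatment.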
Markup Structure
- Line breaks are given as <CR>, so the "normal" approach of treating <CR><CR> as a paragraph break seems to apply.
- Lists are given as a single element (li), with items separated by single newlines. Consequently, a list would look like this, for example:
  <li>A
  B</li>
- There are actually nested lists, which require a more sophisticated parser than the simple replacement below. Example:
  <li>A
  <li>Aa
  Ab</li>
  B</li>
- Image tags are normal HTML, but flow seems to be fucked in GOA's Flash "renderer", so images may be followed by a bunch of newlines which look like a manual attempt at getting the spacing right.
An HTML conversion could look like:
<h1>(contents of ti)</h1>
<h2>(contents of hl)</h2>
(postprocessed contents of ct)
The postprocessing is probably the trickiest part, since it involves transforming semantically loose structures to more rigid XHTML. Steps would include paragraph separation and list item separation.
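The paragraph separation step could be sketched roughly as follows. This assumes the line breaks arrive as plain CR/LF characters in the CDATA, and that two or more of them in a row mean a paragraph break; lists are not handled here. The helper names `h`, `paragraphs` and `to_html` are invented for this example.

```ruby
# Escape plain text (title, headline) for use in HTML.
def h(text)
  text.gsub(/[&<>]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;')
end

# Split the content into paragraphs on runs of two or more line breaks.
def paragraphs(ct)
  ct.gsub(/\r\n?/, "\n").split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

# Assemble the page; the content is not escaped because it carries markup.
def to_html(ti, hl, ct)
  body = paragraphs(ct).map { |p| "<p>#{p.gsub("\n", "<br />")}</p>" }.join("\n")
  "<h1>#{h(ti)}</h1>\n<h2>#{h(hl)}</h2>\n#{body}"
end

puts to_html("Title", "Headline", "first\r\rsecond\rstill second")
```

Single line breaks inside a paragraph are kept as `<br />`, which is a guess at what the Flash renderer does with them.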
Update 2009-04-23: One of today's news items brought along a list started with 'li' and terminated by '/lI'. This seems to suggest that whatever they are using for the Flash site is case insensitive, while XML is not. So during processing, it would be advisable to generally lowercase all tag names.
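A small sketch of that normalization step, with `lowercase_tags` being a name made up here. It only touches the tag name right after the opening bracket and leaves attributes and text alone:

```ruby
# Lowercase the name of every opening and closing tag in the markup.
def lowercase_tags(markup)
  markup.gsub(/<(\/?)([A-Za-z][A-Za-z0-9]*)/) { "<#{$1}#{$2.downcase}" }
end

puts lowercase_tags("<li>A\nB</lI>")
```

Running this before any list or paragraph handling means the later steps only have to deal with one spelling of each tag.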
Update 2010-05-21: The markup allows some things that are tricky to convert to HTML, e.g., runs of bold text spanning paragraphs, lists, and other "block" elements. When parsing the markup, you technically have to make sure you split the text-level formatting, closing it at the end of a generated element, and re-opening it in the next. Seeing how ugly that is handled in the MediaWiki parser, I'm not inclined to touch that.
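For what it's worth, the close-and-reopen idea can be sketched in a few lines if the text has already been split into paragraphs. The tag names b, i and u are an assumption on my part (I have not catalogued which text-level tags the feed really uses), and `INLINE` and `balance_paragraphs` are invented names. This does not attempt to repair badly nested tags.

```ruby
# Assumed set of text-level tags; adjust to whatever the feed actually uses.
INLINE = /<(\/?)(b|i|u)>/i

# Close any still-open inline tags at the end of each paragraph and
# re-open them at the start of the next one.
def balance_paragraphs(paragraphs)
  stack = []
  paragraphs.map do |para|
    prefix = stack.map { |t| "<#{t}>" }.join
    para.scan(INLINE) do |slash, name|
      name = name.downcase
      if slash.empty?
        stack.push(name)
      else
        idx = stack.rindex(name)
        stack.delete_at(idx) if idx
      end
    end
    suffix = stack.reverse.map { |t| "</#{t}>" }.join
    "#{prefix}#{para}#{suffix}"
  end
end

puts balance_paragraphs(["<b>one", "two</b> three"])
```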
Postprocessing in Ruby
Given a message body, a single level of lists can be easily converted to MediaWiki markup by the following construct:
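# let input be the raw CDATA
# /i because the tag case is not reliable (see the 2009-04-23 update)
input.gsub!(/<li>(.*?)<\/li>/mi) { |m|
  # The block has to return a String, so join the items explicitly
  # (Array#to_s stopped concatenating with Ruby 1.9). Blank lines are dropped.
  $1.split(/[\r\n]+/).map { |i| i.strip }.reject { |i| i.empty? }.map { |i| "* #{i}\n" }.join
}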
The output is suitable for processing by MediaCloth.
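For the nested lists mentioned earlier, a slightly less naive approach is to walk over the li tags and keep track of the nesting depth. The following is only a sketch under the assumption that list items are always separated by line breaks and that li tags carry no attributes; `lists_to_wiki` is a name made up for this example.

```ruby
# Convert (possibly nested) <li>...</li> blocks to MediaWiki list markup.
# Text outside of any list is passed through unchanged.
def lists_to_wiki(input)
  depth = 0
  out = +""
  # The capture group makes split keep the tags as separate tokens.
  input.split(/(<\/?li>)/i).each do |tok|
    case tok
    when /\A<li>\z/i   then depth += 1
    when /\A<\/li>\z/i then depth = [depth - 1, 0].max
    else
      if depth.zero?
        out << tok
      else
        tok.split(/[\r\n]+/).each do |line|
          line = line.strip
          out << "#{'*' * depth} #{line}\n" unless line.empty?
        end
      end
    end
  end
  out
end

print lists_to_wiki("<li>A\n<li>Aa\nAb</li>\nB</li>")
# prints:
# * A
# ** Aa
# ** Ab
# * B
```

Since the tag matching is case insensitive, this also copes with the mixed-case closing tag from the 2009-04-23 update.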
By Shadowdancer, 2009-04-03, 12:27
Last updated at 2010-05-21, 10:37 by Shadowdancer