From MediaWiki to XWiki part I

Published by Patrick on

As announced in our latest newsletter, we’re moving our internal wiki from MediaWiki to XWiki, primarily because MediaWiki lacks fine-grained permission handling.

XWiki uses so-called “Spaces” to separate content on different topics in its wiki. A page belongs to exactly one space, but you’re free to link between spaces. Access rights can be granted or denied per page and per space, either for a single user or for a whole group.
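For example (using XWiki’s Space.Page dot notation; the names here are just placeholders), a page named Budget living in a Finance space is addressed as Finance.Budget, and a link written as [Finance.Budget] reaches it from any other space.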

After our move to XWiki, we will have several public spaces for development, general information, etc. and some restricted spaces like finances.

The Plan

The move will take place in two phases:

  1. Export / Conversion to the new markup
  2. Import and assign the spaces

We’ve evaluated the following options to export/convert our pages to XWiki:

  • Move all pages by hand
  • Use one or more regular expressions to convert the output of Special:Export (a big XML document with ugly CDATA sections)
  • Transform the HTML page using XSLT to the XWiki markup
  • Use a dialect plugin to HTML::WikiConverter

Moving all our pages by hand was, of course, out of the question. The regular-expression option got canned because it would have been a one-time solution and we would still have had to fetch every page from MediaWiki manually.

Transforming the HTML pages using XSLT would have been a viable solution, but extending something that already exists (HTML::WikiConverter) was more appealing because it lets us give something useful back to the community.

Overview

Let’s take a bird’s-eye look at our solution. We’ve written two scripts, one for each phase:

wikifetch.pl
A Perl script that uses the HTML::WikiConverter Perl module to convert a single HTML page to XWiki markup (using the XWiki dialect plugin I wrote for this migration).
import.groovy
A Groovy script that bulk-imports all pages into a given space: the files written by wikifetch.pl are matched by a regular expression and stored in that space.

Export / Conversion

HTML::WikiConverter lacked XWiki support, but that was easily cured (getting it onto CPAN was another issue). Encountering Perl for the first time wasn’t as scary as I thought it would be, and after working with it for a while you start to like the possibilities of compressing multiple lines of code into one short line. (That is one damned slippery slope, though. –ed.)

But HTML::WikiConverter was made for converting single pages. That’s where wikifetch.pl comes into play.

wikifetch.pl

This script takes a working set of wiki page names from a file (pending.txt), then downloads and converts them to XWiki markup. After that, it extracts all internal links and puts them onto the working stack. The resulting XWiki pages are stored in an output directory, ready for the import.

In the following section, I’ll talk about the details of the implementation. If you don’t want to be bothered with that, just skip ahead to the utilization section.

Implementation

First we have the usual Perl module initialization:

package main;

use warnings;
use strict;

use HTML::WikiConverter;
use HTML::WikiConverter::XWiki;
use Data::Dumper; 
use LWP::Simple;
use URI;

To identify which references link to other wiki pages, we need to know the wiki URI:

my $wiki_rel_uri = "/index.php/";
my $wiki_uri = 'http://wiki'.$wiki_rel_uri;

The next few variables hold our working stack. Variables prefixed with ‘%’ are hashes (the associative arrays you know from your ADT classes); the ones prefixed with ‘@’ are arrays.

my %links = ();
my @pending_pages = ();
my %page_is_pending = ();
my %done_pages = ();

MediaWiki emits tons of elements that we neither need nor want in our resulting XWiki markup, so we define a hash mapping attribute content to attribute name. The first entry, for example, causes the removal of every HTML tag whose class attribute is set to ‘editsection’ (i.e. <… class="editsection" …>).

my %tags_toRemove = ( 'editsection' => 'class',
                      'toc' => 'class',
                      'column-one' => 'id',
                      'jump-to-nav' => 'id',
                      'siteSub' => 'id',
                      'printfooter' => 'class',
                      'footer' => 'id'
                    );
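
As an aside (this snippet is not part of wikifetch.pl), here is a minimal sketch of how one such entry translates into a removal, using HTML::TreeBuilder directly on a made-up fragment:

use HTML::TreeBuilder;

# parse a hypothetical fragment containing a MediaWiki edit link
my $tree = HTML::TreeBuilder->new_from_content(
  '<div><span class="editsection">[edit]</span><p>Real content</p></div>' );

# look_down( attribute, value ) finds matching elements; delete removes them from the tree
$_->delete for $tree->look_down( 'class', 'editsection' );

print $tree->as_HTML, "\n";   # the [edit] span is gone, the paragraph stays
$tree->delete;                # free the tree when done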

The following variable contains a regexp that matches on all extensions that we don’t want to process (images & documents):

my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';
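
Just to illustrate the filter (the file names below are hypothetical, and this snippet is not part of the script): Perl happily uses the string as a pattern, so anything ending in one of the listed extensions gets skipped.

print "skipped\n"   if     'Company_Logo.png' =~ /$binformat_filters/;   # ends in .png, so it matches
print "processed\n" unless 'Backup_Howto'     =~ /$binformat_filters/;   # no binary extension, so it is kept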

The next line is the first one in the script that actually executes something:

my $wc = HTML::WikiConverter->new(
  dialect => 'XWiki',
  wiki_uri => $wiki_rel_uri,
  preprocess => \&_preprocess,
  space_identifier => 'MySpacePlaceholder'
);

We create an instance of the WikiConverter with the XWiki dialect and give it our URI (needed to determine whether a link is in fact a wiki link). The next parameter is a reference to our _preprocess function, which removes extra elements from the HTML tree that would otherwise clutter our output (like MediaWiki navigation elements). space_identifier is an attribute introduced by HTML::WikiConverter::XWiki and defines the space prefix prepended to all links emitted to the resulting file.
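
As a quick sanity check (this snippet is not part of the script, and the exact markup it prints depends on the dialect version you have installed), html2wiki also accepts inline HTML instead of a URI:

# convert a small hand-written fragment and inspect the result
my $sample = $wc->html2wiki(
  html => '<p>See <a href="/index.php/Backup_Howto">the backup howto</a></p>' );
print "$sample\n";   # expect an XWiki-style link carrying the MySpacePlaceholder prefix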

The next two lines of the script, though in Perl, should be self-explanatory:

# read pending pages from my config-file
_read_config();

# creating output directory
mkdir( "output" );

We’re slowly approaching the main processing loop of our Perl script:

01. while( scalar( @pending_pages ) > 0 ) {
02.   %links = ();
03.   my $page = shift( @pending_pages );
04.   _process_wiki_page( $page );
05.   
06.   # accounting
07.   $done_pages{ $page } = 1;  
08.   delete( $page_is_pending{ $page } );
09.   
10.   # check for new pages
11.   map { print "New page '$_'\n"; 
12. 	    push( @pending_pages, $_ );
13. 	    $page_is_pending{ "$_" } = 1; 
14.       } grep {   # keep only pages that are not already done or pending, and are non-empty
15.                   $_ if (not ((exists $done_pages{ "$_" }) or (exists $page_is_pending{ "$_" }))) and ($_ !~ '^$')
16.               } keys %links;
17.   my $numDone = scalar(keys %done_pages);
18.   my $numTotal = $numDone + scalar(@pending_pages);
19.   print "Progress: $numDone / $numTotal\n";
20. }

I won’t go into the details of the above; those of you who are Perl-literate should be able to read it.

We get a page from our pending_pages array (line 3) and send it to our main processing sub (everything is a sub in Perl, or so I’ve been told). After processing, we mark the page as done (line 7) and remove it from the pending hash (line 8). The reason for having both a pending hash and a pending array is that we don’t have to search the whole array to check whether a single page is already queued; that’s what hashes are for.
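
A hypothetical check (not part of the script) makes that trade-off concrete: testing the hash is a constant-time lookup, whereas finding the same page in @pending_pages would mean scanning the array.

# O(1) membership test via the hash
if( exists $page_is_pending{ 'Main_Page' } ) {
  print "Main_Page is already queued\n";
}

# the equivalent array scan we avoid
if( grep { $_ eq 'Main_Page' } @pending_pages ) {
  print "found it, but only after walking the whole array\n";
}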

Lines 11 to 16 are actually written in the tongue of Mordor; the sound of these words should not be uttered here. After calling _process_wiki_page (which in due course calls _preprocess), all links found in the page just processed are stored in the links hash. We iterate over this hash and push every page that is neither processed nor pending onto the end of our processing array.

It’s now time to generate some statistics for the user. Lines 17-19 do that and print it to the command-line (scalar( xy ) returns an integer representing the element count).

Now that we’re done with the above code snippet, let’s dive into our subroutines. The first one reads all pending pages (one page name per line) from a file called pending.txt. Nothing fancy about it.

sub _read_config {
  print "Reading config…\n";
  @pending_pages = ();
  open FILE, "<", "pending.txt" or die "Could not open pending.txt: $!";
  while( <FILE> ) {
    chomp;                          # strip the trailing newline from the page name
    next if $_ eq '';               # ignore empty lines
    push( @pending_pages, $_ );
    $page_is_pending{ $_ } = 1;
  }
  close FILE;
  print "Pending pages:\n";
  print join( "\n", @pending_pages ), "\n";

  print "Done reading config\n";
}


In _process_wiki_page, we create the output-file for our XWiki markup and start the actual processing:

sub _process_wiki_page {
  my ( $page_name_orig ) = @_;

  open FILE, ">output/$page_name_orig" or die "Could not create file output/$page_name_orig: $!";
  my $page_name = "$wiki_uri"."$page_name_orig";

  print "Fetching/processing: $page_name\n";
  my $wiki_text = $wc->html2wiki( uri => $page_name );
  print FILE $wiki_text;
  close FILE;

  # check page_translations for the space to put the file into… mkdir on that name and save the file there for uploading…
  print "Processed…\n";
}

Last but not least, we have the _preprocess function. This is called just after HTML::WikiConverter has parsed the input file. The argument is an HTML::Tree object.

sub _preprocess {
  my( $tb ) = @_;

The next lines remove all unwanted MediaWiki nodes (as mentioned above, using the tags_toRemove hash):

  #delete all tags below our root node, identified by %tags_toRemove 
  #(e.g. remove all elements with the class-attribute set to 'editsection')
  map { $_->delete; } map { $tb->look_down( $tags_toRemove{ $_ }, $_ ) } keys %tags_toRemove;

After the tree has been cleansed, we go after the links (<a/> tags). Those have to be non-empty, must not point to special pages, must not have a binary extension, and should link into our wiki.

  # search for a tags, beginning with the wiki url and set these keys (minus the url-part) to 1 in our link hash
  map {
        $_ =~ s/#(.*)//; 
        $links{ $_ } = 1; 
      } 
      grep {   # non-empty, no special pages, no binary extension, must be local (and the local prefix is stripped)
                defined( $_ ) and $_ !~ '^$' and $_ !~ '(Special|Image|Help):' and $_ !~ $binformat_filters and $_ =~ /^$wiki_rel_uri/ and $_ =~ s/$wiki_rel_uri// 
           } map { 
                   $_->attr( 'href' )
                 } $tb->look_down( _tag => 'a' );

What’s left is to escape some special characters (this will eventually be moved to HTML::WikiConverter::XWiki):

  # escape backslashes and square brackets in all text nodes,
  # but leave anything inside <pre> blocks untouched
  foreach my $node ( $tb->descendants ) {
    if( !$node->look_up( _tag => 'pre' ) ) {
      my $txt = $node->attr('text') || '';
      $txt =~ s/\\/\\\\\\/g;
      $txt =~ s/\[/\\[/g;
      $txt =~ s/\]/\\]/g;
      $node->attr( 'text', $txt );
    }
  }
}

…and we’re done. Phew.

Utilization

To start converting your existing MediaWiki content, execute the following steps:

  1. Download wikifetch.pl.
  2. Install the required CPAN modules with perl -MCPAN -e 'install HTML::WikiConverter::XWiki'
  3. Edit the wiki URI ($wiki_uri / $wiki_rel_uri) of your MediaWiki inside wikifetch.pl
  4. Add Main_Page to pending.txt (a sample file is sketched below)
  5. Execute perl wikifetch.pl
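
For illustration, such a pending.txt might look like this (the page titles below are made up; use titles from your own wiki, one per line, exactly as they appear in the MediaWiki URL):

Main_Page
Development_Guidelines
Backup_Howto

The script will discover the rest of your pages on its own by following internal links.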

Now you should have a folder named output containing your wiki-content. You can either add these pages to XWiki by hand … or wait for my next article to import the pages automatically.

Finally

The result of wikifetch.pl is an output-directory consisting of files in the XWiki markup. In the next article we’ll learn how to get those files into our XWiki.

Comments

2 Replies

#2 − Also kind of lacking Perl knowledge…

Marco

…I’m not the guy who originally wrote this article, but I’ll do what I can! ;-)

The complaint that the “request for <http://…valid url> failed” seems to indicate a url-resolution problem, but it’s hard to tell because the error message doesn’t specify the exact problem.

  • Are you sure that the uri is valid?
  • Can you ping it?
  • Are you sure that the HTML::WikiConverter plugin and dialect are installed?

#3 − Well…

Patrick

…as the guy who wrote it: the URI printed out by line 112 (Fetching/processing…) is the URI that will actually be fetched.

Some additional things that could go wrong:

  • The wiki is password-protected. If so, you might get around the issue by setting your URI to http://user:password@thesite.com
  • The server redirects your browser after accessing the URI. I don’t know if Perl’s HTTP library handles that.