From MediaWiki to XWiki part I
Published by Patrick on 15. Mar 2007 12:43:33
------------------------------------------------------------------------

As announced in our "latest newsletter", we're moving our internal Wiki from
"MediaWiki" to "XWiki", due primarily to MediaWiki's lack of fine-grained
permission handling.

XWiki uses so-called "Spaces" to separate content on different topics in its
Wiki. A page belongs to one such space, but you're free to link between those
spaces. You can grant or deny access rights per page and per space.  These
access rights can restrict a single user or a whole group.

After our move to XWiki, we will have several public spaces for development,
general information, etc. and some restricted spaces like finances.

[The Plan]

The move will take place in two phases:

   1. Export / Conversion to the new markup
   2. Import and assign the spaces

We've evaluated the following options to export/convert our pages to XWiki:

  * Move all pages by hand
  * Use one/many RegExp to convert the output of SpecialPages:Export (big XML
    document with ugly CDATA sections)
  * Transform the HTML page using XSLT to the XWiki markup
  * Use a dialect plugin to HTML::WikiConverter

Moving all our pages by hand was, of course, out of the question. The RegExp
option got canned as this would be a one-time solution and you'd have to
manually fetch all pages via MediaWiki.

Transforming the HTML page using XSLT would have been a viable solution but
extending something existing (HTML::WikiConverter) was more appealing because we
could give the community something useful back.

[Overview]

Let's have an overhead look at our solution. We've written two scripts to
implement our two phases:

wikifetch.pl

   A Perl script that utilizes the HTML::WikiConverter Perl module to convert a
   single HTML page to the XWiki markup (using my "XWiki dialect plugin",
   written to achieve this move).

import.groovy

   A Groovy script that bulk-imports all pages into a given space. The pages
   written by wikifetch.pl are matched by a regular expression and stored to a
   given space.


[Import]

HTML::WikiConverter lacked XWiki support, but that was easily cured (committing
it to CPAN was another "issue"). Encountering Perl for the
first time wasn't as scary as I thought it would be. And after working with it
for some time, you'll like the possibilities of compressing multiple lines of
code into one small line. (that is one damned slippery slope, though. --ed.)

But HTML::WikiConverter was made for converting single pages. That's where
wikifetch.pl comes into play.

[wikifetch.pl]

This script takes a working-set of Wiki page-names from a file (pending.txt),
then downloads & converts them to the XWiki markup. After that, it extracts all
internal links and puts them onto the working-stack. The resulting XWiki pages
are stored in an output directory, ready for the import.

In the following section, I'll talk about the details of the implementation. If
you don't want to be bothered with that, just skip ahead to the
"Utilization" section.

[Implementation]

First we have the usual Perl module initialization:

package main;

use warnings;
use strict;

use HTML::WikiConverter;
use HTML::WikiConverter::XWiki;
use Data::Dumper; 
use LWP::Simple;
use URI;

To identify which references are linking to other Wiki pages we'll need to know
the wiki uri:

my $wiki_rel_uri = "/index.php/";
my $wiki_uri = 'http://wiki'.$wiki_rel_uri;

The next few variables will hold our working-stack. Variables prefixed with '%'
are hashes (the ones you know from your ADT classes). The ones with an '@' in
front of them are arrays.

my %links = ();
my @pending_pages = ();
my %page_is_pending = ();
my %done_pages = ();

MediaWiki has tons of elements that we neither need nor want to have in our
resulting XWiki markup. So we're defining a hash that maps an attribute value
to the name of the attribute it belongs to. The first line will cause the
removal of all HTML tags with an attribute 'class' with the content
'editsection' (<.. class="editsection" ../>):

my %tags_toRemove = ( 'editsection' => 'class',
                      'toc' => 'class',
                      'column-one' => 'id',
                      'jump-to-nav' => 'id',
                      'siteSub' => 'id',
                      'printfooter' => 'class',
                      'footer' => 'id'
                    );
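
To see what such a hash does to a page, here is a minimal sketch that applies a
shortened version of it to a tiny HTML snippet. It assumes HTML::TreeBuilder is
installed (HTML::WikiConverter depends on it); the HTML snippet and the
two-entry hash are made up for illustration and are not part of wikifetch.pl.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened, illustrative version of the hash: attribute value => attribute name.
my %tags_toRemove = ( 'editsection' => 'class',
                      'footer'      => 'id' );

# Made-up stand-in for a rendered MediaWiki page.
my $html = '<html><body><div id="content">'
         . '<h2>Title <span class="editsection">[edit]</span></h2>'
         . '<p>Body text</p></div>'
         . '<div id="footer">Powered by MediaWiki</div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content( $html );

# For every entry, find all elements whose attribute has that value and
# delete them (including their children) from the tree.
for my $value ( keys %tags_toRemove ) {
    my $attr = $tags_toRemove{ $value };
    $_->delete for $tree->look_down( $attr, $value );
}

# Only the text of the surviving elements is left.
my $text = $tree->as_text;
print "$text\n";
$tree->delete;
```

One consequence of keying the hash by attribute value: each value can appear
only once, so the same value cannot be listed for both 'class' and 'id'.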

The following variable contains a regexp that matches on all extensions that we
don't want to process (images & documents):

my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';
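
As a quick check of what this pattern accepts and rejects, here is a small
sketch. The helper is_binary_link and the sample names are hypothetical and
only serve the demonstration.

```perl
use strict;
use warnings;

# Same pattern as in wikifetch.pl.
my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';

# Hypothetical helper: 1 if the link target ends in one of the filtered
# extensions, 0 otherwise.
sub is_binary_link {
    my ( $href ) = @_;
    return $href =~ $binformat_filters ? 1 : 0;
}

# Made-up sample names.
for my $href ( 'Main_Page', 'Logo.png', 'Specs.odt', 'Release.zip.notes' ) {
    printf "%-20s %s\n", $href, is_binary_link( $href ) ? 'skipped' : 'kept';
}
```

Note that the match is anchored to the end of the name and is case-sensitive,
so a file called Photo.JPG would slip through unless you add the /i modifier.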

The next line is the first that actually executes something:

my $wc = new HTML::WikiConverter(
  dialect => 'XWiki',
  wiki_uri => $wiki_rel_uri,
  preprocess => \&_preprocess,
  space_identifier => 'MySpacePlaceholder'
);

We'll create an instance of the WikiConverter with the dialect XWiki, then give
it our URI (needed to determine if a link is in fact a wiki-link). The next
parameter is a reference to our _preprocess function. This preprocess function
will remove extra elements from the HTML-Tree that will clutter our output (like
MediaWiki navigation elements). The space_identifier is an attribute introduced
by HTML::WikiConverter::XWiki and defines the space-prefix, prepended to all
links emitted to the resulting file.

The next two lines, though in Perl, should be self-explanatory:

# read pending pages from my config-file
_read_config();

# creating output directory
mkdir( "output" );

We're slowly approaching the main processing loop of our Perl script:

01. while( scalar( @pending_pages ) > 0 ) {
02.   %links = ();
03.   my $page = shift( @pending_pages );
04.   _process_wiki_page( $page );
05.   
06.   # accounting
07.   $done_pages{ $page } = 1;  
08.   delete( $page_is_pending{ $page } );
09.   
10.   # check for new pages
11.   map { print "New page '$_'\n"; 
12.         push( @pending_pages, $_ );
13.         $page_is_pending{ "$_" } = 1; 
14.       } grep {  # not already done or in progress, and non-empty
15.                   $_ if (not ((exists $done_pages{ "$_" }) or (exists $page_is_pending{ "$_" }))) and ($_ !~ '^$')
16.               } keys %links;
17.   my $numDone = scalar(keys %done_pages);
18.   my $numTotal = $numDone + scalar(@pending_pages);
19.   print "Progress: $numDone / $numTotal\n";
20. }

I won't go into every detail of the above; those of you who are Perl-literate
should be able to read it.

We get a page from our pending_pages array (line 3) and send it to our main
processing sub (everything is a sub in Perl, that's what I've been told). After
processing we mark the page as done (line 7) and remove it from the pending
hash. The reason for having a pending hash and a pending array is so that we
don't have to search the whole array for a single page. That's what hashes are
for.
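
The array-plus-hash bookkeeping can be tried out without any network access.
The following is a minimal sketch, not the actual wikifetch.pl code: the
%link_graph hash is a made-up stand-in for "the links found on a page", and
crawl is a hypothetical helper name.

```perl
use strict;
use warnings;

# Made-up link graph: page name => pages it links to.
my %link_graph = (
    'Main_Page'   => [ 'Development', 'Finances' ],
    'Development' => [ 'Main_Page', 'Build_Howto' ],
    'Finances'    => [],
    'Build_Howto' => [ 'Development' ],
);

# Hypothetical helper: visits every reachable page exactly once and
# returns the pages in the order they were processed.
sub crawl {
    my ( $graph, @start ) = @_;
    my @pending    = @start;                   # keeps the processing order
    my %is_pending = map { $_ => 1 } @start;   # fast "already queued?" lookup
    my %done;                                  # fast "already processed?" lookup
    my @order;

    while ( @pending ) {
        my $page = shift @pending;
        push @order, $page;
        $done{ $page } = 1;
        delete $is_pending{ $page };

        # Queue every linked page that is neither done nor already waiting.
        for my $link ( @{ $graph->{ $page } || [] } ) {
            next if $done{ $link } or $is_pending{ $link };
            push @pending, $link;
            $is_pending{ $link } = 1;
        }
    }
    return @order;
}

my @order = crawl( \%link_graph, 'Main_Page' );
print join( ', ', @order ), "\n";
```

Because pages are taken from the front of the array and new ones are appended
at the end, the pages closest to the start page are converted first.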

Lines 11 to 16 are actually written in the tongue of Mordor; the sound of these
words should not be uttered here. After calling _process_wiki_page, which in due
course will call _preprocess, all links found in the page just processed are
stored in the links hash. We iterate over this hash and push all pages that are
neither processed nor pending onto the end of our processing array.

It's now time to generate some statistics for the user. Lines 17-19 do that and
print it to the command-line (scalar( xy ) returns an integer representing the
element count).

Now that we're done with the above code snippet, we'll dive into our
subroutines. The first one reads all pending pages (one per line) from a file
called pending.txt. Nothing fancy about it.

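sub _read_config {
  print "Reading config...\n";
  @pending_pages = ();
  open FILE, "<pending.txt" or die "Could not open pending.txt";
  while( <FILE> ) {
    chomp;               # strip the trailing newline from the page name
    next if $_ eq '';    # skip empty lines
    push( @pending_pages, $_ ); 
    $page_is_pending{ $_ } = 1;
  }
  close FILE;
  print "Pending pages:\n";
  print join( "\n", @pending_pages ), "\n";
  
  print "Done reading config\n";
}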

  
In _process_wiki_page, we create the output-file for our XWiki markup and start
the actual processing:

sub _process_wiki_page {
  my ( $page_name_orig ) = @_;
  
  # 'or' instead of '||': with '||' the die would bind to the file name string
  # and never fire
  open FILE, ">output/"."$page_name_orig"
    or die "Could not create file output/$page_name_orig";
  my $page_name = "$wiki_uri"."$page_name_orig";

  print "Fetching/processing: $page_name\n";
  my $wiki_text = $wc->html2wiki( uri => $page_name );
  print FILE $wiki_text;
  close FILE;

  # check page_translations for the space to put the file into... mkdir on that
  # name and save the file there for uploading...
  print "Processed...\n";
}

Last but not least, we have the _preprocess function. This is called just after
HTML::WikiConverter has parsed the input-file. The argument is an HTML::Tree
object.

sub _preprocess {
  my( $tb ) = @_;

The next lines remove all unwanted MediaWiki nodes (as mentioned above, using
the tags_toRemove hash):

  #delete all tags below our root node, identified by %tags_toRemove 
  #(e.g. remove all elements with the class-attribute set to 'editsection')
  map { $_->delete; } map { $tb->look_down( $tags_toRemove{ $_ }, $_ ) } keys
%tags_toRemove;

After the tree has been cleansed, we go after the links (<a>-tags). Their
targets have to be non-empty, must not point to a special page or a file with a
binary extension, and should link into our Wiki.

  # search for <a> tags whose href begins with the wiki url and set these keys
  # (minus the url-part) to 1 in our link hash
  map {
        $_ =~ s/#(.*)//; 
        $links{ $_ } = 1; 
      } 
      grep {
             # non-empty, no special pages, no binary files, has to be local;
             # the final substitution removes the local part
             defined( $_ ) and $_ !~ '^$' and $_ !~ '(Special|Image|Help):'
               and $_ !~ $binformat_filters and $_ =~ /^$wiki_rel_uri/
               and $_ =~ s/$wiki_rel_uri// 
           } map { 
                   $_->attr( 'href' )
                 } $tb->look_down( _tag => 'a' );

What's left is to escape some special characters (this will eventually be moved
to HTML::WikiConverter::XWiki):

  foreach my $node ( $tb->descendants ) {
    if( !$node->look_up( _tag => 'pre' ) ) {
		my $txt = $node->attr('text') || '';
		$txt =~ s/\\/\\\\\\/g;
		$txt =~ s/\[/\\[/g;
		$txt =~ s/\]/\\]/g;
		$node->attr( 'text', $txt );
    }
  }
}

...and we're done. Phew.
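
If the grep/map chain in _preprocess feels too dense, the same link filter can
be written as a plain subroutine and tried on a few strings. This is a sketch
under assumptions: page_from_href is a hypothetical name, the sample hrefs are
made up, and unlike the original it quotes the URI prefix with \Q...\E so that
the dot in index.php is matched literally.

```perl
use strict;
use warnings;

my $wiki_rel_uri      = '/index.php/';
my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';

# Hypothetical helper: returns the page name for an internal wiki link,
# or undef if the link should be ignored.
sub page_from_href {
    my ( $href ) = @_;
    return undef unless defined $href and length( $href );   # non-empty
    return undef if $href =~ /(Special|Image|Help):/;         # no special pages
    return undef if $href =~ $binformat_filters;              # no binary files
    return undef unless $href =~ s/^\Q$wiki_rel_uri\E//;      # has to be local
    $href =~ s/#.*//;                                         # drop the anchor
    return length( $href ) ? $href : undef;
}

# Made-up hrefs as they might appear on a MediaWiki page.
my %links;
for my $href ( '/index.php/Development#Setup',
               '/index.php/Special:Recentchanges',
               '/index.php/Specs.odt',
               'http://www.example.org/',
               '/index.php/Finances' ) {
    my $page = page_from_href( $href );
    $links{ $page } = 1 if defined $page;
}
print join( "\n", sort keys %links ), "\n";
```

Only Development and Finances end up in the hash; the other three hrefs are
rejected by one of the conditions.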

[Utilization]

To start converting your existing MediaWiki, execute the following steps:

   1. "Download wikifetch.pl".
   2. Install the required CPAN modules with
      perl -MCPAN -e 'install HTML::WikiConverter::XWiki'
   3. Edit the base URI of your MediaWiki ($wiki_uri) inside wikifetch.pl
   4. Add Main_Page to pending.txt
   5. Execute perl wikifetch.pl

Now you should have a folder named output containing your wiki-content. You can
either add these pages to XWiki by hand ... or wait for my next article to
import the pages automatically.

[Finally]

The result of wikifetch.pl is an output-directory consisting of files in the
XWiki markup. In the next article we'll learn how to get those files into our
XWiki.