Importing MediaWiki pages into a WordPress blog

Jim Groom and others have been getting excited lately about tools and techniques for integrating blogs and wikis. A while ago we worked on a one-off script to bring the content of a wiki directly into a WordPress blog. The idea is similar to the XML separation of content and display--wikis are fantastic as a collaborative writing space, but don't have the robustness for presentation that WordPress has. So, we worked on a quick-and-dirty way to bring content from the writing space of one wiki page and stuff it into WordPress, treating WordPress as the presentation space.

Success was mixed across different wikis, but we thought we'd throw it out to the world to see who has done similar things, and especially whether anyone wants to give me a lesson or two on PHP. The first step was to install the exec-PHP plugin for WordPress, which lets you insert PHP code into your post. (Drupal users have it easier--just use PHP code for the input format.) Here's the code we used:




 $page2grab = "http://bavatuesdays.com/bavawiki/index.php?title=Bavawiki_home"; //the mediawiki page

 $baseDir = "http://bavatuesdays.com/bavawiki/"; //to help sort out internal links
 $curl = curl_init();
 curl_setopt($curl, CURLOPT_URL, $page2grab);
 curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
 
 $page = curl_exec($curl);
 curl_close($curl);
 $page = str_replace('&nbsp;', ' ', $page); //kills the nbsp entity--make sure the visual editor doesn't clobber it
 $page = strstr($page, '<'); //kills the wacky feff character at the beginning of some page grabs

 $wppageDOM = new DOMDocument();
 
 //load as HTML (not XML) to be less rigorous about validity
 //a more sophisticated script would run it through Tidy?
 $wppageDOM->loadHTML($page); 


 $divs = $wppageDOM->getElementsByTagName('div');
 $divs2kill = array();
 //go through and collect unwanted divs like headers, footers, etc.
 //this would probably be much better done with XPaths
 foreach ($divs as $div) {
 	$id = $div->getAttribute('id');
 	if ($id == 'content') {
 		$contentDiv = $div;
 	}
 	//queue removals instead of removing here: $divs is a live list,
 	//and removing nodes while looping over it can skip elements
 	if ($id == 'jump-to-nav' || $div->getAttribute('class') == 'editsection') {
 		$divs2kill[] = $div;
 	}
 	$div->removeAttribute('id');
 }

 foreach ($divs2kill as $div) {
 	$div->parentNode->removeChild($div);
 }
  
  //bodyContentDOM will be the content of the post in WordPress
  $bodyContentDOM = new DOMDocument();
 
 
 
 $imported = $bodyContentDOM->importNode($contentDiv, true);
 $bodyContentDOM->appendChild($imported);
 
 
 //try to make relative paths to images absolute
 $imgs = $bodyContentDOM->getElementsByTagName('img');
 foreach ($imgs as $img) {
 	$origSrc = $img->getAttribute('src');
 	//only prefix paths that do not already start with http
 	if (strpos($origSrc, 'http') !== 0) {
 		$img->setAttribute('src', $baseDir . $origSrc);
 	}
 }
 
 //try to make relative paths in links absolute
 $as = $bodyContentDOM->getElementsByTagName('a');
 foreach ($as as $a) {
 	$origHref = $a->getAttribute('href');
 	//skip named anchors (no href), in-page links, and already-absolute URLs
 	if ($origHref !== '' && $origHref[0] !== '#' && strpos($origHref, 'http') !== 0) {
 		$a->setAttribute('href', $baseDir . $origHref);
 	}
 }

//spit the HTML out into the post
 echo $bodyContentDOM->saveHTML();
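
The comments in the script wonder whether XPaths would be a better way to pick out the content div and drop the unwanted bits. Here is a rough sketch of what that might look like. The function name `extractContentById`, its parameters, and the sample page are made up for illustration and were not part of the script we actually ran. It also assumes the id passed in contains no quote characters.

```php
<?php
// Hypothetical helper (not from the original script): pull one div out of an
// HTML page by id and drop unwanted descendants, using XPath queries.
function extractContentById($html, $contentId, array $removeQueries)
{
    // Collect libxml warnings about sloppy HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $content = $xpath->query('//div[@id="' . $contentId . '"]')->item(0);
    if ($content === null) {
        return null; // no div with that id on this page
    }

    foreach ($removeQueries as $query) {
        // Unlike getElementsByTagName(), query() returns a static list,
        // so removing nodes while looping over it does not skip any.
        foreach ($xpath->query($query, $content) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize just the content div (needs PHP 5.3.6 or later).
    return $dom->saveHTML($content);
}

// Made-up sample page standing in for a fetched wiki page.
$sample = '<html><body><div id="header">Site header</div>'
    . '<div id="content"><div id="jump-to-nav">Jump to: navigation</div>'
    . '<div class="editsection">[edit]</div><p>Hello wiki</p></div>'
    . '</body></html>';

echo extractContentById($sample, 'content', array(
    './/div[@id="jump-to-nav"]',
    './/div[@class="editsection"]',
));
```

The queries are relative to the content div (they start with `.//`), so anything outside it, such as the header, is never touched and simply doesn't get serialized.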

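Part of the mixed success probably comes from how the script glues `$baseDir` onto every non-http path: a root-relative path like `/bavawiki/images/a.png` ends up with the wiki directory in it twice. A slightly more careful approach might look like the sketch below. `absolutizeUrl` and `$siteRoot` are names invented here, not anything from the original script, and the sketch makes no attempt to resolve `../` segments.

```php
<?php
// Hypothetical helper (not from the original script): make a link or image
// path absolute. $siteRoot is scheme plus host with no trailing slash;
// $baseDir is the wiki directory with a trailing slash.
function absolutizeUrl($url, $siteRoot, $baseDir)
{
    if ($url === '' || $url[0] === '#') {
        return $url; // empty value or in-page anchor: leave alone
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $url)) {
        return $url; // already has a scheme (http:, https:, mailto:, ...)
    }
    if (substr($url, 0, 2) === '//') {
        // protocol-relative: borrow the scheme from $siteRoot
        return parse_url($siteRoot, PHP_URL_SCHEME) . ':' . $url;
    }
    if ($url[0] === '/') {
        return $siteRoot . $url; // root-relative: prefix the host only
    }
    return $baseDir . $url; // relative to the wiki directory
}

$siteRoot = 'http://bavatuesdays.com';
$baseDir  = 'http://bavatuesdays.com/bavawiki/';
$examples = array('/bavawiki/images/a.png', 'index.php?title=Foo', '#top', 'http://example.org/x');
foreach ($examples as $u) {
    echo absolutizeUrl($u, $siteRoot, $baseDir), "\n";
}
```

In the script above, this would replace the `$baseDir . $origSrc` concatenation in both the image loop and the link loop.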

Thoughts, comments, howling ridicule at my code? :)

I'd like to do this, but on a mass scale with a downloaded XML file from a large MediaWiki site I have. I also want to strip the images (or download them). Think it's possible? Others are importing the XML files into Drupal, then importing from there into WordPress - that might work out, but it seems like such a pain when all you're doing is parsing an XML file and importing it into WordPress. WordPress has some nice importers anyway, right?

is explained at http://blog.unto.net/meta/the-little-machine/

Thanks for the tip!

Amazing what you can get done in a quiet workspace. :) Nice job Patrick.

Don't think I'm not keeping an eye on this you sly dog. Your name keeps popping up here for what we know you can do for us in the Bliki dream! Keep it up!!!
