[Discuss] What's the best site-crawler utility?

Daniel Barrett dbarrett at blazemonger.com
Tue Jan 7 21:47:37 EST 2014


On January 7, 2014, Richard Pieri wrote:
>Remember that I wrote how wikis have a spate of problems? This is the 
>biggest one. There's no way to dump a MediaWiki in a humanly-readable 
>form. There just isn't.

Erm... actually, it's perfectly doable.

For instance, you can write a simple script that hits Special:AllPages
(which links to every article on the wiki) and dumps each page to HTML
with curl or wget. (Special:AllPages displays only N links at a time,
so you may need to "page forward" to reach all of them.) Save each
page titled "Foo Bar" as "Foo_Bar.html". Then fix up the wiki links in
each HTML file with another simple script that tacks ".html" onto each
internal link (or I suppose you could leave the files alone and add
the ".html" dynamically with a little Apache trickery).
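Something along these lines would do it -- a rough sketch in Python
rather than a curl/wget shell loop. The wiki URL, the /wiki/ and
/index.php paths, and the link-matching regexes are assumptions about
a stock MediaWiki install with "pretty" article URLs, so adjust for
your site:

    #!/usr/bin/env python3
    # Rough sketch: dump a MediaWiki to static HTML via Special:AllPages.
    # Assumes "pretty" /wiki/ article URLs and /index.php for special
    # pages; WIKI is a placeholder, adjust for your installation.
    import os
    import re
    import urllib.parse
    import urllib.request

    WIKI = "https://wiki.example.com"   # hypothetical wiki base URL

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def all_titles():
        # Walk Special:AllPages, following its own pagination links.
        queue = [WIKI + "/index.php?title=Special:AllPages"]
        seen, titles = set(), set()
        while queue:
            url = queue.pop()
            if url in seen:
                continue
            seen.add(url)
            html = fetch(url).replace("&amp;", "&")
            # Article links look like href="/wiki/Foo_Bar"; skipping
            # titles with ":" keeps Special:, Help:, File: etc. out.
            titles.update(re.findall(r'href="/wiki/([^":?#&]+)"', html))
            for link in re.findall(
                    r'href="(/index\.php\?title=Special:AllPages[^"]+)"',
                    html):
                queue.append(urllib.parse.urljoin(WIKI, link))
        return sorted(titles)

    def dump(outdir="wikidump"):
        os.makedirs(outdir, exist_ok=True)
        for title in all_titles():
            html = fetch(WIKI + "/wiki/" + title)
            # Point internal links at the static copies:
            # /wiki/Foo_Bar -> Foo_Bar.html
            html = re.sub(r'href="/wiki/([^":?#&]+)"',
                          r'href="\1.html"', html)
            # Subpage titles contain "/"; flatten them for the filename.
            name = title.replace("/", "%2F") + ".html"
            with open(os.path.join(outdir, name), "w",
                      encoding="utf-8") as f:
                f.write(html)

    if __name__ == "__main__":
        dump()

Try it against a handful of pages first: the regexes are deliberately
naive (they'll also grab sidebar article links, which is harmless),
and a real crawl of a big wiki should throttle its requests.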

Heck, with a little work you can convert all the pages to PDFs using
either htmldoc (free) or Prince (commercial), if you prefer that
format for static docs.
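
For the htmldoc route, a loop like the following over the dumped files
is roughly what I mean. The --webpage/-f invocation is from memory, so
check the man page; Prince is similar ("prince Foo_Bar.html -o
Foo_Bar.pdf"), and "wikidump" is just the output directory from the
sketch above:

    import glob
    import subprocess

    # Turn each dumped page into a standalone PDF with htmldoc.
    for page in sorted(glob.glob("wikidump/*.html")):
        pdf = page[:-len(".html")] + ".pdf"
        subprocess.run(["htmldoc", "--webpage", "-f", pdf, page],
                       check=True)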

Hope this helps.

--
Dan Barrett
Author, "MediaWiki" (O'Reilly Media, 2008)
dbarrett at blazemonger.com


