
BLU Discuss list archive



[Discuss] What's the best site-crawler utility?



On January 7, 2014, Richard Pieri wrote:
>Remember that I wrote how wikis have a spate of problems? This is the 
>biggest one. There's no way to dump a MediaWiki in a humanly-readable 
>form. There just isn't.

Erm... actually, it's perfectly doable.

For instance, you can write a simple script to hit Special:AllPages
(which links to every article on the wiki), and dump each page to HTML
with curl or wget. (Special:AllPages displays only N links at a time,
so you may need to "page forward" to reach all links.) For each page
titled "Foo Bar," store it as "Foo_Bar.html".  Then fix up the wiki
links in each HTML file with another simple script that tacks ".html"
onto each URL (or I suppose you could leave the files alone and
dynamically add the ".html" via Apache trickery).

Heck, with a little work you can also convert all the pages to PDFs using
either htmldoc (free) or Prince (commercial), if you prefer that format
for static docs.
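
That conversion can be scripted too. A minimal loop over the dumped files
might look like the following; it assumes the wiki-dump/ directory from the
sketch above and htmldoc's --webpage and -f options (check your version's
flags, and Prince is invoked similarly):

import subprocess
from pathlib import Path

# Render each dumped HTML page to a PDF alongside it.
for html_file in Path("wiki-dump").glob("*.html"):
    subprocess.run(
        ["htmldoc", "--webpage",
         "-f", str(html_file.with_suffix(".pdf")), str(html_file)],
        check=True,
    )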

Hope this helps.

--
Dan Barrett
Author, "MediaWiki" (O'Reilly Media, 2008)
dbarrett at blazemonger.com





