BLU Discuss list archive

Subject: [Discuss] What's the best site-crawler utility?
From: dbarrett at blazemonger.com (Daniel Barrett)
Date: Tue, 07 Jan 2014 21:47:37 -0500
References: <52CC97ED.7030807@gmail.com>

On January 7, 2014, Richard Pieri wrote:
>Remember that I wrote how wikis have a spate of problems? This is the 
>biggest one. There's no way to dump a MediaWiki in a humanly-readable 
>form. There just isn't.

Erm... actually, it's perfectly doable.

For instance, you can write a simple script to hit Special:AllPages
(which links to every article on the wiki), and dump each page to HTML
with curl or wget. (Special:AllPages displays only N links at a time,
so you may need to "page forward" to reach all links.) For each page
titled "Foo Bar", store it as "Foo_Bar.html". Then fix up the wiki
links in each HTML file with another simple script that tacks ".html"
onto each URL (or I suppose you could leave the files alone and
dynamically add the ".html" via Apache trickery).
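
Here is a rough Python sketch of the save-and-fix-up steps, in case it
helps. It is only an illustration, not a tested tool. It assumes that
articles are served under a "/wiki/" path, which depends on how your
wiki is configured, and the names title_to_filename, fetch_page, and
add_html_suffix are made up for this example. Paging through
Special:AllPages to collect the titles is not covered here.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: articles are served at <base>/wiki/<Title_With_Underscores>.
# Adjust this prefix to match the wiki's actual URL layout.
ARTICLE_PREFIX = "/wiki/"


def title_to_filename(title):
    """Map a page title such as "Foo Bar" to "Foo_Bar.html"."""
    return title.strip().replace(" ", "_") + ".html"


def fetch_page(base_url, title):
    """Download the rendered HTML of one article (needs network access)."""
    path = ARTICLE_PREFIX + quote(title.strip().replace(" ", "_"))
    with urlopen(base_url.rstrip("/") + path) as response:
        return response.read().decode("utf-8", errors="replace")


# Matches href="/wiki/Title" with an optional #fragment. Links that carry
# a query string, and links to other sites, are left untouched.
_LINK = re.compile(
    r'href="(' + re.escape(ARTICLE_PREFIX) + r'[^"#?]+)((?:#[^"]*)?)"'
)


def add_html_suffix(html):
    """Append ".html" to each article link, keeping any #fragment."""
    return _LINK.sub(
        lambda m: 'href="%s.html%s"' % (m.group(1), m.group(2)), html
    )
```

For each title you would call fetch_page, run the result through
add_html_suffix, and write it to the name from title_to_filename.
Titles containing "/" (subpages) would need extra handling before they
can be used as file names.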

Heck, you can convert all pages to PDFs using either htmldoc (free) or
Prince (commercial) with a little work, if you prefer that for static
docs.
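
For the PDF step, something along these lines might do as a starting
point. It assumes the htmldoc or prince command-line tools are
installed and on your PATH, and the helper names pdf_command and
convert_to_pdf are invented for this example. Only the basic
invocations are used; both tools have many more options.

```python
import subprocess
from pathlib import Path


def pdf_command(html_path, tool="htmldoc"):
    """Build the command line that converts one HTML file to a PDF."""
    pdf_path = str(Path(html_path).with_suffix(".pdf"))
    if tool == "htmldoc":
        # --webpage treats the input as a plain web page, -f names the output.
        return ["htmldoc", "--webpage", "-f", pdf_path, str(html_path)]
    if tool == "prince":
        return ["prince", str(html_path), "-o", pdf_path]
    raise ValueError("unknown tool: %s" % tool)


def convert_to_pdf(html_path, tool="htmldoc"):
    """Run the converter; raises CalledProcessError if the tool fails."""
    subprocess.run(pdf_command(html_path, tool), check=True)
```

Calling convert_to_pdf("Foo_Bar.html") would write Foo_Bar.pdf next to
the HTML file.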

Hope this helps.

--
Dan Barrett
Author, "MediaWiki" (O'Reilly Media, 2008)
dbarrett at blazemonger.com

Boston Linux & Unix / webmaster@blu.org