[Discuss] What's the best site-crawler utility?

Daniel Barrett dbarrett at blazemonger.com
Wed Jan 8 09:43:08 EST 2014


>Daniel Barrett wrote:
>> For instance, you can write a simple script to hit Special:AllPages
>> (which links to every article on the wiki), and dump each page to HTML
>> with curl or wget.

On January 7, 2014, Richard Pieri wrote:
>Yes, but that's not human-readable. It's a dynamically generated
>jambalaya of HTML, JavaScript, PHP, CSS, and Ghu only knows what else.

Well, a script doesn't need the output to be human-readable. :-) Trust
me, this is not hard. I did it a few years ago with minimal difficulty
(using a couple of Emacs macros, if memory serves).

The HTML source of Special:AllPages is just a bunch of <a> tags (with
some window dressing around them) that all match a simple pattern.
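
If it helps, here's roughly the shape of such a script in Python. This
is only a sketch: the base URL is a made-up example, and the regex
assumes a stock MediaWiki "index.php?title=..." link layout, so adjust
both for your installation.

  #!/usr/bin/env python3
  # Sketch: pull the article links out of Special:AllPages and dump
  # each page to a local HTML file. BASE is a placeholder URL.
  import re
  import urllib.parse
  import urllib.request

  BASE = "http://wiki.example.com/index.php"  # hypothetical wiki URL

  def fetch(url):
      with urllib.request.urlopen(url) as resp:
          return resp.read().decode("utf-8", errors="replace")

  # Grab Special:AllPages and pull the page titles out of its <a> tags.
  index_html = fetch(BASE + "?title=Special:AllPages")
  titles = re.findall(r'href="[^"]*title=([^"&]+)"', index_html)

  for title in titles:
      if title.startswith("Special:"):
          continue  # skip links back to special pages
      html = fetch(BASE + "?title=" + title)
      fname = urllib.parse.unquote(title).replace("/", "_") + ".html"
      with open(fname, "w", encoding="utf-8") as f:
          f.write(html)
      print("saved", fname)

The same loop works just as well as a shell script around curl or wget;
the only real work is the one regex that matches the link pattern.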

--
Dan Barrett
dbarrett at blazemonger.com
