BLU Discuss list archive

[Discuss] What's the best site-crawler utility?

Subject: [Discuss] What's the best site-crawler utility?
From: dbarrett at blazemonger.com (Daniel Barrett)
Date: Wed, 08 Jan 2014 09:43:08 -0500
References: <52CCC46F.8050901@gmail.com>

>Daniel Barrett wrote:
>> For instance, you can write a simple script to hit Special:AllPages
>> (which links to every article on the wiki), and dump each page to HTML
>> with curl or wget.

On January 7, 2014, Richard Pieri wrote:
>Yes, but that's not humanly-readable. It's a dynamically generated 
>jambalaya of HTML, JavaScript, PHP, CSS, and Ghu only knows what else.

Well, a script doesn't need human-readability. :-) Trust me, this is
not hard. I did it a few years ago with minimal difficulty (using a
couple of Emacs macros, if memory serves).

The HTML source of Special:AllPages is just a bunch of <a> tags (with
some window dressing around it) that all match a simple pattern.
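
A rough sketch of that approach in Python, for anyone who would rather not use Emacs macros. Everything specific here is an assumption, not what I actually ran: the "/wiki/" link prefix, the base URL you pass in, and the names article_links and dump_wiki are all made up for illustration. It also reads only the first page of Special:AllPages, which a large wiki splits across several pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, unquote
from urllib.request import urlopen
import pathlib


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or not href.startswith(self.prefix):
            return
        # Crude filter: skip namespaced pages such as Special:... or Talk:...
        # (this also skips articles with a colon in the title).
        if ":" in href[len(self.prefix):]:
            return
        if href not in self.hrefs:
            self.hrefs.append(href)


def article_links(html, prefix="/wiki/"):
    """Return the article hrefs found in the HTML, in order, without repeats."""
    parser = LinkCollector(prefix)
    parser.feed(html)
    return parser.hrefs


def dump_wiki(base_url, out_dir="dump", prefix="/wiki/"):
    """Fetch Special:AllPages and save each linked article as an HTML file."""
    index_url = urljoin(base_url, prefix + "Special:AllPages")
    html = urlopen(index_url).read().decode("utf-8", errors="replace")
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for href in article_links(html, prefix):
        name = unquote(href[len(prefix):]).replace("/", "_")
        page = urlopen(urljoin(base_url, href)).read()
        (out / (name + ".html")).write_bytes(page)
```

Calling dump_wiki with your wiki's base URL would write one .html file per article into ./dump. If your wiki uses index.php?title=... style links instead, the prefix test needs adjusting.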

--
Dan Barrett
dbarrett at blazemonger.com

Boston Linux & Unix / webmaster@blu.org