BLU Discuss list archive


[Discuss] What's the best site-crawler utility?



>Daniel Barrett wrote:
>> For instance, you can write a simple script to hit Special:AllPages
>> (which links to every article on the wiki), and dump each page to HTML
>> with curl or wget.

On January 7, 2014, Richard Pieri wrote:
>Yes, but that's not humanly-readable. It's a dynamically generated 
>jambalaya of HTML, JavaScript, PHP, CSS, and Ghu only knows what else.

Well, a script doesn't need human-readability. :-) Trust me, this is
not hard. I did it a few years ago with minimal difficulty (using a
couple of Emacs macros, if memory serves).

The HTML source of Special:AllPages is just a bunch of <a> tags (with
some window dressing around them) that all match a simple pattern.
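
Something like this untested sketch is all it takes. It assumes a stock
MediaWiki reachable at http://example.com/wiki with path-style URLs
(index.php/Page_Title); the base URL and the grep pattern will vary by
install, so adjust to taste:

  #!/bin/sh
  # Untested sketch: dump every article linked from Special:AllPages to HTML.
  # BASE and the href pattern below are placeholders for your own wiki.
  BASE="http://example.com"

  curl -s "$BASE/wiki/index.php/Special:AllPages" |
    grep -o 'href="/wiki/index.php/[^"]*"' |   # every article link matches this
    sed 's|^href="||; s|"$||' |                # strip the href="..." wrapper
    sort -u |
    while read -r path; do
      # Save each page under its (URL-encoded) title.
      wget -q -O "$(basename "$path").html" "$BASE$path"
    done

On a big wiki, Special:AllPages splits the index across several pages,
so you would also have to follow its "Next page" links. And if you want
something closer to readable text than raw HTML, pipe each saved page
through something like lynx -dump.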

--
Dan Barrett
dbarrett at blazemonger.com






