[Discuss] What's the best site-crawler utility?

Greg Rundlett (freephile) greg at freephile.com
Tue Jan 7 22:34:20 EST 2014


Hi Bill,

The GPL-licensed HTTrack Website Copier works well (http://www.httrack.com/).
 I have not tried it on a MediaWiki site, but it's quite adept at copying
websites, including dynamically generated ones.

They say: "It allows you to download a World Wide Web site from the
Internet to a local directory, building recursively all directories,
getting HTML, images, and other files from the server to your computer.
HTTrack arranges the original site's relative link-structure. Simply open a
page of the "mirrored" website in your browser, and you can browse the site
from link to link, as if you were viewing it online. HTTrack can also
update an existing mirrored site, and resume interrupted downloads. HTTrack
is fully configurable, and has an integrated help system.

WinHTTrack is the Windows 2000/XP/Vista/Seven release of HTTrack, and
WebHTTrack the Linux/Unix/BSD release which works in your browser. There is
also a command-line version 'httrack'."

HTTrack is actually similar in its result to the wget -k -m -np
http://mysite that Matt mentions, but it may be easier to use in general
and offers a GUI to drive the options you want.
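
As a rough sketch, assuming your wiki is reachable at http://mysite/ and
you want the copy under ./mirror (adjust the filters and options to taste;
the httrack man page documents them all):

  # mirror http://mysite/ into ./mirror, staying within that host
  httrack "http://mysite/" -O ./mirror "+mysite/*" -v

  # roughly comparable wget invocation
  wget --mirror --convert-links --page-requisites --no-parent http://mysite/

Either way you end up with plain HTML files on disk that can be served as
static pages.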

Using the MediaWiki API to export pages is another option if you have
specific needs that cannot be addressed by a "mirror" operation (e.g. your
wiki has namespaced content that you want to treat differently).  If you
end up exporting via "Special:Export" or the API, you will then need to
convert the resulting XML (which wraps wikitext) into HTML.  I have some
notes about wiki format conversions at
https://freephile.org/wiki/index.php/Format_conversion
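
For example, assuming a stock MediaWiki layout under http://mysite/w/
(the page title Main_Page is just a placeholder):

  # export a page as XML (current revision of the wikitext, wrapped in
  # MediaWiki's export schema)
  curl 'http://mysite/w/index.php?title=Special:Export&pages=Main_Page&curonly=1' \
    -o Main_Page-export.xml

  # or ask the API for the already-rendered HTML of a single page
  curl 'http://mysite/w/api.php?action=parse&page=Main_Page&format=xml' \
    -o Main_Page-parsed.xml

The first gives you wikitext to convert yourself; the second gives you the
HTML MediaWiki already rendered, inside the API's response envelope.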

For the conversion step, there's pandoc: "If you need to convert files from
one markup format into another, pandoc is your swiss-army knife."
http://johnmacfarlane.net/pandoc/
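
For instance, pandoc can read MediaWiki markup directly, so something like
this (file names are placeholders) would turn extracted wikitext into a
standalone HTML page:

  pandoc -f mediawiki -t html -s Main_Page.wiki -o Main_Page.html

You would still have to pull the wikitext out of the Special:Export XML
first, since pandoc reads the markup itself, not the export wrapper.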

~ Greg

Greg Rundlett


On Tue, Jan 7, 2014 at 6:49 PM, Bill Horne <bill at horne.net> wrote:

> I need to copy the contents of a wiki into static pages, so please
> recommend a good web-crawler that can download an existing site into static
> content pages. It needs to run on Debian 6.0.
>
> Bill
>
> --
> Bill Horne
> 339-364-8487
>
> _______________________________________________
> Discuss mailing list
> Discuss at blu.org
> http://lists.blu.org/mailman/listinfo/discuss
>


