Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive



[Discuss] My first contribution to MediaWiki



Thanks Tom, good questions.

On Sat, Jan 17, 2015 at 10:10 PM, Tom Metro <tmetro+blu at gmail.com> wrote:

> Greg Rundlett (freephile) wrote:
> > The project page: http://www.mediawiki.org/wiki/Extension:Html2Wiki
> >
> > It's an extension to MediaWiki that lets you "import a website or web
> page
> > into your wiki".
>
>   "It does this by first "normalizing" the content with HTMLTidy, and
>   then "sanitizing" it with Purify and Regular Expressions. Then the
>   content is "converted" from HTML to WikiText using Regular Expressions
>   and a Parsoid service."
>
> Amazing that such a conversion is even possible, given how problematic
> most HTML is. In some ways this job is harder than what browsers do when
> parsing HTML, as you aren't just rendering the result, but trying to
> extract structure - or semantic meaning - from it.
>
> Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up
> with a lot of situations where you have multiple HTML constructs that
> map to a single wiki markup construct?
>

Tidy was chosen to parse non-conforming HTML into (hopefully) valid HTML.
At the very least, Tidy gets us from ugly, hackish HTML source to something
with consistent tag case, consistent attribute quoting, and a doctype.

>
> How does it handle HTML generated or loaded by JS, as is quite common
> now? (You might be able to work around that with one of the projects
> that use an embedded and programmatically controlled web rendering
> engine, like webkit.)
>

Right now, I'm not trying to handle any scripted content.  Actually,
<script> tags are summarily stripped, which is one of the reasons I
originally thought I would use "Purifier" to properly parse the content
and strip potentially malicious (or simply "breaking") scripted content.
So far I haven't needed Purifier, because I'm able to strip <script> with
a regex.

>
> What are the advantages to implementing this as a plugin rather than a
> separate command line tool (which would then support other markup
> formats, like Markdown)?
>

I'm creating a MediaWiki extension so that the user can sit down and click
buttons on a form to import content.  The extension already relies on
several outside libraries and services, so any command-line tool that does
the job is welcome in the pipeline.

>
> If you couldn't find an existing HTML to wiki markup converter,


I originally thought that the Parsoid service (Node.js) would do the
conversion for me.  Parsoid is a project the Wikimedia Foundation has been
working on to power the VisualEditor front end to Wikipedia, and it does a
full round trip mw->html/rdf->mw.  It does handle some cases of "wild"
HTML, but at the moment it has fairly strict expectations about its HTML
input.


> did you
> look for something similar, like a converter to markdown? A search for
> this turns up hits, such as:
>
> http://johnmacfarlane.net/pandoc/README.html
>
I did look for other Html2Wiki converters and didn't find anything too
useful.  I completely forgot about pandoc!!  It's been a good tool for a
long time.  A quick test shows that it's very promising as a tool in the
pipeline.  With its capability to handle so many formats on the read side,
it would be a better "backend" converter because I could create a form that
gives the user a choice of many input formats.  My immediate goal is to
handle HTML input on a collection of ~thousands of files... essentially
converting the HTML output of an HTML help-authoring system into MediaWiki
wikitext, because the wiki is used as the (better) help-authoring system.


> with an example:
>
>   pandoc -f html -t markdown http://www.fsf.org
>
> which presumably retrieves content from http://www.fsf.org, specified to
> be in HTML format, and outputs Markdown. (It also supports MediaWiki
> format.)
>
> If using a tool that doesn't support MediaWiki directly, once in
> Markdown, I imagine the conversion to MediaWiki is relatively easy.


I'm pretty sure there is a markdown-to-mw converter.  With a pandoc
backend, though, you can go straight from HTML to mw.



BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org