BLU Discuss list archive
[Discuss] My first contribution to MediaWiki
- Subject: [Discuss] My first contribution to MediaWiki
- From: greg at freephile.com (Greg Rundlett (freephile))
- Date: Sat, 17 Jan 2015 23:25:48 -0500
- In-reply-to: <54BB2424.8060908@gmail.com>
- References: <CANaytccQo5Fb-BeTje9vr7iN0yHHX0EQe6Xp_Rxqwyb2WUy_jA@mail.gmail.com> <54BB2424.8060908@gmail.com>
Thanks Tom, good questions.

On Sat, Jan 17, 2015 at 10:10 PM, Tom Metro <tmetro+blu at gmail.com> wrote:

> Greg Rundlett (freephile) wrote:
> > The project page: http://www.mediawiki.org/wiki/Extension:Html2Wiki
> >
> > It's an extension to MediaWiki that lets you "import a website or web
> > page into your wiki".
>
> "It does this by first "normalizing" the content with HTMLTidy, and
> then "sanitizing" it with Purify and Regular Expressions. Then the
> content is "converted" from HTML to WikiText using Regular Expressions
> and a Parsoid service."
>
> Amazing that such a conversion is even possible, given how problematic
> most HTML is. In some ways this job is harder than what browsers do when
> parsing HTML, as you aren't just rendering the result, but trying to
> extract structure - or semantic meaning - from it.
>
> Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up
> with a lot of situations where you have multiple HTML constructs that
> map to a single wiki markup construct?
>

Tidy was chosen to parse non-conforming HTML into (hopefully) valid HTML.
At the very least, Tidy should get us from ugly, hackish HTML source to
something with consistent tag case, quoted attributes, and a Doctype.

>
> How does it handle HTML generated or loaded by JS, as is quite common
> now? (You might be able to work around that with one of the projects
> that use an embedded and programmatically controlled web rendering
> engine, like webkit.)
>

Right now, I'm not trying to work with any scripted content. In fact,
<script> tags are summarily stripped. That is one of the reasons I
originally thought I would use "Purifier" to parse the content and strip
potentially malicious (or simply "breaking") scripted content. So far, I
don't need Purifier because I'm able to strip <script> with regex.
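
To make the regex approach concrete, here is a minimal sketch in Python (the extension itself is a MediaWiki extension, so this is not its actual code). The name strip_scripts and the sample input are hypothetical. A pattern like this is only reasonably safe on input that Tidy has already normalized; it can be fooled by a literal "</script>" inside a JavaScript string.

```python
import re

# Hypothetical helper for illustration; not taken from the Html2Wiki source.
# Matches an opening <script ...> tag, everything up to the nearest closing
# </script>, and the closing tag itself. DOTALL lets "." span newlines and
# IGNORECASE covers <SCRIPT> as well as <script>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    return SCRIPT_RE.sub("", html)


sample = '<p>Hello</p><SCRIPT type="text/javascript">alert("x");</SCRIPT><p>World</p>'
print(strip_scripts(sample))  # prints <p>Hello</p><p>World</p>
```

The non-greedy ".*?" matters: with a greedy match, two separate script blocks would be removed together with all the content between them.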
>
> What are the advantages to implementing this as a plugin rather than a
> separate command line tool (which would then support other markup
> formats, like Markdown)?
>

I'm creating a MediaWiki extension so that the user can sit down and click
buttons on a form to import content. The extension already relies on
several outside libraries and services, so any command line tool that
works is welcome.

>
> If you couldn't find an existing HTML to wiki markup converter,

I originally thought that the Parsoid service (node.js) would do the
conversion for me. Parsoid is a project that the Wikimedia Foundation has
been working on to create the "Visual Editor" front-end to Wikipedia, and
it does a full round trip: mw->html/rdf->mw. It does handle some cases of
"wild" HTML. But, at the moment, it has some strict expectations for the
HTML input.

> did you
> look for something similar, like a converter to markdown? A search for
> this turns up hits, such as:
>
> http://johnmacfarlane.net/pandoc/README.html
>

I did look for other Html2Wiki converters and didn't find anything too
useful. I completely forgot about pandoc!! It's been a good tool for a
long time. A quick test shows that it's very promising as a tool in the
pipeline. With its capability to handle so many formats on the read side,
it would be a better "backend" converter because I could create a form
that gives the user a choice of many input formats. My immediate goal is
to handle HTML input on a collection of ~thousands of files: essentially
converting the HTML output of an HTML help authoring system into MediaWiki
wikitext, because the wiki is used as the (better) help authoring system.

> with an example:
>
> pandoc -f html -t markdown http://www.fsf.org
>
> which presumably retrieves content from http://www.fsf.org, specified to
> be in HTML format, and outputs Markdown. (It also supports MediaWiki
> format.)
>
> If using a tool that doesn't support MediaWiki directly, once in
> Markdown, I imagine the conversion to MediaWiki is relatively easy.

I'm pretty sure there is a markdown to mw converter. With a pandoc
backend, you would be able to convert straight from HTML to mw.
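
As a rough sketch of what a pandoc backend could look like, the Python below only builds the command line; it assumes pandoc is installed and on the PATH when the command is eventually run. The function name build_pandoc_command is hypothetical. The -f and -t options and the mediawiki output format are pandoc's own; the URL is the one from Tom's example.

```python
import shlex

# Hypothetical helper for illustration; not part of the Html2Wiki extension.
def build_pandoc_command(source, to_format="mediawiki", from_format="html"):
    """Build the argument list for a pandoc conversion of a file or URL."""
    return ["pandoc", "-f", from_format, "-t", to_format, source]


cmd = build_pandoc_command("http://www.fsf.org")
print(shlex.join(cmd))  # prints pandoc -f html -t mediawiki http://www.fsf.org
```

To actually run the conversion, the list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True), with the wikitext read from the result's stdout. Passing an argument list rather than a shell string avoids quoting problems with odd file names, which matters when looping over thousands of files.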