I’m working on a documentation project where I might need to convert
some existing HTML pages back into text or Markdown format for the new
system. Rather than manually editing the HTML source, I’m testing a
couple of different ways to script the conversion.
Lynx is an open-source text web browser that is usually present on Linux machines and can be installed for Mac and Windows. I’ve used it in the past to see how web pages will appear to search engines or for accessibility testing.
In both cases, you can quickly tell whether the text alone is enough to
communicate your content.
To save a web page in text format, Lynx has a command-line option, “-dump”, which writes the rendered page to standard output:
$ lynx -dump http://www.whatismyip.com/ > example.txt
In my test case I couldn’t get Lynx to fetch an HTTPS page, so I
downloaded it with curl and piped it into Lynx, which reads HTML from standard input with “-stdin”:
$ curl --silent https://www.linux.com/blog/Learn/2019/2/miyolinux-Lightweight-distro-Old-School-Approach | lynx -dump -stdin > lynx.txt
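If there are many pages to convert, the curl-into-Lynx pipeline above could be wrapped in a small shell function. This is only a sketch under assumptions: `url_to_name` and `convert_page` are made-up helper names, the file naming scheme is arbitrary, and the `--fail` and `--location` flags are additions that were not part of the original test.

```shell
#!/bin/sh
# Sketch: fetch one page with curl and render it to text with lynx.
# url_to_name and convert_page are hypothetical helper names.

# Turn a URL into a safe file name: strip the scheme, then replace
# anything that is not a letter, digit, dot, or hyphen with "_".
url_to_name() {
    printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9.-]|_|g'
}

# Download the page (following redirects, failing on HTTP errors)
# and let lynx render the HTML it receives on standard input.
convert_page() {
    curl --silent --fail --location "$1" \
        | lynx -dump -stdin > "$(url_to_name "$1").txt"
}

# Usage: convert_page https://www.example.com/docs/page.html
```

Deriving the output name from the URL keeps each page in its own file, so the function can be called in a loop over a list of URLs without overwriting earlier results.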
Pandoc is an open-source “universal document converter” which understands (and can convert between) about two dozen different formats. It’s well suited for writing a document in one primary source format, then converting it to
other formats for different publishing options.
The option we’ll use here is Pandoc’s ability to convert from HTML to
Markdown, for example:
$ pandoc -s -r html http://www.whatismyip.com/ -o pandoc.md
For my page, I used the same trick as above because Pandoc couldn’t
fetch the HTTPS page directly:
$ curl --silent https://www.linux.com/blog/Learn/2019/2/miyolinux-Lightweight-distro-Old-School-Approach | pandoc -s -r html -o pandoc.md
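In the command above, Pandoc infers the output format from the “.md” extension. A variation that names both formats explicitly might look like the sketch below. `html_to_md` is a hypothetical helper name, and `--wrap=none` is an optional addition (not used in the original test) that stops Pandoc from hard-wrapping paragraphs.

```shell
#!/bin/sh
# Sketch: convert HTML on standard input to Markdown with Pandoc.
# html_to_md is a hypothetical helper name, not part of Pandoc.

html_to_md() {
    # --from/--to name the input and output formats explicitly;
    # --wrap=none keeps each paragraph on a single line.
    pandoc --from html --to markdown --wrap=none
}

# Usage: curl --silent --location "$url" | html_to_md > page.md
```

Recent Pandoc versions also accept other Markdown flavors, such as `gfm` or `commonmark`, as the `--to` value, which may be worth trying if the target system expects one of those dialects.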
Both of these options do a pretty decent job of converting HTML into
text or Markdown format. Pandoc seems slightly better at producing
Markdown, but I would need to run some more samples to
see how much manual editing would be needed afterward.
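One way to run more samples would be to save a handful of HTML pages locally and push each one through both tools. The sketch below assumes a directory of “.html” files; `compare_samples` and the output naming are hypothetical, and the line counts are only a rough first impression, not a measure of quality.

```shell
#!/bin/sh
# Sketch: run both converters over saved HTML samples for comparison.
# compare_samples and the directory layout are hypothetical.

compare_samples() {
    src_dir=$1   # directory containing saved .html files
    out_dir=$2   # directory that receives the converted files
    mkdir -p "$out_dir"
    for page in "$src_dir"/*.html; do
        [ -e "$page" ] || continue   # skip when the glob matched nothing
        name=$(basename "$page" .html)
        lynx -dump -stdin < "$page" > "$out_dir/$name.lynx.txt"
        pandoc --from html --to markdown "$page" -o "$out_dir/$name.pandoc.md"
        # Print line counts for the two outputs side by side.
        wc -l "$out_dir/$name.lynx.txt" "$out_dir/$name.pandoc.md"
    done
}

# Usage: compare_samples ./samples ./converted
```

Keeping both outputs next to each other under the same base name makes it easy to open them in a diff tool and judge how much cleanup each would need.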
I’m also going to play a bit more with Aaron Swartz’s
html2text. In my quick test, it
appeared to have a problem with malformed HTML, so I need to do some
further testing with it.