Convert from HTML to formatted plain text using w3m

Tested on

Debian (Etch, Lenny, Squeeze)
Ubuntu (Lucid, Maverick, Natty, Precise, Trusty)

Objective

To render an HTML document as formatted plain text, taking account of markup where feasible.

Scenario

Suppose you have an HTML document called input.html. You wish to render it as a text file called output.txt.

Method

One way to render HTML as text is to use a text-based web browser such as w3m, Lynx or Links. Of these, w3m is recommended here because it has somewhat better support for handling tables; however, the best choice may vary depending on the workload and the preferred output style.

Normally w3m would operate as an interactive web browser, but it can be run non-interactively by means of the -dump option:

w3m -dump input.html > output.txt

The input filename may be replaced by a URL if required. The output is written to stdout, but can be redirected to a file as in the example above.

The output width defaults to 80 characters. It can be changed using the -cols option:

w3m -dump -cols 120 input.html > output.txt

The default output encoding is chosen to match the locale. If the output is likely to be used on a machine other than the one that generated it, it is usually better to specify the encoding explicitly. This can be done by changing the display_charset setting using the -o option:

w3m -dump -o display_charset=UTF-8 input.html > output.txt
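The options above can be combined in a single command. The following is a minimal sketch rather than a definitive script: the function names html_to_text and convert_dir are hypothetical and chosen for illustration only, and the choice of UTF-8 and of writing each .txt file alongside its .html source are assumptions.

```shell
# Hypothetical helper: render one HTML file as UTF-8 text at a given width.
# Usage: html_to_text INPUT.html OUTPUT.txt [WIDTH]
html_to_text() {
    input=$1
    output=$2
    width=${3:-80}    # use 80 columns when no width is given
    w3m -dump -cols "$width" -o display_charset=UTF-8 "$input" > "$output"
}

# Hypothetical helper: convert every .html file in a directory,
# writing foo.txt next to foo.html.
# Usage: convert_dir [DIRECTORY] [WIDTH]
convert_dir() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue    # skip the literal pattern if nothing matched
        html_to_text "$f" "${f%.html}.txt" "${2:-80}"
    done
}
```

For example, html_to_text input.html output.txt 120 would render input.html at 120 columns, and convert_dir . would process every HTML file in the current directory at the default width.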

Alternatives

Using Lynx

Another text-based browser that could be used for this task is Lynx. Like w3m, it has a non-interactive mode that is selected by the option -dump:

lynx -dump input.html > output.txt

The output width defaults to 80 characters, and the output encoding defaults to ISO-8859-1. The width can be changed using the -width option:

lynx -dump -width 120 input.html > output.txt

and the encoding using the -display_charset option:

lynx -dump -display_charset UTF-8 input.html > output.txt
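As with w3m, the width and encoding options can be given together. The sketch below assumes UTF-8 output and uses a hypothetical function name, lynx_to_text. It also makes optional use of the -nolist option, which is not discussed above: this suppresses the list of link targets that Lynx otherwise appends to a dump.

```shell
# Hypothetical helper: render INPUT to stdout with Lynx as UTF-8 text.
# Usage: lynx_to_text INPUT.html [WIDTH] [nolist]
lynx_to_text() {
    input=$1
    width=${2:-80}    # use 80 columns when no width is given
    if [ "${3:-}" = "nolist" ]; then
        # -nolist omits the list of link targets at the end of the dump
        lynx -dump -nolist -width "$width" -display_charset UTF-8 "$input"
    else
        lynx -dump -width "$width" -display_charset UTF-8 "$input"
    fi
}
```

For example, lynx_to_text input.html 120 nolist > output.txt would write a 120-column rendering without the trailing list of links.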

Using Links

A third browser that could be used is Links. Again, this has a non-interactive mode selected by the option -dump:

links -dump input.html > output.txt

The output width appears to default to 80 characters, and the output encoding to plain ASCII. The width can be changed using the -width option:

links -dump -width 120 input.html > output.txt

and the encoding by means of the -codepage option:

links -dump -codepage ISO-8859-1 input.html > output.txt

At the time of writing, multi-byte output encodings (including UTF-8) were not supported by the standard version of Links, and were imperfectly supported by variants such as ELinks.
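Where a script must run on machines that may have only one of these browsers installed, it could try each in turn. The following is a sketch under stated assumptions: the function name render_with_available is hypothetical, the order of preference (w3m, then Lynx, then Links) simply follows this page, and only the width is set because the encoding options and defaults differ between the three programs.

```shell
# Hypothetical helper: render INPUT to stdout using whichever
# text-based browser is found first.
# Usage: render_with_available INPUT.html [WIDTH]
render_with_available() {
    input=$1
    width=${2:-80}    # use 80 columns when no width is given
    if command -v w3m >/dev/null 2>&1; then
        w3m -dump -cols "$width" "$input"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -dump -width "$width" "$input"
    elif command -v links >/dev/null 2>&1; then
        links -dump -width "$width" "$input"
    else
        echo "no text-based browser (w3m, lynx or links) found" >&2
        return 1
    fi
}
```

For example, render_with_available input.html 120 > output.txt. Because the three programs format tables, links and lists differently, the output will not be identical from one machine to another.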
