Convert from HTML to formatted plain text
Tested on: Debian (Etch, Lenny, Squeeze); Ubuntu (Lucid, Maverick, Natty, Precise, Trusty)
Objective
To render an HTML document as formatted plain text, taking account of markup where feasible.
Scenario
Suppose you have an HTML document called input.html. You wish to render it as a text file called output.txt.
Method
One way to render HTML as text is to use a text-based web browser such as w3m, Lynx or Links. w3m is recommended here because it has somewhat better support for tables; however, the best choice may vary with the workload and the preferred output style.
Normally w3m operates as an interactive web browser, but it can be run non-interactively by means of the -dump option:
w3m -dump input.html > output.txt
The input filename may be replaced by a URL if required. The output is written to stdout, but can be redirected to a file as in the example above.
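Because the command reads one document and writes to stdout, it is straightforward to apply it to several files in a loop. The following is a minimal sketch rather than a definitive script: dump_all is a hypothetical helper name, and it assumes that each input filename ends in .html.

```shell
#!/bin/sh
# dump_all is a hypothetical helper, not part of w3m.
# It renders each HTML file named on the command line to a .txt
# file with the same base name (page.html -> page.txt).
dump_all() {
    for src in "$@"; do
        # Strip the .html suffix and append .txt.
        dst="${src%.html}.txt"
        # Stop at the first file that fails to render.
        w3m -dump "$src" > "$dst" || return 1
    done
}

# Example usage: dump_all *.html
```

Each output file is written alongside its input, so an existing .txt file with the same base name would be overwritten.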
The output width defaults to 80 characters. It can be changed using the -cols option:
w3m -dump -cols 120 input.html > output.txt
The default output encoding is chosen to match the locale. If the output is likely to be used on a machine other than the one where it is generated then it is probably desirable for the encoding to be specified explicitly. This can be done by changing the display_charset setting using the -o option:
w3m -dump -o display_charset=UTF-8 input.html > output.txt
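The width and encoding options can be given together in one command. As a sketch under stated assumptions, the wrapper below combines them; html2text_w3m is a hypothetical name, and the choice of UTF-8 and an 80-column default is illustrative rather than required.

```shell
#!/bin/sh
# html2text_w3m is a hypothetical wrapper, not part of w3m.
# Usage: html2text_w3m INPUT OUTPUT [WIDTH]
html2text_w3m() {
    # Use the third argument as the width, or 80 columns if omitted.
    # The encoding is fixed to UTF-8 so that the result does not
    # depend on the locale of the machine doing the conversion.
    w3m -dump -cols "${3:-80}" -o display_charset=UTF-8 "$1" > "$2"
}

# Example usage: html2text_w3m input.html output.txt 120
```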
Alternatives
Using Lynx
Another text-based browser that could be used for this task is Lynx. Like w3m, it has a non-interactive mode that is selected by the option -dump:
lynx -dump input.html > output.txt
The output width defaults to 80 characters, and the output encoding defaults to ISO-8859-1. The width can be changed using the -width option:
lynx -dump -width 120 input.html > output.txt
and the encoding using the -display_charset option:
lynx -dump -display_charset UTF-8 input.html > output.txt
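As with w3m, both options can be supplied in a single invocation. The sketch below assumes a hypothetical wrapper named html2text_lynx. It also passes -nolist, which is not discussed above: this Lynx option suppresses the numbered list of link targets that is otherwise appended to the dumped text, and can be removed if that list is wanted.

```shell
#!/bin/sh
# html2text_lynx is a hypothetical wrapper, not part of Lynx.
# Usage: html2text_lynx INPUT OUTPUT [WIDTH]
html2text_lynx() {
    # -nolist omits the list of link targets that Lynx would
    # otherwise append to the output; delete it to keep the list.
    # The width defaults to 80 columns if no third argument is given.
    lynx -dump -nolist -width "${3:-80}" -display_charset UTF-8 "$1" > "$2"
}

# Example usage: html2text_lynx input.html output.txt 120
```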
Using Links
A third possible browser that could be used is Links. Again, this has a non-interactive mode selected by the option -dump:
links -dump input.html > output.txt
The output width appears to default to 80 characters, and the output encoding to plain ASCII. The width can be changed using the -width option:
links -dump -width 120 input.html > output.txt
and the encoding by means of the -codepage option:
links -dump -codepage ISO-8859-1 input.html > output.txt
At the time of writing, multi-byte output encodings (including UTF-8) were not supported by the standard version of Links, and were imperfectly supported by variants such as ELinks.
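Since all three browsers accept -dump, a script that must run on machines with differing software installed can try each in turn. The following is a minimal sketch, assuming a hypothetical helper named render_html; it prefers w3m, in line with the recommendation above, and uses only the default width and encoding of whichever browser is found.

```shell
#!/bin/sh
# render_html is a hypothetical helper, not part of any of the browsers.
# It writes the rendered text to stdout using the first browser found.
# Usage: render_html INPUT > OUTPUT
render_html() {
    if command -v w3m >/dev/null 2>&1; then
        w3m -dump "$1"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -dump "$1"
    elif command -v links >/dev/null 2>&1; then
        links -dump "$1"
    else
        echo "render_html: no text-based browser found" >&2
        return 1
    fi
}

# Example usage: render_html input.html > output.txt
```

Note that the layout and encoding of the result will differ depending on which browser is selected, so this approach suits cases where approximate formatting is acceptable.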