Convert from HTML to formatted plain text using Lynx
|Debian (Etch, Lenny, Squeeze)|
|Ubuntu (Lucid, Maverick, Natty, Precise, Trusty)|
To render an HTML document as formatted plain text, taking account of markup where feasible.
Suppose you have an HTML document called input.html. You wish to render it as a text file called output.txt.
One way to render HTML as text is to use a text-based web browser such as w3m, Lynx or Links. W3m is recommended here because it has somewhat better support for tables; however, the best choice may vary depending on the workload and preferred output style.
Normally w3m would operate as an interactive web browser, but it can be run non-interactively by means of the -dump option:
w3m -dump input.html > output.txt
The input filename may be replaced by a URL if required. The output is written to stdout, but can be redirected to a file as in the example above.
The output width defaults to 80 characters. It can be changed using the -cols option:
w3m -dump -cols 120 input.html > output.txt
The default output encoding is chosen to match the locale. If the output is likely to be used on a machine other than the one where it is generated, it is probably desirable to specify the encoding explicitly. This can be done by changing the display_charset setting using the -o option:
w3m -dump -o display_charset=UTF-8 input.html > output.txt
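The width and encoding options can be combined in a single command, which is convenient when converting many files at once. The following is a minimal sketch rather than a definitive script: the function name convert_dir, the .html and .txt suffixes and the choice of UTF-8 are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: render every .html file in a directory as a .txt file using w3m.
# convert_dir is a hypothetical helper; the second argument (output width)
# is optional and falls back to 80 columns.
convert_dir() {
    dir=$1
    width=${2:-80}
    for src in "$dir"/*.html; do
        # If no .html files exist the pattern is left unexpanded, so skip it.
        [ -e "$src" ] || continue
        dst="${src%.html}.txt"
        w3m -dump -cols "$width" -o display_charset=UTF-8 "$src" > "$dst"
    done
}
```

For example, convert_dir ./pages 120 would write ./pages/index.txt alongside ./pages/index.html, wrapped at 120 columns.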
Another text-based browser that could be used for this task is Lynx. Like w3m, it has a non-interactive mode that is selected by the option -dump:
lynx -dump input.html > output.txt
The output width defaults to 80 characters, and the output encoding defaults to ISO-8859-1. The width can be changed using the -width option:
lynx -dump -width 120 input.html > output.txt
and the encoding using the -display_charset option:
lynx -dump -display_charset UTF-8 input.html > output.txt
A third browser that could be used is Links. Again, this has a non-interactive mode selected by the option -dump:
links -dump input.html > output.txt
The output width appears to default to 80 characters, and the output encoding to plain ASCII. The width can be changed using the -width option:
links -dump -width 120 input.html > output.txt
and the encoding by means of the -codepage option:
links -dump -codepage ISO-8859-1 input.html > output.txt
At the time of writing, multi-byte output encodings (including UTF-8) were not supported by the standard version of Links, and were imperfectly supported by variants such as ELinks.
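Since any of the three browsers can do the job, a script that must run on several machines could fall back from one to the next. The sketch below illustrates the idea. The function name render_html and the preference order (w3m, then Lynx, then Links) are assumptions, and the three browsers will not necessarily produce identical output.

```shell
#!/bin/sh
# Sketch: render an HTML file to stdout with whichever browser is available.
# render_html is a hypothetical helper name; only the -dump option described
# above is used, so width and encoding are left at each browser's defaults.
render_html() {
    if command -v w3m >/dev/null 2>&1; then
        w3m -dump "$1"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -dump "$1"
    elif command -v links >/dev/null 2>&1; then
        links -dump "$1"
    else
        echo "render_html: no text-based browser found" >&2
        return 1
    fi
}
```

It would be used in the same way as the individual commands, for example render_html input.html > output.txt.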