Convert a text file from one character encoding to another

Tested on

Debian (Etch, Lenny, Squeeze)
Ubuntu (Lucid, Maverick, Natty, Precise, Trusty)

Objective

To convert a text file from one character encoding to another

Scenario

Suppose that you have received a text file called input.txt that is encoded using ISO 8859-1. You wish to convert it to UTF-8, writing the result to a file called output.txt.

Method

The standard method for converting between character encodings is to use the iconv command:

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

The -f option specifies the input encoding (‘from’) and the -t option specifies the output encoding (‘to’). Both of these must be encodings that are supported by iconv, and they must be specified using a name that iconv recognises. A list of supported encodings can be obtained using the -l or --list option:

iconv -l
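The list is long, so if you want to check whether a particular encoding is supported it can be convenient to filter it with grep (a sketch; the case-insensitive match allows for variations such as ISO-8859-1 versus ISO8859-1):

```shell
# Search the list of supported encodings for names containing "8859":
iconv -l | grep -i 8859
```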

The input and output encodings default to the current locale; however, POSIX requires that at least one of them be specified. If no input files are listed then iconv acts as a filter, reading from standard input and writing to standard output.
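Because iconv reads standard input when no files are listed, it can be used directly in a pipeline. For example (0xe9 is ‘é’ in ISO 8859-1):

```shell
# Convert text on the fly as part of a pipeline:
echo -n $'caf\xe9' | iconv -f ISO-8859-1 -t UTF-8
# → café
```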

If iconv encounters a character that cannot be represented using the selected output encoding then its default behaviour is to terminate with an error message and a non-zero exit status. Use the -c option to instruct it to ignore such characters.
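As a sketch of the -c behaviour, the following converts a string containing a character (U+2603, the snowman) that has no representation in ISO 8859-1:

```shell
# Without -c, iconv would terminate at the snowman with an error;
# with -c, the unrepresentable character is silently dropped.
echo -n $'a\xe2\x98\x83b' | iconv -c -f UTF-8 -t ISO-8859-1
# → ab
```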

Testing

The following command should convert a thorn character (þ) from ISO 8859-1 to UTF-8, then display the result in hexadecimal:

echo -n $'\xfe' | iconv -f ISO-8859-1 -t UTF-8 | hd

The result should consist of two bytes:

00000000  c3 be

Similarly, it should be possible to perform this conversion in the opposite direction:

echo -n $'\xc3\xbe' | iconv -f UTF-8 -t ISO-8859-1 | hd

in which case the result should consist of one byte:

00000000  fe
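As a further check, the two conversions can be chained into a round trip, which should reproduce the original byte:

```shell
# ISO 8859-1 -> UTF-8 -> ISO 8859-1 should leave the byte unchanged:
echo -n $'\xfe' | iconv -f ISO-8859-1 -t UTF-8 | iconv -f UTF-8 -t ISO-8859-1 | hd
```

The output should again show the single byte fe.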

Tags: shell