Convert a text file from one character encoding to another
Tested on: Debian (Etch, Lenny, Squeeze); Ubuntu (Lucid, Maverick, Natty, Precise, Trusty)
Objective
To convert a text file from one character encoding to another
Scenario
Suppose that you have received a text file called input.txt that is encoded using ISO 8859-1. You wish to convert it to UTF-8, writing the result to a file called output.txt.
Method
The standard method for converting between character encodings is to use the iconv command:
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
The -f option specifies the input encoding (‘from’) and the -t option specifies the output encoding (‘to’). Both of these must be encodings that are supported by iconv, and they must be specified using a name that iconv recognises. A list of supported encodings can be obtained using the -l or --list option:
iconv -l
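Encoding names are case-insensitive, and many encodings appear in the list under several aliases. Since the list is long, one way to check whether a particular encoding is available is to filter it with grep (a small sketch, assuming a glibc-style iconv):

```shell
# List the supported encodings, keeping only lines that mention ISO-8859-1.
iconv -l | grep -i 'ISO-8859-1'
```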
The input and output encodings default to those of the current locale; however, POSIX requires that at least one of them be specified. If no input files are listed then iconv acts as a filter, reading from standard input and writing to standard output.
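The filter behaviour can be demonstrated by piping text through iconv without naming any input file (a sketch; the \x escape in printf assumes a shell such as bash or a coreutils printf):

```shell
# With no input files, iconv reads standard input and writes standard output,
# so it can sit in the middle of a pipeline. Here the ISO 8859-1 byte 0xe9 (é)
# is converted to the two-byte UTF-8 sequence c3 a9.
printf 'caf\xe9\n' | iconv -f ISO-8859-1 -t UTF-8
```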
If iconv encounters a character that cannot be represented using the selected output encoding then its default behaviour is to terminate with an error message and a non-zero exit status. Use the -c option to instruct it to ignore such characters.
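For example, the thorn character (þ) has no equivalent in ASCII, so converting it to ASCII fails unless -c is given (a sketch):

```shell
# Without -c this conversion would terminate with an error; with -c the
# unrepresentable character is dropped, leaving only the ASCII bytes.
printf 'a\xc3\xbeb' | iconv -c -f UTF-8 -t ASCII
```

The output is ab: the thorn character is silently discarded.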
Testing
The following command should convert a thorn character (þ) from ISO 8859-1 to UTF-8, then display the result in hexadecimal:
echo -n $'\xfe' | iconv -f ISO-8859-1 -t UTF-8 | hd
The result should consist of two bytes:
00000000 c3 be
Similarly, it should be possible to perform this conversion in the opposite direction:
echo -n $'\xc3\xbe' | iconv -f UTF-8 -t ISO-8859-1 | hd
in which case the result should consist of one byte:
00000000 fe
Further reading
- The Open Group Base Specifications Issue 6, IEEE Std 1003.1, The Open Group, 2004
Tags: shell