How to Convert Text Encoding

Text encoding tells the system how a character is represented in bits and bytes. There are several hundred known encodings, and different languages often have their own. For instance, Japanese, Simplified Chinese and Traditional Chinese text are traditionally encoded differently. A few popular ones you have probably heard of are ASCII, C, Latin1, Big5, Base64, EBCDIC, EUC-JP, ISO-8859-1, Unicode, UTF-7, UTF-8, Quoted-Printable (for emails) and many more.
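As a quick illustration of the difference, here is one way to look at the bytes behind a single character on a *NIX machine. This assumes your terminal uses UTF-8 and that the xxd utility is installed; the character é is just an example:

printf 'é' | xxd

printf 'é' | iconv -f UTF-8 -t ISO-8859-1 | xxd

The first command shows two bytes (c3 a9) because the terminal hands the character over in UTF-8; the second shows a single byte (e9), which is how Latin1 stores the same character.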

Sometimes you need to convert from one encoding to another because your application can only read data encoded a certain way. For example, older Windows applications only know how to read ASCII, whereas newer ones (especially Java and .NET) can handle Unicode and even UTF-8. Even today, many UNIX and Linux applications can only handle the C locale or ASCII encoding.

*NIX systems (Linux and UNIX) come with a command line program called iconv that makes it easy to convert from one text encoding to another. The -c flag tells the program to skip characters that cannot be converted, -f specifies the encoding to convert from and -t the encoding to convert to.

iconv -c -f ASCII -t UTF-8 [FILE_NAME] > [NEW_FILE]
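For example, to convert a Latin1 (ISO-8859-1) file into UTF-8 (the file names here are just placeholders):

iconv -c -f ISO-8859-1 -t UTF-8 notes-latin1.txt > notes-utf8.txt

If you are not sure what names to use for the -f and -t parameters, run iconv -l to print the full list of encodings your copy of iconv understands:

iconv -l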

You can download iconv for Windows from the GnuWin32 project (download the libiconv package, which includes the iconv.exe application). You will also need to download the libintl3.dll library from GnuWin32 and drop the DLL into the same bin folder as iconv.exe.
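Once iconv.exe and libintl3.dll are in place, the Windows version is used the same way from a Command Prompt. For example, to convert a file from the common Windows Western encoding (CP1252) to UTF-8 (the file names are placeholders):

iconv.exe -c -f CP1252 -t UTF-8 input.txt > output.txt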

To repair a broken UTF-8 file (stripping out any invalid byte sequences), you can also run it with the same encoding in both the from and to parameters:

iconv -c -f UTF-8 -t UTF-8 [FILE_NAME] > [NEW_FILE]
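A related trick: if you leave out the -c flag, iconv stops with an error at the first invalid byte sequence instead of skipping it, so the same command doubles as a quick validity check (the file name is a placeholder):

iconv -f UTF-8 -t UTF-8 suspect.txt > /dev/null

If it exits quietly, the file is already valid UTF-8; if not, it reports the position of the bad sequence.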

If you cannot get iconv to work, you may want to consider another command line program called recode. Recode does the same thing but is less popular than iconv. I would stick with iconv if given the choice.
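For reference, the equivalent recode command looks something like this, if I remember its request syntax correctly. Note that recode edits the file in place by default, so keep a backup of the original:

recode ASCII..UTF-8 myfile.txt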