Saturday, March 22, 2008

Windows, Unix and Hebrew, Oh my!

As a .NET developer, you are rarely bothered by trivialities such as character encoding, since the framework stores strings as Unicode (UTF-16) by default.

But what happens when you need to encode your text so that someone else (not using .NET) can decode it, and that someone uses a single byte per character?

Let's start with a few definitions:

ASCII - a standard that uses a single byte for each character, but only defines 128 possible symbols. There is no such thing as "Hebrew ASCII".
ANSI - Same idea, but here the eighth bit of each byte is also used, giving 128 extra values (128-255) for non-English characters. The problem is that each language or region uses its own table (a "code page"), so the same byte may show up as a different character on different computers, depending on their configuration.
Unicode, UTF-8, etc - A single character set with room for all languages, stored in one of several encodings: UTF-8 uses one to four bytes per character, UTF-16 uses two or four. Both sides still have to agree on the same encoding.
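
To make the size differences concrete, here is a minimal C# sketch that counts the bytes produced for the same text by the built-in encodings. The class name and the sample word are made up for the example; the Hebrew word is written with \u escapes so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        // "shalom": shin, lamed, vav, final mem (four Hebrew letters)
        string text = "\u05E9\u05DC\u05D5\u05DD";

        // UTF-8: each Hebrew letter takes two bytes.
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 8

        // UTF-16 (Encoding.Unicode): each Hebrew letter is one 16-bit unit, i.e. two bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 8

        // ASCII has no Hebrew letters, so each one is replaced by '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(ascii));        // 3F-3F-3F-3F

        // Plain English text stays at one byte per character in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetByteCount("abc"));   // 3
    }
}
```

Note that ASCII does not fail on Hebrew input, it silently turns every letter into a question mark, which is easy to miss until the other side complains.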

Here are your options:

Encoding.ASCII - Your basic ASCII (7 bits) encoding.
Encoding.Unicode - UTF-16 (little-endian) encoding, two bytes for most characters, including Hebrew.
Encoding.UTF8 - Encodes ASCII characters as a single byte each, but switches to a longer representation (two bytes for Hebrew letters) for non-ASCII characters.
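Encoding.Default - ANSI encoding based on the computer's configuration, meaning both sender and receiver must be configured with the same ANSI code page. (This describes the .NET Framework; on .NET Core and later, Encoding.Default always returns UTF-8.)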
Encoding.GetEncoding - Returns the encoding for a specific code page. Use this method when you need an ANSI encoding, since the result does not depend on the machine's configuration. However, you still need to know which code page the other side expects.

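Encoding.GetEncoding(862) - Uses MS-DOS Hebrew encoding, with the Hebrew letters starting at byte value 128 (0x80).
Encoding.GetEncoding(1255) - Uses Windows-1255 Hebrew encoding, with the Hebrew letters starting at byte value 224 (0xE0). The letters sit at the same positions as in the ISO 8859-8 standard, which is also used on Unix, although the two encodings differ in some of the other characters. ISO 8859-8 itself is available as code page 28598.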
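
The difference between the two Hebrew code pages can be checked with a short C# sketch along these lines. The class name is made up for the example, and the RegisterProvider line is an assumption about your runtime: on .NET Core and .NET 5+ the legacy code pages are only available after registering CodePagesEncodingProvider, while on the classic .NET Framework they are built in and that line should be dropped.

```csharp
using System;
using System.Text;

class HebrewCodePages
{
    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // 862 and 1255 must be enabled explicitly. Drop this line on the
        // classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string alef = "\u05D0"; // Hebrew letter alef

        byte[] dos = Encoding.GetEncoding(862).GetBytes(alef);
        byte[] win = Encoding.GetEncoding(1255).GetBytes(alef);

        Console.WriteLine(dos[0]); // 128 (0x80)
        Console.WriteLine(win[0]); // 224 (0xE0)

        // Decoding Windows-1255 bytes as code page 862 does not give alef back:
        // in code page 862 the byte 0xE0 is a different character.
        string wrong = Encoding.GetEncoding(862).GetString(win);
        Console.WriteLine(wrong == alef); // False
    }
}
```

The last check is the practical point: a single byte per character only works when the sender and the receiver use the same code page.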