Greg Reddick: Using File.WriteAllText() with Encoding.UTF8 Writes Byte Order Mark (BOM) EF BB BF

First, a little background on ASCII, Unicode, and UTF-8. ASCII (American Standard Code for Information Interchange) is a 50-year-old standard, first adopted for teleprinters. It has 128 codes, numbered 0 through 127, and works rather well for representing English. As computers were used in other parts of the world, though, they needed some way to represent characters outside the ones available in ASCII. Various schemes were developed, but the one that has become the standard is Unicode.

Unicode represents each character as a numbered code point, allowing most characters in most languages to be represented. The first 128 code points have exactly the same values as ASCII, making Unicode a superset of ASCII. Unicode does not have a single way of representing its code points in bytes, though, and several encodings are in use. The most popular encoding is UTF-8.

UTF-8 has the advantage that text made up only of characters in the ASCII range takes the same number of bytes, with the same values, as it would in ASCII. Only characters outside the ASCII range take more than one byte each.
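As a rough illustration of the length claim, the sketch below counts the UTF-8 bytes for a few strings. The class and method names (Utf8LengthDemo, Utf8ByteCount) are made up for this example; Encoding.UTF8.GetByteCount is the standard .NET call and does not count any BOM.

```csharp
using System;
using System.Text;

static class Utf8LengthDemo
{
    // Number of bytes UTF-8 needs for the text itself (no BOM is counted).
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    static void Main()
    {
        // All ASCII: one byte per character.
        Console.WriteLine(Utf8ByteCount("Hello World!")); // 12

        // U+00E9 (e with acute accent) is outside ASCII: two bytes.
        Console.WriteLine(Utf8ByteCount("\u00E9")); // 2

        // U+20AC (euro sign) needs three bytes.
        Console.WriteLine(Utf8ByteCount("\u20AC")); // 3
    }
}
```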

So, given all that, you might think that the following four lines of C# code should all output the same bytes:

File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!");
File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!", Encoding.Default);
File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!", Encoding.ASCII);
File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!", Encoding.UTF8);

Since the "Hello World!" text is all in the ASCII range, you would expect all four lines to write the same bytes. The first three lines do write the same thing, but the fourth line writes something different. Here is a hex dump of the output of each of the first three lines:

00000000  48 65 6C 6C 6F 20 57 6F 72 6C 64 21              Hello World!

Here is the hex dump of the Encoding.UTF8 file:

00000000  EF BB BF 48 65 6C 6C 6F 20 57 6F 72 6C 64 21     ...Hello World!

What are those first three bytes, EF BB BF? They are called the Byte Order Mark (BOM): the Unicode code point U+FEFF, encoded in UTF-8. A BOM at the start of the text is supposed to indicate to a system reading the bytes how they are supposed to be read. Byte order matters for encodings built from multi-byte units, such as UTF-16. A 16-bit value takes two bytes, and those bytes can be stored most significant byte first or least significant byte first. The first is called Big Endian, and the second is called Little Endian. Most computers today use Little Endian byte ordering.

For example, the decimal number 400 as a 16-bit value is encoded 00000001 10010000 in Big Endian and 10010000 00000001 in Little Endian. The bits within each byte stay the same; only the order of the bytes changes. In UTF-16 the BOM shows up as FE FF or FF FE, which tells the system which byte order is in use. UTF-8 is read one byte at a time and has no byte order to get wrong, so there the three-byte BOM acts only as a signature marking the text as UTF-8.
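The BOM bytes that .NET associates with each encoding can be inspected through Encoding.GetPreamble(). The following is a small sketch; the PreambleDemo class and PreambleHex helper are names invented for this example.

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hex form of the preamble (BOM) bytes an encoding reports, e.g. "EF BB BF".
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }

    static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF BB BF
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF FE (UTF-16, Little Endian)
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE FF (UTF-16, Big Endian)
        Console.WriteLine(PreambleHex(Encoding.ASCII));            // empty line: ASCII has no BOM
    }
}
```

The two UTF-16 preambles are the same code point, U+FEFF, with its two bytes in opposite order.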

When a system reading Unicode text sees the Byte Order Mark, it is supposed to eat those bytes. However, if the system isn't expecting the BOM, then it displays what looks like three random letters at the beginning of the text, like ï»¿.
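Both behaviors can be reproduced in memory. In this sketch (BomReadDemo and its methods are illustrative names, not part of any library), a StreamReader consumes the BOM, while decoding the same bytes one character per byte as ISO-8859-1 stands in for a reader that isn't expecting it.

```csharp
using System;
using System.IO;
using System.Text;

static class BomReadDemo
{
    // Reads the bytes through a StreamReader, which recognizes the UTF-8 BOM
    // and leaves it out of the returned text.
    public static string ReadWithStreamReader(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Decodes every byte as one ISO-8859-1 character, the way a reader that
    // knows nothing about UTF-8 or the BOM might.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    static void Main()
    {
        // EF BB BF followed by the ASCII bytes for "Hello".
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x48, 0x65, 0x6C, 0x6C, 0x6F };

        Console.WriteLine(ReadWithStreamReader(withBom).Length); // 5: BOM consumed
        Console.WriteLine(DecodeAsLatin1(withBom).Length);       // 8: BOM shows up as three characters
    }
}
```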

So if you want to write UTF-8 with the BOM, then you should use:

File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!", Encoding.UTF8);

On the other hand, if you don't want the BOM, then you should use:

File.WriteAllText(@"c:\temp\Sample.txt", "Hello World!");

They are not the same!
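If you would rather state the choice explicitly than rely on the default, one option is to pass a UTF8Encoding constructed with false, which asks it not to emit the BOM. The sketch below writes to a temporary file instead of c:\temp so it can run anywhere; WriteWithoutBomDemo and WriteAndReadBack are hypothetical names for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class WriteWithoutBomDemo
{
    // Writes text to a temporary file with the given encoding and returns the raw bytes.
    public static byte[] WriteAndReadBack(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, encoding);
            return File.ReadAllBytes(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // Encoding.UTF8 emits the BOM: 15 bytes starting with EF-BB-BF.
        Console.WriteLine(BitConverter.ToString(
            WriteAndReadBack("Hello World!", Encoding.UTF8)));

        // new UTF8Encoding(false) does not: 12 bytes starting with 48.
        Console.WriteLine(BitConverter.ToString(
            WriteAndReadBack("Hello World!", new UTF8Encoding(false))));
    }
}
```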

Incidentally, the outputs of the four lines at the top differ from each other in more ways if the text includes non-ASCII characters, but that is a whole other topic.
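As a brief taste of that topic, here is a sketch comparing the bytes produced for one non-ASCII string by Encoding.ASCII and Encoding.UTF8. It calls GetBytes directly, so no BOM is involved, and it leaves out Encoding.Default because that result depends on the platform. NonAsciiDemo and Hex are names invented for the example.

```csharp
using System;
using System.Text;

static class NonAsciiDemo
{
    // Hex form of the bytes an encoding produces for the text (no BOM included).
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = "caf\u00E9"; // "cafe" with an acute accent on the e

        // ASCII cannot represent U+00E9 and substitutes '?' (3F).
        Console.WriteLine(Hex(text, Encoding.ASCII)); // 63-61-66-3F

        // UTF-8 encodes U+00E9 as the two bytes C3 A9.
        Console.WriteLine(Hex(text, Encoding.UTF8));  // 63-61-66-C3-A9
    }
}
```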

Greg Reddick

2015-12-16
