It all started with a simple bug report: when a user pastes a registered trademark symbol (®) into a textarea, it is "having children" when it is displayed -- multiple characters appear where there was only one before. Many hours later I concluded that my attempts at quick fixes weren't going to work -- I was getting different results from Firefox on Linux than from IE on Windows XP.
The comments below come from my experience with Perl on Linux, but much of the advice should be language and platform independent.
30 Second Guide to Character Sets [1]
- Strings are stored as a sequence of bytes.
- What characters these bytes represent is completely arbitrary, but it is dictated by what the software believes the character set encoding is.
- Every piece of software that touches the string has an opportunity to screw it up if it believes wrong.
- If English is your primary language, you won't notice it is screwed up until you get to unusual characters (registered trademark, smart quotes, foreign language characters).
- Fixing it once it is screwed up will be painful.
- Once you fix it, it will break again every time there is new software that touches the string.
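The points above can be seen in a few lines of Perl. This is only a sketch of one way the "having children" bug can arise (UTF-8 bytes misread as Latin-1 and then written out as UTF-8 again); the helper name hex_bytes is made up for this illustration, and the actual cause in any given application may differ.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The registered trademark sign, U+00AE, encoded as UTF-8: two bytes.
my $utf8_bytes = encode('UTF-8', "\x{AE}");
print hex_bytes($utf8_bytes), "\n";    # c2 ae

# Software that believes those bytes are Latin-1 sees two characters...
my $misread = decode('ISO-8859-1', $utf8_bytes);
print length($misread), "\n";          # 2

# ...and writing them back out as UTF-8 produces four bytes.
my $doubled = encode('UTF-8', $misread);
print hex_bytes($doubled), "\n";       # c3 82 c2 ae
```

Each round trip through software with the wrong belief repeats the damage, which is why one character can turn into several.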
Fortunately it is relatively straightforward to debug if you follow:
Ogg's Rules for Debugging Character Set Problems
- Never waste your time debugging by printing the strings to the terminal or to a file. Your screen driver or editor is software. Its interpretation of the character set just introduces more variables into the equation. [2]
- Always print the bytes as hex or use another reliable mechanism for determining what bytes actually make up the string. For Perl, this is Devel::Peek::Dump.
- Check the values before and after any operation that modifies the string.
- For languages like Perl that attempt to transparently handle both ASCII and Unicode, check what the language believes the string is before and after any operation that modifies the string.
- Don't use string literals that contain characters that aren't 7-bit safe (that is, characters whose numeric value is 128 or higher); instead, specify them with the numeric value appropriate to the character set. In Perl, always use the Unicode value for this. [3]
- If a web page says it is using ISO-8859-1 encoding or Latin-1 encoding, assume it is actually using Windows-1252 encoding instead. This is what the browsers assume, so you might as well (this primarily affects the ™ symbol).
- If you don't know what encoding the input is in, you really have no chance of actually knowing what the bytes mean. For an HTML input form this means using a hidden field named _charset_. You are using one in every form, right? (See this Mozilla bug report for details.)
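As a rough sketch of the "check before and after" rules, the snippet below reports a string as hex along with whether Perl has it flagged as utf8, so nothing depends on how a terminal renders it. The helper name describe and the sample input are assumptions for illustration; utf8::is_utf8 and Devel::Peek::Dump are the standard Perl facilities.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical helper: code points in hex plus Perl's internal utf8 flag.
# The output is pure ASCII, so any terminal will show it faithfully.
sub describe {
    my ($label, $str) = @_;
    my $hex  = join ' ', map { sprintf '%02x', ord($_) } split //, $str;
    my $flag = utf8::is_utf8($str) ? 'utf8 flag on' : 'utf8 flag off';
    return "$label: [$hex] ($flag)";
}

my $before = "\xC2\xAE";                 # two raw bytes, e.g. from a form post
my $after  = decode('UTF-8', $before);   # the operation being checked

print describe('before', $before), "\n"; # before: [c2 ae] (utf8 flag off)
print describe('after',  $after),  "\n"; # after: [ae] (utf8 flag on)

# Devel::Peek::Dump writes Perl's internal view of each scalar to STDERR.
Dump($before);
Dump($after);
```

Wrapping every suspicious operation in a pair of calls like this narrows down which step changed the bytes or the flag.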
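The ISO-8859-1 versus Windows-1252 rule is easy to check with the Encode module. A minimal sketch, assuming the incoming byte is 0x99 (the Windows-1252 byte for the ™ symbol):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x99";   # the Windows-1252 byte for the trademark sign

for my $charset ('ISO-8859-1', 'cp1252') {
    my $char = decode($charset, $byte);
    printf "%-10s -> U+%04X\n", $charset, ord($char);
}
# ISO-8859-1 -> U+0099   (a C1 control character)
# cp1252     -> U+2122   (TRADE MARK SIGN)
```

Decoding as real ISO-8859-1 silently yields an invisible control character, while cp1252 yields the symbol the user actually typed.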
[1] Read Advanced Perl Programming for more in-depth coverage.
[2] For example, if you output a Perl string that is not marked as utf8 to a UTF-8 terminal as you normally would, it will print correctly (even though it is wrong). If the string is correctly marked as utf8, it will print as if it is screwed up. If you correctly set STDOUT to utf8 (binmode(STDOUT, ":utf8")), the results will be reversed. Do you even know what character set your terminal uses?
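The reversal described in footnote 2 can be checked without a terminal by printing to an in-memory filehandle and looking at the bytes that came out. This is a sketch under the assumption that an in-memory handle with a :utf8 layer behaves like STDOUT after binmode; the helper name bytes_written is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to an in-memory handle opened with the
# given layer ('' for none, ':utf8' to mimic binmode) and return the bytes
# that were written, as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $raw     = "\xC2\xAE";               # UTF-8 bytes, not marked as utf8
my $decoded = decode('UTF-8', $raw);    # one character, marked as utf8

print bytes_written('',      $raw),     "\n";   # c2 ae       (looks right on a UTF-8 terminal)
print bytes_written('',      $decoded), "\n";   # ae          (invalid UTF-8, looks broken)
print bytes_written(':utf8', $raw),     "\n";   # c3 82 c2 ae (double encoded)
print bytes_written(':utf8', $decoded), "\n";   # c2 ae       (correct)
```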
[3] If you ignore this advice and use string constants containing non-English or special characters anywhere in your code, make sure you know what character set your language is going to think the file is in (which, of course, is probably unrelated to what character set your OS thinks the file is in). For Perl, if you have use utf8; at the top of your file, the file will be assumed to be in the UTF-8 encoding. If you instead have use encoding "latin-1"; then Perl will assume your file is in Latin-1. See perldoc utf8.
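Following the rule about numeric values avoids the source-encoding question entirely. A minimal sketch, with the hash name and keys chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The source file stays pure ASCII: each character is named by its
# Unicode code point instead of being typed as a literal.
my %symbol = (
    registered => "\x{00AE}",   # REGISTERED SIGN
    trademark  => "\x{2122}",   # TRADE MARK SIGN
);

for my $name (sort keys %symbol) {
    my $char  = $symbol{$name};
    my $bytes = encode('UTF-8', $char);   # encode explicitly at the output boundary
    printf "%-10s U+%04X -> %s\n", $name, ord($char),
        join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}
# registered U+00AE -> c2 ae
# trademark  U+2122 -> e2 84 a2
```

Because the file contains nothing above 127, it means the same thing whatever encoding Perl or your editor assumes for it.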