UTF-8 Everywhere

Posted on 2014-6-20

Where do you begin when you're trying to support Unicode? Or rather, where do you draw the line when it comes to supporting printable characters beyond the 128 code points of 7-bit ASCII? Only you can answer that question for your own project. Wherever that line is, it always lies beyond this baseline: use UTF-8 for everything. That should be the minimum; you can filter or transcode as needed later. Simply trusting system defaults just doesn't cut it.

UTF-8 has been around for a long time, but it's not as ubiquitous as you'd expect, so the UTF-8 Everywhere manifesto is still very relevant. If you're reading this you probably don't need to be convinced, but you can still make sure you know the absolute minimum every software developer should know about Unicode and character sets.

Encoding Java source files in UTF-8

Save your code in UTF-8. Update your Eclipse workspace and project settings, and the default settings in NetBeans. Coding in UTF-8 is useful even for the simple fact that you can see the ACTUAL characters in your literals instead of escaped code points like "\uD83D\uDCA9". Your standard Properties files, unfortunately, will not work out of the box, because Properties.load(InputStream) assumes ISO-8859-1. You can use Spring's PropertiesFactoryBean (which lets you specify an encoding), use this neat little workaround, or if you really want to keep it old-school: fiddle with the native2ascii tool that comes with the JDK.
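As a minimal sketch of what such a workaround can look like with only the JDK: Properties.load(Reader) (available since Java 6) lets you choose the decoding yourself. The class name Utf8PropertiesDemo, the helper loadUtf8, and the "greeting" key are made up for illustration. The non-ASCII character is written as an escape only so the snippet compiles whatever -encoding your compiler uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8PropertiesDemo {

    // Hypothetical helper: decodes the stream as UTF-8 through a Reader,
    // instead of the ISO-8859-1 that Properties.load(InputStream) assumes.
    static Properties loadUtf8(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded .properties file on disk.
        byte[] file = "greeting=caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        Properties props = loadUtf8(new ByteArrayInputStream(file));
        System.out.println(props.getProperty("greeting"));
    }
}
```

In a real project you would pass a FileInputStream or a classpath resource stream instead of the in-memory byte array.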

Don't forget about the Java compiler. You also have to let it know about your 'futuristic' plans of supporting Unicode by adding the command-line option -encoding UTF-8. Maven has a plugin config option for that.

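Git doesn't care about your puny encoding: it is encoding-agnostic and treats file contents as bytes. Well... except for your commit messages, which it expects to be UTF-8 unless you configure otherwise. And Subversion already uses UTF-8 for the metadata in your repository, so you're good to go SCM-wise in both cases.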

Handling Strings in your application

It helps to NOT think of a Java String as an array of bytes, but rather as an immutable sequence of char values ranging from 0x0000 to 0xFFFF... because that's what they actually are. Those values are UTF-16 code units, so a character outside that range (most emoji, for example) takes up two chars, a surrogate pair. Make sure that when grabbing characters from a String or working with indexes inside a String, you grab code points when necessary. The difference will matter when you find yourself using regular expressions for matching white-space when a white-space is not just a white-space, or transformations that remove diacritics.
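To make the difference between chars and code points concrete, here is a small sketch using only JDK calls (String.codePointCount, java.text.Normalizer, and the UNICODE_CHARACTER_CLASS regex flag). The class and method names are invented for this example, and the literals are escaped only so the snippet compiles under any source encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class CodePointDemo {

    // Without UNICODE_CHARACTER_CLASS, \s only matches ASCII whitespace.
    private static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Counts code points rather than UTF-16 code units (what length() returns).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Decomposes accented letters (NFD) and then drops the combining marks.
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Collapses any run of Unicode whitespace into a single ASCII space.
    static String normalizeSpaces(String s) {
        return UNICODE_SPACE.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String poo = "\uD83D\uDCA9"; // U+1F4A9: one code point, two chars
        System.out.println(poo.length());                               // 2
        System.out.println(codePointLength(poo));                       // 1
        System.out.println(Integer.toHexString(poo.codePointAt(0)));    // 1f4a9
        System.out.println(stripDiacritics("cr\u00e8me br\u00fbl\u00e9e")); // creme brulee
        // U+00A0 (no-break space) and U+2003 (em space) are whitespace too.
        System.out.println(normalizeSpaces("a\u00a0\u2003b"));          // a b
    }
}
```

Note that "a\u00a0b".split("\\s") without the flag would leave the string in one piece, which is exactly the kind of surprise the paragraph above warns about.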

You say your application isn't localized?? Oh, in that case... DO IT ANYWAY! You live in a world with emojis and there is no turning back time ⏰. There is no telling WHAT users will send your way.

User input, HTML forms and Content-Type

Anywhere bytes get interpreted as Java Strings, you need to consider the encoding. The most obvious place for web developers is when dealing with web clients. You should honor the browser's request, right? In principle that includes the HTTP header Accept-Charset, with which a client can ask for a specific encoding. The Servlet spec doesn't really handle that scenario for you. So, if you use JSPs, what you put in the content type page directive is what goes. Or if you use some other view technology, what you set in the Content-Type header of your response does not (or should not) get transcoded by the container. Since HTML is human-readable text, it would be cleaner if you just used a Writer and the container would handle the choice of encoding.
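A tiny illustration of why the charset has to be explicit whenever bytes become a String. This is a sketch with invented names (ExplicitCharsetDemo, decode); the only JDK calls are String.getBytes(Charset) and new String(byte[], Charset).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    // Always name the charset; new String(bytes) would silently fall back
    // to whatever the platform default happens to be.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "é" (U+00E9) is the two bytes C3 A9 in UTF-8.
        byte[] wire = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the encoding the sender used: one character.
        System.out.println(decode(wire, StandardCharsets.UTF_8).length());      // 1

        // Decoded as ISO-8859-1: each byte becomes its own character,
        // giving the classic two-character mojibake U+00C3 U+00A9.
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```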

Browsers will generally use the encoding of a page to determine which encoding to use when submitting its forms. So you can't always assume the encoding of form submits at the server side, but you can be pretty confident about it when you send everything in UTF-8, so that's what you should do. IMHO you can ignore the Accept-Charset scenario due to how uncommon it is, but think about the type of web clients you'll be facing first. Since you're already sending everything in UTF-8, you'll be getting it back that way too. You can add hints to your forms (such as the accept-charset attribute) to be extra sure.

For interpreting requests WITHOUT any charset info, you can default to UTF-8 with Spring's CharacterEncodingFilter, which should be applied as the first filter just to be safe. I wouldn't use the force option, but it could be useful, I suppose. Now what about your URL? Do you use path variables? Do you have URL-encoded query strings? Of course you do, but guess what... the Servlet spec doesn't cover it. No biggie, all popular application containers let you configure this. You may also need to take extra care when using URL rewriting or reverse proxies (Apache, Nginx, IIS etc.). So many things to consider!
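To see what is at stake with percent-encoded query strings, here is a sketch that decodes the same value with two different charsets using the JDK's URLDecoder. It only illustrates the effect; it is not how any particular container decodes URLs internally. The class name and the decode wrapper are made up, and note that URLDecoder follows form-encoding rules (a "+" becomes a space).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryDecodingDemo {

    // Thin wrapper that turns the checked exception into an unchecked one.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // "café" percent-encoded as UTF-8

        // Same bytes, two interpretations.
        System.out.println(decode(raw, "UTF-8").length());      // 4
        System.out.println(decode(raw, "ISO-8859-1").length()); // 5
    }
}
```

If the container decodes the URL with the wrong charset, your code only ever sees the second, five-character result.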

Even if YOU do everything right, others may not. If you rely on external services, it helps to use Commons HttpClient to have easy access to the lower level workings. You can more easily intervene when services respond with gibberish. Sometimes you get sent a BOM. For UTF-16 and UTF-32 the Byte Order Mark tells you the endianness; UTF-8 has no byte order, so there it is merely a three-byte signature (EF BB BF) at the start of the stream. It's completely valid. Unfortunately, a lot of libraries don't handle it very well. Here's a nice class that handles BOMs transparently.
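If you'd rather see the idea than pull in a class, stripping a UTF-8 BOM by hand can be sketched as follows. The names Utf8BomDemo and stripUtf8Bom are invented for this example, and it deliberately handles only the UTF-8 signature, not the UTF-16 or UTF-32 ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8BomDemo {

    // Returns the payload without a leading EF BB BF, or the input unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};

        // Java's UTF-8 decoder keeps the BOM as an invisible U+FEFF character.
        String naive = new String(body, StandardCharsets.UTF_8);
        System.out.println(naive.length());        // 3
        System.out.println(naive.equals("ok"));    // false

        String clean = new String(stripUtf8Bom(body), StandardCharsets.UTF_8);
        System.out.println(clean.equals("ok"));    // true
    }
}
```

The invisible leading U+FEFF is what typically breaks JSON or XML parsers and equality checks further down the line.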

Your precious data

You may think that you own your data, but your storage medium owns it. MySQL ships with latin1 as its default character set. Your config file has UTF-8 settings that you can uncomment. Well, that was easy! Except it isn't. MySQL's utf8 charset is actually not standards-compliant, as it only allows up to 3 bytes per character, which rules out every character beyond U+FFFF (most emoji, for instance). You'll actually have to use utf8mb4.
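The 3-byte limit is easy to check from the Java side. The sketch below (invented names Utf8Mb4Demo, utf8Length, needsFourByteUtf8) shows how many bytes a few characters take in UTF-8 and flags strings that would not fit in a 3-byte-per-character column.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Mb4Demo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if any code point lies beyond U+FFFF and therefore needs
    // four bytes in UTF-8.
    static boolean needsFourByteUtf8(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1
        System.out.println(utf8Length("\u00e9"));       // 2 (é)
        System.out.println(utf8Length("\u20ac"));       // 3 (euro sign)
        System.out.println(utf8Length("\uD83D\uDCA9")); // 4 (U+1F4A9)

        System.out.println(needsFourByteUtf8("caf\u00e9"));          // false
        System.out.println(needsFourByteUtf8("hi \uD83D\uDCA9"));    // true
    }
}
```

A check like needsFourByteUtf8 can be handy as a temporary guard while you are still migrating columns to utf8mb4.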

I could list other databases and go on and on, but I think I've made the point I was trying to make. Your application talks to other applications, and content encoding happens several times during one request cycle. I like investigating encoding issues, because so many people don't know enough about it that I feel like it's a special skill of some kind. It shouldn't be, and the fact that it feels that way is a bad sign.