The Unicode Consortium's solution to encoding issues: one encoding for all text in the world. ASCII compatible, even latin-1 compatible (the first 256 code points match latin-1). The often-used UTF-8 encoding encodes unicode into 8-bit "code units". Python supports it all (that is: Unicode 3.2 support in Python 2.3 and 2.4; Unicode 4.1 is supported by Python 2.5).
Python's native unicode type is very efficient and its performance is equal to or better than normal string processing. There are lots of codecs to encode unicode into latin-1 and so on.
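For concreteness, a tiny Python 2 sketch of that encoding/decoding; the example string is mine, not from the talk:

    # Round-tripping between a unicode object and byte strings with two
    # of the built-in codecs (Python 2 syntax).
    text = u'caf\xe9'                    # the unicode object u'café'
    as_utf8 = text.encode('utf-8')       # byte string 'caf\xc3\xa9'
    as_latin1 = text.encode('latin-1')   # byte string 'caf\xe9'
    assert as_utf8.decode('utf-8') == as_latin1.decode('latin-1') == text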
Using unicode in Python has its problems. Not all modules expect unicode, so you have to encode your data for those. Some operating systems also cause problems, especially with filenames.
General principle: use unicode for all the text inside your application. Avoid mixing unicode and byte strings. So use explicit encoding/decoding in all I/O operations.
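A minimal sketch of that "decode on the way in, encode on the way out" principle, assuming Python 2 and made-up filenames:

    import codecs

    # Decode at the input boundary: codecs.open hands you unicode objects.
    infile = codecs.open('input.txt', 'r', encoding='utf-8')
    data = infile.read()    # unicode, not a byte string
    infile.close()

    # All processing inside the application works on unicode only.
    data = data.upper()

    # Encode at the output boundary, and nowhere else.
    outfile = codecs.open('output.txt', 'w', encoding='utf-8')
    outfile.write(data)
    outfile.close()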
Internationalisation (i18n) approach: write your literals as unicode in your default language, so u"my text in my default language". And: enclose all literals in a call to a translation function: translate(u"my text") or, an often-used alternative, using an underscore: _(u"my text").
The most often used tool is GNU gettext, available through the Python gettext module. There are lots of tools for it. eGenix (Marc-André's company) does it a bit differently, with an on-the-fly approach with translations stored in a database.
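A hedged sketch of how the gettext module is commonly wired up; the 'myapp' domain and the locale directory are placeholders, not Chandler's actual setup:

    import gettext

    # gettext looks for locale/<language>/LC_MESSAGES/myapp.mo;
    # fallback=True returns the original strings when no catalogue is found.
    translation = gettext.translation('myapp', localedir='locale',
                                      fallback=True)
    _ = translation.ugettext   # ugettext returns unicode (Python 2)

    print _(u"my text in my default language")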
Internationalisation is hard; there are multiple things to keep in mind. When displaying a few dropdowns for type, date and time in sentence form in an appointment application, you have to remember that the sentence order changes per language. Sort order also varies per language: a-z is clear, but where does ä go?
Chandler, the case study in his presentation, is a PIM (personal information management) application that wanted good i18n and l10n. Chandler 0.5 only supported ASCII, had hardcoded date format strings, everything was English, etc. A typical US-originating open source application :-)
ICU is a mature set of C/C++ and Java libraries for unicode support and internationalisation: unicode text handling, unicode regular expressions, date formatting, locale-dependent sorting, etc. It looked OK, so in May 2005 they added Python bindings to the C++ ICU libraries using SWIG (a wrapper generator). They have a hand-coded, leaner wrapper now, which is quicker.
They only wrapped the parts of ICU that they needed themselves for Chandler, btw.
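As an illustration of that locale-dependent sorting (and of where the ä ends up), a sketch using the PyICU bindings; the exact module name and API have shifted between releases, so treat this as an approximation:

    from PyICU import Collator, Locale

    words = [u'Zaal', u'\xe4pfel', u'apfel']   # u'äpfel' in the middle
    collator = Collator.createInstance(Locale('de_DE'))
    print sorted(words, key=collator.getSortKey)
    # German collation puts 'äpfel' right next to 'apfel'; a plain
    # sorted(words) would dump it at the very end, after 'Zaal'.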
There's one ICU part that they don't use and that's ICU's translation mechanism; they used the gettext mechanism instead. It is used a lot in the open source world. A handy method to deal with message strings that include %s replacements is to use the dictionary lookup trick: instead of "%s items of %s found" use "%(count)s items of %(type)s found" % {"count": 5, "type": "talks"} (oh, and then use unicode).
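To show why the named placeholders matter for translators (word order can change per language), a small sketch; the German string here is purely illustrative:

    # A translator can reorder the sentence without touching the code,
    # because the placeholders are looked up by name, not by position.
    english = u"%(count)s items of %(type)s found"
    german = u"Vom Typ %(type)s wurden %(count)s Treffer gefunden"
    values = {"count": 5, "type": u"talks"}
    print english % values
    print german % values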
About the infamous "UnicodeDecodeError": Always do the conversion at the I/O boundaries. He said it, Marc-André said it: do it.
Chandler has a lot of I/O boundaries: http, webdav, the filesystem, etc.
Nice idea: they're planning to provide their translations via Python eggs.