Ersin Er wrote a brief blog post about handling the Turkish language in Haskell. Because Turkish uses a character set that mostly looks familiar to Westerners, it is notorious for its ability to trip up the unwary programmer (see examples in PHP and PostgreSQL).
His example is quite nice, but we can write a more compact version of his code using a few handy features of the text and text-icu packages:
- In the text-icu library, we use the LocaleName type to describe the locale in which we want a function to operate. This type is an instance of the IsString class, so if we enable the OverloadedStrings language feature, we can write plain "tr-TR" to specify a Turkish locale.
- The Text type is also an instance of the IsString class, so we can write a literal string like "foo" and the compiler will infer the correct type for it.
- The Data.Text.IO module contains functions for performing locale-sensitive I/O using Text values.
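To make the IsString mechanism concrete, here is a minimal sketch that needs only base and the text package. The LocaleTag newtype is a hypothetical stand-in for text-icu's LocaleName, defined here purely for illustration; it is not the library's actual definition.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical stand-in for text-icu's LocaleName, defined only to show how
-- an IsString instance lets a string literal take on a custom type.
newtype LocaleTag = LocaleTag String
  deriving (Eq, Show)

instance IsString LocaleTag where
  fromString = LocaleTag

-- With OverloadedStrings enabled, this literal is built via fromString.
turkish :: LocaleTag
turkish = "tr-TR"

-- The same literal syntax yields a Text value when the context asks for one.
greeting :: Text
greeting = "merhaba"

main :: IO ()
main = do
  print turkish
  -- Data.Text.IO writes Text directly, with no conversion to String.
  TIO.putStrLn (T.append "greeting: " greeting)
```

The type signatures are what steer the literals here; in a larger program the surrounding function calls usually do the same job.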
This combination of features lets us write a less cluttered program, following the dictum that simple things should be simple.
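As a rough sketch of what such a program could look like (not necessarily the exact listing from the original post), the following assumes the text and text-icu packages are installed and that the source file is saved as UTF-8. The helper name lowerTr is made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.IO as TIO

-- Lower-case a Text value using Turkish casing rules. The literal "tr-TR"
-- is inferred to be a LocaleName because that is what ICU.toLower expects.
lowerTr :: Text -> Text
lowerTr = ICU.toLower "tr-TR"

main :: IO ()
main = do
  -- The annotation fixes this literal as Text; no pack or unpack is needed.
  let upper = "ÇIİĞÖŞÜ" :: Text
  TIO.putStrLn (T.concat ["toLower ", upper, " gives ", lowerTr upper])
```

Whether the output displays correctly also depends on the terminal's locale and encoding settings.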
I've intentionally kept the number of lines the same to preserve clarity, but there are a few advantages to the rewrite:
- Less clutter, more speed: we don't need to explicitly pack or unpack Text values to or from String values.
- Performance: we're not performing I/O on String values. This would be a big deal if we were writing a real application: I/O with Text is much faster than with String.
- Putting inference to work: the compiler correctly infers the type of "tr-TR" to be a LocaleName, and of the strings at the end to be Text, so we don't need to be so explicit.
Oh, and we still give the right answer (look carefully at upper and lower case dotted and dotless "I"):
toLower ÇIİĞÖŞÜ gives çıiğöşü
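To see why the locale matters for the dotted and dotless "I", the sketch below contrasts the locale-independent toLower from Data.Text with the locale-aware conversions from text-icu. It assumes both packages are installed; the names lowerTr and upperTr are invented for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.IO as TIO

-- Turkish-aware case conversions from text-icu.
lowerTr, upperTr :: Text -> Text
lowerTr = ICU.toLower "tr-TR"
upperTr = ICU.toUpper "tr-TR"

main :: IO ()
main = do
  -- Locale-independent conversion from Data.Text: "I" becomes dotted "i".
  TIO.putStrLn (T.toLower "I")
  -- Turkish conversion: "I" becomes dotless "ı" (U+0131).
  TIO.putStrLn (lowerTr "I")
  -- Turkish conversion: "i" becomes dotted capital "İ" (U+0130).
  TIO.putStrLn (upperTr "i")
```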
The full documentation to the text and text-icu libraries is a little difficult to read on Hackage (in fact, the text-icu API docs are completely missing), so here are links:
I’m glad that work like this is going on. One of the things I use programming for is linguistics, and despite the fact that I’m having fun with Haskell, the previous lack of sophisticated Unicode functionality was bugging me. In fact, I upgraded to GHC 6.12 purely to get better handling of Unicode I/O from the standard I/O functions.
Until your recent posts about text packages (which I got to from the Planet Haskell blog aggregator) I’d been hunting for a way to do things like normalisation of Unicode for a while. It’s just so important now for any major programming language to have good Unicode support.
I hope this will be part of the next Haskell Platform release in January.