Matt Walker
2015-12-18 17:12:17 UTC
Hi everyone,
I noticed that xmonad and xmonad-contrib both prefer to use of String =
[Char] for their stringy data-types. This is probably a terrible idea. I
cite some sources here, then outline their arguments below.
http://www.alexeyshmalko.com/2015/haskell-string-types/
https://mail.haskell.org/pipermail/haskell-cafe/2014-June/114745.html
__String is Bad__
1) Char is a horribly inefficient representation of a character, being an
entire machine word in length (at least 32 bits). Actually, it's worse:
each Char takes up _two_ machine words in GHC, since it needs one to store
GC information in. See the slide in the first link for more details.
Data.Text stores the characters in compact arrays.
2) Lists are lazy, which makes their evaluation slower. You have to thunk
on each character, which is pretty silly most of the time. Normally you
want to read in at least _chunks_ of string all at once. Data.Text is
strict, but Data.Text.Lazy exists and is (as you would assume) lazy when
you need it.
The long and the short of it is that [Char] is a suboptimal choice to use
for anything except possibly short identifiers; Haskell (via GHC) is a
compiled language, and yet performs orders of magnitudes worse than even
Perl and Python on text processing when using Data.String. There is simply
no good reason to use String when Text exists.
__Alternatives__
The other alternative is ByteString. Although ByteString is a great type
for binary data, and specifically for data exchange protocols, it seems
that it would inappropriate in this situation, due to the replacement of
most (if not all) instances being actual textual data, which obviously Text
is optimized for.
__Migration Issues__
Assuming we can agree that Text > String then, the main problem to
switching would be the pain of migration, and whether this would be worth
it. I argue it wouldn't be so bad, and is worth doing on principle alone.
The LANGUAGE pragma of OverloadedStrings allows you to use String literals
as Text literals, so that wouldn't be the main problem. The main issue is
changing all the interfaces so they accept Text instead of String, and how
this would impact existing user configs, and the xmonad-contrib archive.
Every time you use ++ you would have to replace it with <>, the Monoid
infix mappend operator. I doubt many people use : to build Strings, but in
those instances those would have to be changed too. Finally, pattern
matching on Strings like (x:xs) would break as well. All other functions
would require changing from their String/List counterpart to the Text one.
Since the names clash, one would have to import qualified as, for instance,
T and call T.intersperse or whatever. It would be a non-trivial
undertaking, but certainly doable.
__Other Breaking Changes__
Are there other niggling issues that exist in the codebase that would cause
breaking changes? Perhaps it would be a good idea to get a list of them
all and see if it's worth breaking backwards compatibility to fix them all
at once? I'm a purist when it comes to code, but I would like to hear what
other people think, and just how angry they would be with this change. I
have no idea as to what xmonad and xmonad-contrib's breaking changes policy
is.
Obviously I'm not proposing this change be undertaken for 0.13 -- I was
aiming for more 0.14 or later.
Let me know what you think.
Sincerely,
Matt
I noticed that xmonad and xmonad-contrib both prefer to use of String =
[Char] for their stringy data-types. This is probably a terrible idea. I
cite some sources here, then outline their arguments below.
http://www.alexeyshmalko.com/2015/haskell-string-types/
https://mail.haskell.org/pipermail/haskell-cafe/2014-June/114745.html
__String is Bad__
1) Char is a horribly inefficient representation of a character, being an
entire machine word in length (at least 32 bits). Actually, it's worse:
each Char takes up _two_ machine words in GHC, since it needs one to store
GC information in. See the slide in the first link for more details.
Data.Text stores the characters in compact arrays.
2) Lists are lazy, which makes their evaluation slower. You have to thunk
on each character, which is pretty silly most of the time. Normally you
want to read in at least _chunks_ of string all at once. Data.Text is
strict, but Data.Text.Lazy exists and is (as you would assume) lazy when
you need it.
The long and the short of it is that [Char] is a suboptimal choice to use
for anything except possibly short identifiers; Haskell (via GHC) is a
compiled language, and yet performs orders of magnitudes worse than even
Perl and Python on text processing when using Data.String. There is simply
no good reason to use String when Text exists.
__Alternatives__
The other alternative is ByteString. Although ByteString is a great type
for binary data, and specifically for data exchange protocols, it seems
that it would inappropriate in this situation, due to the replacement of
most (if not all) instances being actual textual data, which obviously Text
is optimized for.
__Migration Issues__
Assuming we can agree that Text > String then, the main problem to
switching would be the pain of migration, and whether this would be worth
it. I argue it wouldn't be so bad, and is worth doing on principle alone.
The LANGUAGE pragma of OverloadedStrings allows you to use String literals
as Text literals, so that wouldn't be the main problem. The main issue is
changing all the interfaces so they accept Text instead of String, and how
this would impact existing user configs, and the xmonad-contrib archive.
Every time you use ++ you would have to replace it with <>, the Monoid
infix mappend operator. I doubt many people use : to build Strings, but in
those instances those would have to be changed too. Finally, pattern
matching on Strings like (x:xs) would break as well. All other functions
would require changing from their String/List counterpart to the Text one.
Since the names clash, one would have to import qualified as, for instance,
T and call T.intersperse or whatever. It would be a non-trivial
undertaking, but certainly doable.
__Other Breaking Changes__
Are there other niggling issues that exist in the codebase that would cause
breaking changes? Perhaps it would be a good idea to get a list of them
all and see if it's worth breaking backwards compatibility to fix them all
at once? I'm a purist when it comes to code, but I would like to hear what
other people think, and just how angry they would be with this change. I
have no idea as to what xmonad and xmonad-contrib's breaking changes policy
is.
Obviously I'm not proposing this change be undertaken for 0.13 -- I was
aiming for more 0.14 or later.
Let me know what you think.
Sincerely,
Matt