Stupid UNIX Tricks #1 : LANG and shell scripts

If you’ve been using UNIX systems for a while (including Mac OS X, Linux or anything else remotely similar) you might know about the LANG environment variable. It’s used to select how your computer treats language-specific features. You can find out more than you ever wanted to know by looking here: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html

Mostly it doesn’t make much difference in your life, except that there are two commonly used default settings. One is LANG=C, which selects the traditional POSIX locale: characters are compared and sorted by their byte values, and an implementation can skip lots of fancy language-processing code. The other is LANG=en_US.UTF-8, which tells the locale-aware functions in libc to expect UTF-8-encoded strings and to apply US English conventions for things like sorting and character classification.

On the systems I use, the default seems to be en_US.UTF-8. But I suspect that most people must have LANG=C somewhere in their 20-year-old .login files, because I occasionally run into bugs where some script doesn’t work right unless you have LANG=C.

Here’s an example:

% mkdir test; cd test; touch Caa cbb
% export LANG=C
% echo [c-d]*
cbb
% export LANG=en_US.UTF-8
% echo [c-d]*
Caa cbb

So the range of characters from ‘c’ to ‘d’ includes the letter ‘C’ if you are in the en_US.UTF-8 locale. Ugh. It’s easy to get that wrong in your shell script someplace, and people do.
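
One defensive habit, sketched here under a few assumptions, is to pin the locale at the top of a script so that range expressions behave the same for everybody. LC_ALL takes precedence over LANG, so setting it covers whatever the caller had. The scratch directory and the variable names below are made up for the demonstration, and mktemp -d is a common extension rather than strict POSIX.

```shell
#!/bin/sh
# Pin the locale so bracket ranges use byte order, regardless of
# what the caller has in LANG. LC_ALL overrides LANG and LC_COLLATE.
LC_ALL=C
export LC_ALL

# Hypothetical scratch directory, just for this demonstration.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch Caa cbb

# With LC_ALL=C, [c-d] covers only the bytes 'c' and 'd'.
range_matches=$(echo [c-d]*)
echo "$range_matches"      # cbb

# Listing the characters explicitly avoids ranges altogether.
explicit_matches=$(echo [cd]*)
echo "$explicit_matches"   # cbb
```

If you would rather not touch the locale at all, spelling out the characters as in [cd]* sidesteps the ordering question entirely.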

Here’s an easier way to show why that happens:

% mkdir test1; cd test1; touch a A b B c C
% export LANG=C
% ls
A  B  C  a  b  c
% export LANG=en_US.UTF-8
% ls
a  A  b  B  c  C

So you can see that the sort order used by the ls command matches the character order the shell uses to expand a character range in a glob pattern: in en_US.UTF-8 the capital ‘C’ sorts between ‘c’ and ‘d’, so the range [c-d] picks it up. I suppose it’s consistent. But it’s one of the things that makes it a challenge to write shell scripts that are robust and portable across different users’ environments.
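
If pinning the locale for a whole script feels too heavy-handed, another option is to override it for just the one command whose ordering matters. This is only a sketch: the variable name is invented for illustration, and it uses sort rather than ls so the result does not depend on what is in the current directory.

```shell
#!/bin/sh
# Override the locale for a single command only; everything else
# in the script keeps the caller's settings.
sorted=$(printf '%s\n' a A b B c C | LC_ALL=C sort | paste -s -d ' ' -)

# Byte order puts all the capitals ahead of the lowercase letters.
echo "$sorted"   # A B C a b c
```

The VAR=value prefix applies only to the command it precedes, so the user still sees messages and dates in their own locale everywhere else.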