Adding a C.UTF-8 locale to our glibc packages - glibc - Fedora mailing-lists

16 Sep 2015


This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/
to our glibc packages.
This way, the locale is sort of “unremovable” because it is not
affected by the --install-langs option of build-locale-archive.
build-locale-archive completely ignores folders in /usr/lib/locale/
which do not have a “_” in their name.
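The rule can be modelled in a few lines. This is only a sketch of the
behaviour described above, not the actual code of build-locale-archive;
the function name and the directory listing are made up for illustration.

```python
def split_by_underscore(names):
    """Partition directory names by the rule described above:
    names containing "_" are candidates for the archive, others are ignored."""
    archived = [n for n in names if "_" in n]
    ignored = [n for n in names if "_" not in n]
    return archived, ignored


# Hypothetical listing of /usr/lib/locale/, for illustration only.
names = ["en_US.utf8", "de_DE.utf8", "ja_JP.eucjp", "C.utf8"]
archived, ignored = split_by_underscore(names)

print(archived)  # directories the archive builder would consider
print(ignored)   # directories left alone, C.utf8 among them
```

Under that rule C.utf8 is never touched by the archive step, which is
why --install-langs cannot remove it.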
This is similar to how Debian did it.
Unlike Debian, I added the LC_* sections which are missing in the
Debian source: when sections are missing, localedef prints a warning,
and I don’t want to use the “-c” option to force the output in spite
of the warnings.
This locale is very close in behaviour to C/POSIX, but LC_CTYPE simply
copies from glibc/localedata/locales/i18n (which we have just updated
for Unicode 8.0.0).
So this C.UTF-8 gives us a locale which is very much like the C locale
but uses UTF-8 encoding and tools like "ls" will display all printable
characters from Unicode instead of displaying question marks for
everything non-ASCII.
Sorting (LC_COLLATE) is done strictly via Unicode code point order which
gives the same sorting for the ASCII range as the traditional C locale.
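The claim about the ASCII range can be checked with a small sketch.
Python compares str values code point by code point, and byte-wise
comparison of UTF-8 encoded strings gives the same order, which is what
a strcmp()-style sort in the C locale does. The sample words are
invented for illustration.

```python
# Sample strings, invented for illustration: ASCII plus some non-ASCII ones.
words = ["zebra", "Apple", "apple", "Zürich", "éclair", "日本", "_tmp", "10"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Byte-wise comparison of the UTF-8 encoding, as a strcmp()-style sort does.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# For the pure-ASCII subset, compare against plain ASCII byte order.
ascii_only = [w for w in words if w.isascii()]
by_ascii_bytes = sorted(ascii_only, key=lambda s: s.encode("ascii"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
print([w for w in by_code_point if w.isascii()] == by_ascii_bytes)
```

Both comparisons print True: code point order, UTF-8 byte order, and
(for the ASCII subset) traditional byte order all agree.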
Debian does it like this:
LC_COLLATE
order_start forward
<U0000>
<U0001>
[... all code points listed individually, leaving out the unassigned ranges ...]
<U10FFFE>
<U10FFFF>
UNDEFINED
order_end
END LC_COLLATE
(more than 300000 lines of code points listed).
I used this instead:
LC_COLLATE
order_start forward
<U0000>
..
<UFFFF>
<U10000>
..
<U1FFFF>
<U20000>
..
<U2FFFF>
<UE0000>
..
<UEFFFF>
<UF0000>
..
<UFFFFF>
<U100000>
..
<U10FFFF>
UNDEFINED
order_end
END LC_COLLATE
This makes the source much shorter and more readable. The result in
the binary is the same, and the size of the binary is the same as well:
the complete binary locale needs about 1.8M, almost all of it because
of LC_COLLATE (same on Debian).
Not skipping the ranges currently unassigned in Unicode would make
the locale about 6.5M big. That seems theoretically better to me,
but there is probably little benefit in sorting unassigned code points
by code point order as well.
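As a rough cross-check, the shortened LC_COLLATE section can be
generated from its six ranges, and the number of code points it covers
can be counted. This is only a sketch: the helper names are made up,
and the symbol format (at least four hex digits) simply follows the
listing above.

```python
# Inclusive code point ranges, taken from the shortened LC_COLLATE section.
RANGES = [
    (0x0000, 0xFFFF),
    (0x10000, 0x1FFFF),
    (0x20000, 0x2FFFF),
    (0xE0000, 0xEFFFF),
    (0xF0000, 0xFFFFF),
    (0x100000, 0x10FFFF),
]


def symbol(cp):
    """Format a code point like <U0041>, with at least four hex digits."""
    return "<U{:04X}>".format(cp)


def collate_section(ranges):
    """Build the LC_COLLATE section using one ellipsis line per range."""
    lines = ["LC_COLLATE", "order_start forward"]
    for first, last in ranges:
        lines += [symbol(first), "..", symbol(last)]
    lines += ["UNDEFINED", "order_end", "END LC_COLLATE"]
    return "\n".join(lines)


covered = sum(last - first + 1 for first, last in RANGES)
total = 0x10FFFF + 1  # size of the whole Unicode code space

print(collate_section(RANGES))
print(covered, "of", total, "code points are covered by the ranges")
```

The six ranges cover 393216 of the 1114112 possible code points in 23
lines of source.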
Actually I think this should be enough:
LC_COLLATE
order_start forward
UNDEFINED
order_end
END LC_COLLATE
because of:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
opengroup> The symbol UNDEFINED shall be interpreted as including all coded
opengroup> character set values not specified explicitly or via the ellipsis
opengroup> symbol. Such characters shall be inserted in the character collation
opengroup> order at the point indicated by the symbol, and in ascending order
opengroup> according to their coded character set values. If no UNDEFINED symbol is
opengroup> specified, and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a warning message and
opengroup> place such characters at the end of the character collation order.
But unfortunately it does not work like that in glibc, which is probably
a bug.
I tried to look at the code to find out why UNDEFINED does not work as
specified in the standard but have not figured it out yet.
UNDEFINED currently does not work as specified at all: most locale
sources have UNDEFINED somewhere, but the characters not specified
explicitly do not get inserted where UNDEFINED is; instead they end up
right at the top (before all other characters).
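The difference between the two behaviours can be sketched on a toy
character set. This is a model only, not glibc code: posix_order follows
the Open Group text quoted above, observed_order follows the behaviour
I see, and the assumption that the unlisted characters stay in ascending
order among themselves in the observed case is mine.

```python
UNDEFINED = object()  # stands in for the UNDEFINED keyword


def _unlisted(charset, explicit):
    """Characters of charset not listed explicitly, in ascending order."""
    listed = {c for c in explicit if c is not UNDEFINED}
    return sorted(c for c in charset if c not in listed)


def posix_order(charset, explicit):
    """Unlisted characters are inserted where UNDEFINED stands,
    or at the end if UNDEFINED is absent (as the quoted text says)."""
    rest = _unlisted(charset, explicit)
    result, placed = [], False
    for item in explicit:
        if item is UNDEFINED:
            result.extend(rest)
            placed = True
        else:
            result.append(item)
    return result if placed else result + rest


def observed_order(charset, explicit):
    """Unlisted characters end up before all explicitly listed ones."""
    listed = [c for c in explicit if c is not UNDEFINED]
    return _unlisted(charset, explicit) + listed


charset = "abcdef"                     # tiny made-up character set
explicit = ["c", "a", UNDEFINED, "b"]  # made-up collation order

print(posix_order(charset, explicit))
print(observed_order(charset, explicit))
print(posix_order(charset, [UNDEFINED]))
```

In this model the last call, with nothing but UNDEFINED in the order,
yields plain code point order, which is why the short LC_COLLATE
section should be enough if UNDEFINED worked as specified.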
-- 
Mike FABIAN mfabian@redhat.com
☏ Office: +49-69-365051027, internal 8875027
Lack of sleep is the enemy of good work.