This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/ to our glibc packages.
This way, the locale is sort of “unremovable” because it is not affected by the --install-langs option of build-locale-archive. build-locale-archive completely ignores folders in /usr/lib/locale/ which do not have a “_” in their name.
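The directory-name rule described above can be pictured with a toy model. This is only a sketch of the rule as stated in this message, not the actual build-locale-archive code; the function name and the sample directory names are hypothetical.

```python
# Toy model of the rule described above: entries in /usr/lib/locale/
# without an underscore in their name are ignored by build-locale-archive,
# so --install-langs filtering never touches them.
# NOTE: this is NOT the real build-locale-archive logic; the function name
# and the sample directory names are made up for illustration.

def seen_by_build_locale_archive(dirname: str) -> bool:
    """Return True if the directory name contains an underscore."""
    return "_" in dirname

entries = ["de_DE.utf8", "ja_JP.utf8", "C.utf8"]

# Directories that are ignored, and therefore left alone on disk.
left_alone = [d for d in entries if not seen_by_build_locale_archive(d)]
print(left_alone)  # ['C.utf8']
```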
This is similar to how Debian did it.
I did add the LC_* sections which are missing in the Debian source, though, because localedef prints a warning when sections are missing and I don’t want to use the “-c” option to force the output in spite of the warnings.
This locale is very close in behaviour to C/POSIX, except that LC_CTYPE just copies from glibc/localedata/locales/i18n (which we have just updated for Unicode 8.0.0).
So this C.UTF-8 gives us a locale which is very much like the C locale but uses UTF-8 encoding and tools like "ls" will display all printable characters from Unicode instead of displaying question marks for everything non-ASCII.
Sorting (LC_COLLATE) is done strictly via Unicode code point order which gives the same sorting for the ASCII range as the traditional C locale.
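To illustrate what strict code point order means in practice, here is a small sketch in Python. It does not use the C.UTF-8 locale itself; it only relies on two facts: Python compares strings by code point, and the byte order of valid UTF-8 matches code point order. The sample words are arbitrary.

```python
# Sketch: code point order vs. byte order of the UTF-8 encoding.
# This does not call into glibc; it only models the ordering.

words = ["zebra", "Apple", "éclair", "apple", "Zoo", "日本"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encoded bytes (what a bytewise compare would see).
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# For the ASCII-only subset, code point order is plain ASCII byte order,
# which is how the traditional C locale sorts.
ascii_words = [w for w in words if w.isascii()]
ascii_by_code_point = sorted(ascii_words)
ascii_by_bytes = sorted(ascii_words, key=lambda s: s.encode("ascii"))

print(by_code_point)
# ['Apple', 'Zoo', 'apple', 'zebra', 'éclair', '日本']
```

Uppercase ASCII letters sort before lowercase ones, and everything non-ASCII sorts after the ASCII range.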
Debian does it like this:

    LC_COLLATE
    order_start forward
    <U0000>
    <U0001>
    ... all code points listed individually, leaving out the unassigned ranges ...
    <U10FFFE>
    <U10FFFF>
    UNDEFINED
    order_end
    END LC_COLLATE

(more than 300000 lines of code points listed).
I used this instead:

    LC_COLLATE
    order_start forward
    <U0000>
    ..
    <UFFFF>
    <U10000>
    ..
    <U1FFFF>
    <U20000>
    ..
    <U2FFFF>
    <UE0000>
    ..
    <UEFFFF>
    <UF0000>
    ..
    <UFFFFF>
    <U100000>
    ..
    <U10FFFF>
    UNDEFINED
    order_end
    END LC_COLLATE

This makes the source much shorter and more readable. The result in the binary is the same and the size of the binary is the same as well: the complete binary locale needs about 1.8M, almost all of it because of LC_COLLATE (same on Debian).
Not skipping the ranges currently not assigned in Unicode would make the locale about 6.5M big. That seems theoretically better to me, but there is probably little benefit in sorting unassigned code points by code point order as well.
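For a sense of scale, the following sketch counts how many code points the six ranges above cover compared to the whole Unicode code space. It counts code points only and says nothing exact about binary size; the 1.8M and 6.5M figures are the ones quoted in this message.

```python
# Count the code points covered by the ranges used in LC_COLLATE above.
RANGES = [
    (0x0000, 0xFFFF),      # plane 0
    (0x10000, 0x1FFFF),    # plane 1
    (0x20000, 0x2FFFF),    # plane 2
    (0xE0000, 0xEFFFF),    # plane 14
    (0xF0000, 0xFFFFF),    # plane 15
    (0x100000, 0x10FFFF),  # plane 16
]

covered = sum(hi - lo + 1 for lo, hi in RANGES)
total = 0x10FFFF + 1       # whole code space, U+0000 .. U+10FFFF
skipped = total - covered  # planes 3 to 13

print(covered, total, skipped)  # 393216 1114112 720896
```

So the ranges list 6 of the 17 planes, and the skipped planes 3 to 13 account for most of the code space.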
Actually I think this should be enough:

    LC_COLLATE
    order_start forward
    UNDEFINED
    order_end
    END LC_COLLATE

because of:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
opengroup> The symbol UNDEFINED shall be interpreted as including all coded
opengroup> character set values not specified explicitly or via the ellipsis
opengroup> symbol. Such characters shall be inserted in the character collation
opengroup> order at the point indicated by the symbol, and in ascending order
opengroup> according to their coded character set values. If no UNDEFINED symbol is
opengroup> specified, and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a warning message and
opengroup> place such characters at the end of the character collation order.
But unfortunately it does not work like that in glibc, which is probably a bug.
I tried to look at the code to find out why UNDEFINED does not work as specified in the standard but could not figure it out yet.
(UNDEFINED currently does not work at all as specified. Most locale sources have UNDEFINED somewhere, but the characters not specified explicitly do not get inserted where UNDEFINED is; instead they end up right at the top, before all other characters.)
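The difference between the specified and the observed handling of UNDEFINED can be modelled with a small sketch. Both functions are toy models with hypothetical names, not glibc code. The second one only encodes the observation reported in this message; that the unspecified characters are themselves in ascending order at the top is an assumption.

```python
# Toy model of where characters not listed explicitly end up in the
# collation order. Not glibc code; names are made up for illustration.

def posix_order(before, after, charset):
    """Order per the quoted POSIX text: unlisted characters are inserted
    at the position of UNDEFINED, in ascending coded value."""
    listed = set(before) | set(after)
    unlisted = sorted(c for c in charset if c not in listed)
    return list(before) + unlisted + list(after)

def observed_order(before, after, charset):
    """Order as observed in this message: unlisted characters land at the
    top, before everything listed. ASSUMPTION: ascending among themselves."""
    listed = set(before) | set(after)
    unlisted = sorted(c for c in charset if c not in listed)
    return unlisted + list(before) + list(after)

charset = ["a", "b", "c", "d", "e"]

# Source order: "c", then UNDEFINED, then "a".
print(posix_order(["c"], ["a"], charset))     # ['c', 'b', 'd', 'e', 'a']
print(observed_order(["c"], ["a"], charset))  # ['b', 'd', 'e', 'c', 'a']

# With nothing but UNDEFINED listed, the POSIX reading gives plain
# ascending order of the whole character set.
print(posix_order([], [], charset))           # ['a', 'b', 'c', 'd', 'e']
```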
On 09/16/2015 09:55 AM, Mike FABIAN wrote:
> This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/ to our glibc packages.
The patch looks good to me, but I don't see the changes where you add this to /usr/lib/locale/C.utf8? Did you miss adding a patch?
Thanks again for all the help Mike.
Note that on old systems where /usr is not pre-mounted, you won't have this locale, or any locales available. That's fine and expected, but just a point I wanted to make, and that we should keep in mind for non-builtin locales.
> But unfortunately it does not work like that in glibc, which is probably a bug.
Please file a placeholder bug upstream for this, since it should work.
Please also file a bug for adding C.UTF-8 to upstream glibc, so we can reference this bug when doing the upstream port. Do not suggest an implementation though. When we push upstream we'll show that our implementation is simply to build another locale and install it.
Lastly, file a bug stating that the C and POSIX locales are not conforming because they include more than <space> and <tab> in <blank>. This is to cover our own work here, and show that it is an existing bug that more than these are in blank. I would also suggest the bug say "Alternatively we can file an issue with the Austin Group to adjust the text not to forbid more characters being included in blank."
> I tried to look at the code to find out why UNDEFINED does not work as specified in the standard but could not figure it out yet.
> (UNDEFINED currently does not work at all as specified, most locale sources have UNDEFINED somewhere, but the characters not specified explicitly do not get inserted where UNDEFINED is but instead right at the top (before all other characters).
Cheers, Carlos.
"Carlos O'Donell" carlos@redhat.com wrote:
> On 09/16/2015 09:55 AM, Mike FABIAN wrote:
>> This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/ to our glibc packages.
> The patch looks good to me, but I don't see the changes where you add this to /usr/lib/locale/C.utf8? Did you miss adding a patch?
I forgot to include the changes in the glibc.spec file in the patch.
The complete patch is attached.
Mike FABIAN mfabian@redhat.com wrote:
> This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/ to our glibc packages.
We have that patch in Fedora 24 and rawhide already.
I would like to push it to Fedora 22 and Fedora 23 as well.
Attached is the patch for Fedora 23. The patch for Fedora 22 is almost identical.
I tested this in qemu.
On 03/02/2016 11:07 AM, Mike FABIAN wrote:
> Mike FABIAN mfabian@redhat.com wrote:
>> This patch adds a C.UTF-8 locale as a folder /usr/lib/locale/C.utf8/ to our glibc packages.
> We have that patch in Fedora 24 and rawhide already.
> I would like to push it to Fedora 22 and Fedora 23 as well.
> Attached is the patch for Fedora 23. The patch for Fedora 22 is almost identical.
> I tested this in qemu.
This looks good to me.
We've run with this patch in rawhide for quite a while, and adding it to F22 and F23 is important to keep pushing the story that we have a C.UTF-8 locale that is always present and usable by applications.
Cheers, Carlos.