Normally we fix up non-utf8 documentation and such with a quick call to iconv. It seems that this is problematic for some; see https://bugzilla.redhat.com/show_bug.cgi?id=226079
Any comments on how much we actually care about this, especially when it might not be as easy as a call to iconv (such as a changelog file with a pile of random encodings in it)?
- J<
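For context, a minimal sketch of what that iconv fixup does, assuming the source encoding is known per file (the ISO-8859-1 default and the function name here are illustrative, not part of any actual build script):

```python
def convert_to_utf8(path, src_encoding="iso-8859-1"):
    """Re-encode a text file in place, like `iconv -f ISO-8859-1 -t UTF-8`."""
    # Decode with the assumed source encoding; this raises UnicodeDecodeError
    # if the file is not actually in that encoding (the problem the bug hits).
    with open(path, encoding=src_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

The failure mode the bug report describes is exactly the first step: a file with several encodings mixed together has no single `src_encoding` that decodes it correctly.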
Jason L Tibbitts III wrote:
Normally we fix up non-utf8 documentation and such with a quick call to iconv. It seems that this is problematic for some; see https://bugzilla.redhat.com/show_bug.cgi?id=226079
Any comments on how much we actually care about this, especially in the case that it might not actually be as easy as a call to iconv (such as a changelog file with a pile of random encodings in it).
Well... The reason that all files must be UTF-8 is exactly the problem that the ChangeLog exhibits, so I don't have a lot of sympathy there. The names and special characters in that file are already corrupted, since there's no common encoding and none is recorded with the names. Dropping it from the package, as Daniel suggested, is certainly an option: there's no requirement that ChangeLogs be included in a package, so this is not something that must be changed.
Re-encoding the XML files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to changing to UTF-8. Since libxml2 deals with UTF-8 internally and the upstream author made a nice write-up about why he made that choice, upstream might be amenable to that. If upstream is not amenable, we should consider changing the Packaging Guidelines to reflect that XML files which specify their encoding do not have to be re-encoded to UTF-8. (Although we then have to ask ourselves whether we should be checking that the XML files actually use the encoding that they specify :-( )
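That parenthetical check is simple enough to sketch with a plain regex over the XML declaration (the function names are illustrative; a real checker would also need to handle BOMs and UTF-16 declarations, which this does not):

```python
import re

def xml_declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration; XML defaults to UTF-8."""
    m = re.match(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', data)
    return m.group(1).decode("ascii") if m else "utf-8"

def matches_declared_encoding(data: bytes) -> bool:
    """True if the bytes decode cleanly under their own declared encoding."""
    try:
        data.decode(xml_declared_encoding(data))
        return True
    except (UnicodeDecodeError, LookupError):
        return False
```

A file that declares `encoding="ascii"` but contains non-ASCII bytes would fail this check, which is the mismatch being worried about above.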
NEWS and other files that neither specify an encoding nor are mixed up so badly that the original characters are hopelessly corrupted should definitely be converted to UTF-8. If Daniel wants to hold the Merge Review open until that has gone in upstream, that is his prerogative.
The most chilling aspect of that review is that the maintainer does not seem to think it's his responsibility to take issues with the upstream source to upstream. Since Daniel is upstream, I can't see why he feels that someone else should report it upstream before he deals with it.
-Toshio
Toshio Kuratomi wrote:
Jason L Tibbitts III wrote:
Normally we fix up non-utf8 documentation and such with a quick call to iconv. It seems that this is problematic for some; see https://bugzilla.redhat.com/show_bug.cgi?id=226079
Any comments on how much we actually care about this, especially in the case that it might not actually be as easy as a call to iconv (such as a changelog file with a pile of random encodings in it).
Well... The reason that all files must be UTF-8 is exactly the problem that the ChangeLog exhibits so I don't have a lot of sympathy there.
+1,
Although I fully agree with Daniel that blindly converting text-ish files which actually specify an encoding in their headers is both wrong and dangerous, since that actually breaks things, normal text files, especially those in %doc, should be in UTF-8 so that they display correctly when opened.
Indeed, the ChangeLog is a perfect example of why all plain text files must be UTF-8: had it always been UTF-8, the problems between the parts in a west-European encoding and the parts in an east-European encoding would not exist.
Also, I think it's worth noting that Fedora is not the only distro doing this; Debian, for example, also tries to have all text files in the distro in UTF-8.
I'll also put a comment to this effect in the review.
Regards,
Hans
The names and special characters in that file are already corrupted, since there's no common encoding and none is recorded with the names. Dropping it from the package, as Daniel suggested, is certainly an option: there's no requirement that ChangeLogs be included in a package, so this is not something that must be changed.
Re-encoding the XML files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to changing to UTF-8. Since libxml2 deals with UTF-8 internally and the upstream author made a nice write-up about why he made that choice, upstream might be amenable to that. If upstream is not amenable, we should consider changing the Packaging Guidelines to reflect that XML files which specify their encoding do not have to be re-encoded to UTF-8. (Although we then have to ask ourselves whether we should be checking that the XML files actually use the encoding that they specify :-( )
NEWS and other files that neither specify an encoding nor are mixed up so badly that the original characters are hopelessly corrupted should definitely be converted to UTF-8. If Daniel wants to hold the Merge Review open until that has gone in upstream, that is his prerogative.
The most chilling aspect of that review is that the maintainer does not seem to think that it's his responsibility to take issues with the upstream source to upstream. Since Daniel is upstream, I'm not certain I can see why he feels that someone else should be reporting it upstream before he deals with it.
-Toshio
-- Fedora-packaging mailing list Fedora-packaging@redhat.com https://www.redhat.com/mailman/listinfo/fedora-packaging
On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
Reencoding the xml files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to
I think that re-encoding files that carry their own encoding information (info, texinfo, TeX, and XML, for example) is wrong. It is better to let upstream do whatever they want. The same goes for code examples: better to leave the encoding upstream prefers.
For NEWS/ChangeLog, other text files in %doc, and also man pages that are not installed for a specific non-UTF-8 locale, I agree that converting to UTF-8 is better.
-- Pat
Patrice Dumas wrote:
On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
Reencoding the xml files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to
I think that reencoding files that carry over the encoding information (info, texinfo, tex and xml for example) is wrong. It is better to let upstream do whatever they want. Same for examples of code, better leave the encoding preferred by upstream.
For NEWS/Changelog, other text files in %doc and also man pages that are not installed in a non utf8 locale, I agree that converting to UTF-8 is better.
I'm almost in complete agreement with you. The one extra piece that I think should be considered is how the text is normally viewed/edited.
For instance, if a program has a plain-text data file and expects the data to be encoded in UTF-16, that should stay UTF-16. Since the end user never views the file and the program has an expectation of what's in it, this is perfectly acceptable.
However, the flip side of this is a program with an XML config file that the user is expected to edit manually in a text editor, where the program will adapt to multiple encodings (for instance, by using libxml2 to parse the file [1]_). Having that file exist in UTF-8 is much better than having it exist in SOME_EXOTIC_ENCODING. In this case it's the program that doesn't care whether the config file is in UTF-8 or Shift JIS, but the user who opens the file in a text editor will be presented with garbage if the text does not match the system default encoding. Yes, the user can manually change the encoding that is displayed and saved in some editors, but:
1) This is not the full range of editors.
2) The user has to learn to enable the new encoding in their editor. This involves reading, editing, and saving: some editors will display garbage unless you set the correct encoding on startup, others can change it while running; some convert on open with a best guess at what the bytes mean, but you then have to specify an encoding to save the result, otherwise you get the default (UTF-8, or whatever your locale settings dictate).
3) If the user wants to use characters that are not present in the encoding the file is written in (for instance, the file is encoded in KOI8-R but the user wants to use kanji), they'll have to convert the file to one of the Unicode encodings, and edit the header that declares the character set, before making their changes.
So really, the dividing line should be whether the user is intended to edit/view the file directly, rather than through a program that can handle the encoding appropriately, not whether the format specifies an encoding.
.. _[1]: http://xmlsoft.org/encoding.html#Default
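The KOI8-R case in point 3 is easy to demonstrate: KOI8-R simply has no code points for CJK characters, so the file cannot hold kanji at all until it is re-encoded (the sample strings below are illustrative):

```python
russian = "Привет"    # Cyrillic: representable in KOI8-R
japanese = "日本語"   # kanji: not representable in KOI8-R

# Cyrillic text round-trips through KOI8-R without trouble.
assert russian.encode("koi8_r")

# Kanji cannot be encoded at all; the attempt raises UnicodeEncodeError.
raised = False
try:
    japanese.encode("koi8_r")
except UnicodeEncodeError:
    raised = True
assert raised

# UTF-8 holds both scripts in a single file, which is the whole point.
assert (russian + " " + japanese).encode("utf-8")
```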
Whether this is something we should do in our packages even if upstream doesn't accept the changes involves other factors. In the case of documentation files that have no declared encoding, we should convert whether or not upstream agrees. In the case of documentation that does specify its encoding, I lean towards converting [2]_. In the case of a file that is used by a program, we should definitely have a conversation with upstream about it, although we could convert locally with upstream's blessing (i.e., upstream says: "I'm going to continue writing my xml config file in latin-1. If you want to convert them to utf-8 for your users that's fine -- I'm going to continue to use a library for xml parsing that understands encodings.")
.. _[2]: Note that this is only for documentation which is supposed to be viewed directly. xhtml, for instance, is normally going to be viewed in a browser, so this would not apply.
-Toshio
On Sat, May 31, 2008 at 04:09:25PM -0700, Toshio Kuratomi wrote:
However, the flipside of this is if a program has an xml config file that the user is expected to edit manually in a text editor and the program will adapt to multiple encodings (for instance, by using libxml2 to parse the file[1]_) having it exist in utf-8 is much better than having it exist in SOME_EXOTIC_ENCODING. In this case it's the program
I disagree. It is not an obvious choice and should be left to the maintainer. It depends on the target users of the software, for instance.
not upstream agrees. In the case of documentation that does specify the encoding I lean towards converting [2]_. In the case of a file that is used by a program we should definitely have a conversation with upstream about it, although we could convert locally with upstream's blessing (ie: Upstream says: "I'm going to continue writing my xml config file in latin-1. If you want to convert them to utf-8 for your users that's fine -- I'm going to continue to use a library for xml parsing that understands encodings.")
Once again, better to leave it to the maintainer. This doesn't prevent us from issuing recommendations, though.
-- Pat
Patrice Dumas wrote:
On Sat, May 31, 2008 at 04:09:25PM -0700, Toshio Kuratomi wrote:
However, the flipside of this is if a program has an xml config file that the user is expected to edit manually in a text editor and the program will adapt to multiple encodings (for instance, by using libxml2 to parse the file[1]_) having it exist in utf-8 is much better than having it exist in SOME_EXOTIC_ENCODING. In this case it's the program
I disagree. It is not an obvious choice and should be left to the maintainer. It depends on the user target of the software, for instance.
Please state your counterexample. I'm laying out the parameters by which we could relax the current rule. If we don't lay out the boundaries correctly, the replacement rule will end up still being too restrictive.
-Toshio
On Sun, Jun 01, 2008 at 10:17:32AM -0700, Toshio Kuratomi wrote:
Patrice Dumas wrote:
On Sat, May 31, 2008 at 04:09:25PM -0700, Toshio Kuratomi wrote:
However, the flipside of this is if a program has an xml config file that the user is expected to edit manually in a text editor and the program will adapt to multiple encodings (for instance, by using libxml2 to parse the file[1]_) having it exist in utf-8 is much better than having it exist in SOME_EXOTIC_ENCODING. In this case it's the program
I disagree. It is not an obvious choice and should be left to the maintainer. It depends on the user target of the software, for instance.
Please state your counter example. I'm laying out the parameters by which we could relax the current rule. If we don't lay out the boundaries correctly the replacement rule will end up still being too restrictive.
I may be wrong, but it seems to me that there is no current rule? Except that rpmlint warnings/errors should be handled if possible; there is nothing about this in the guidelines (the spec file and filenames should be UTF-8, though).
Here is a wording that would seem right to me:
Files that don't carry information about their encoding should be converted to UTF-8. This is typically useful for NEWS files containing author names with accented characters. There may be exceptions; for example, a README.cn written in Chinese may be encoded in a popular Chinese encoding like Big5.
Files that carry their own encoding (XML, TeX, info...) may also be converted to UTF-8, but the decision is left to the package maintainer. This may be especially relevant for files that are to be edited by the user, since it may be difficult to edit a file that is not in UTF-8, while UTF-8 should be handled by most editors automatically, as Fedora's default locale uses UTF-8.
-- Pat
Patrice Dumas wrote:
On Sun, Jun 01, 2008 at 10:17:32AM -0700, Toshio Kuratomi wrote:
Patrice Dumas wrote:
On Sat, May 31, 2008 at 04:09:25PM -0700, Toshio Kuratomi wrote:
However, the flipside of this is if a program has an xml config file that the user is expected to edit manually in a text editor and the program will adapt to multiple encodings (for instance, by using libxml2 to parse the file[1]_) having it exist in utf-8 is much better than having it exist in SOME_EXOTIC_ENCODING. In this case it's the program
I disagree. It is not an obvious choice and should be left to the maintainer. It depends on the user target of the software, for instance.
Please state your counter example. I'm laying out the parameters by which we could relax the current rule. If we don't lay out the boundaries correctly the replacement rule will end up still being too restrictive.
I may be wrong, but it seems to me that there is no current rule? Except that rpmlint warning/errors should be handled if possible, but there is nothing about that in the guidelines (spec file and filename should be utf8, though).
My bad; I must have been recalling the debates over the filenames-must-be-UTF-8 guideline. If there's no current guideline, then I'm not sure we need a new one.
Here is a wording that would seem right to me:
Files that don't carry information about their encoding should be converted to UTF-8. This is typically useful for NEWS files containing author names with accented characters. There may be exceptions; for example, a README.cn written in Chinese may be encoded in a popular Chinese encoding like Big5.
I could go either way on this but lean towards saying these should be UTF-8. Shift JIS, Big5, etc. have benefits over UTF-8, and the people who use them are the consumers of this file. OTOH, for Fedora to truly support the UTF-8 locale out of the box, these kinds of files (which don't specify an encoding and aren't used by the program) have to be UTF-8. How can we ship with a UTF-8 locale by default, knowing that the README.cn isn't readable by people who stick with our default?
Files that carry over their encoding (xml, tex, info...) may also be converted to UTF-8, but the decision is left to the package maintainer. It may be especially relevant for files that are to be edited by the user, since it may be difficult to edit a file not in UTF-8, while UTF-8 should be handled by most editors automatically, as the default for fedora is an UTF-8 locale.
This part seems quite reasonable as a recommendation.
-Toshio
On Sun, Jun 01, 2008 at 01:18:02PM -0700, Toshio Kuratomi wrote:
My bad; I must have been recalling the debates over the filenames-must-be-UTF-8 guideline. If there's no current guideline, then I'm not sure we need a new one.
Agreed.
I could go either way on this but lean towards saying these should be UTF-8. Shift JIS, Big5, etc. have benefits over UTF-8, and the people who use them are the consumers of this file. OTOH, for Fedora to truly support the UTF-8 locale out of the box, these kinds of files (which don't specify an encoding and aren't used by the program) have to be UTF-8. How can we ship with a UTF-8 locale by default, knowing that the README.cn isn't readable by people who stick with our default?
Once again, I think it really depends on the package's user community. I don't know anything about Asian encodings, but if the packager thinks that users expect a file encoded in Big5, he could leave it in the original encoding.
-- Pat
Patrice Dumas wrote:
On Sun, Jun 01, 2008 at 01:18:02PM -0700, Toshio Kuratomi wrote:
I could go either way on this but lean towards saying these should be UTF-8. Shift JIS, Big5, etc. have benefits over UTF-8, and the people who use them are the consumers of this file. OTOH, for Fedora to truly support the UTF-8 locale out of the box, these kinds of files (which don't specify an encoding and aren't used by the program) have to be UTF-8. How can we ship with a UTF-8 locale by default, knowing that the README.cn isn't readable by people who stick with our default?
Once again, I think it really depends on the package's user community. I don't know anything about Asian encodings, but if the packager thinks that users expect a file encoded in Big5, he could leave it in the original encoding.
I think you're wrong on this. We ship with a default locale choice that has UTF-8 as the encoding. If the user changes the default, they can't expect things to work. More importantly, if the user leaves the defaults alone, they should be able to expect things to work. The choice of UTF-8 as the locale encoding is something we make at the distribution level; everything we ship in the distribution should work with that choice.
For files that do not declare their encoding, there's no way to satisfy that requirement except to ship them as UTF-8.
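This is easy to demonstrate: Big5 bytes read under a UTF-8 locale turn into replacement characters, and with no declared encoding in the file there is nothing a tool could use to recover (the sample text below is illustrative):

```python
# A README.cn written in Big5, with no encoding recorded anywhere in the file.
big5_bytes = "中文".encode("big5")

# Under the distribution default (a UTF-8 locale), these bytes do not decode;
# a terminal or editor shows replacement characters (mojibake) instead.
decoded = big5_bytes.decode("utf-8", errors="replace")
assert "\ufffd" in decoded

# Shipped as UTF-8, the same text round-trips cleanly under the default locale.
assert "中文".encode("utf-8").decode("utf-8") == "中文"
```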
-Toshio
On Saturday 31 May 2008, Patrice Dumas wrote:
On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
Reencoding the xml files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to
I think that reencoding files that carry over the encoding information (info, texinfo, tex and xml for example) is wrong. It is better to let upstream do whatever they want.
I agree with Toshio on this one; IMO it's not necessarily wrong. Anyway, wrt XML files, upstreams probably wouldn't mind a friendly reminder that the only encodings conformant XML processors are required to support are UTF-8 and UTF-16.
On Mon, Jun 02, 2008 at 06:26:03PM +0300, Ville Skyttä wrote:
On Saturday 31 May 2008, Patrice Dumas wrote:
On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
Reencoding the xml files that specify an encoding isn't strictly necessary. We should probably ask upstream whether they are amenable to
I think that reencoding files that carry over the encoding information (info, texinfo, tex and xml for example) is wrong. It is better to let upstream do whatever they want.
I agree with Toshio on this one, IMO it's not necessarily wrong. Anyway wrt.
I don't disagree. I think it should be left to the maintainer.
-- Pat