Python 3 porting and unicode

Python 3 porting and unicode

wenzeslaus
Dear all,

after looking at different Python 2 to 3 porting issues, doing r71849, and reading #3392, I understand the following:

* Several solutions for porting exist. The most recent one is the python-future project, but only from __future__ import ... is part of the standard library and thus guaranteed with recent Python 2.7. (We can discuss concrete steps separately.)

* However, the most challenging part of the porting will be the unicode.

* There is no way around the unicode when using Python 3. Unicode is an inherent part of the language; even things such as os.environ or sys.stdout.write() work only with unicode. I'm not sure what exactly the rule is here, but it seems to be everywhere.

* I haven't seen any simple fix which would limit the changes in the code in the way in which, e.g., the print statement can be fixed.

* The GUI will always use unicode because that's how the libraries and interfaces are set.

* In relation to the previous point, one of the reasons why unicode is used is that things like text[:10] actually return 10 characters to display.

* The C library will not use unicode for now.

* Users of the Python API who are using Python 3 will expect unicode strings to work, i.e. expect run_command('g.region', flags='p') to work (not just run_command(b'g.region', flags=b'p')).

* If Python libraries are unicode, there will need to be an interface to work with ctypes which would add to existing code for transferring from C world to Python and back.

* If Python libraries are bytes, there will need to be an interface to work with GUI in unicode as well as with users of the API who will expect unicode to work. In other words, internally it would use bytes, but interface must be both bytes (for modules and internal use) and unicode (for GUI and users).

* Having a unicode-based library means encoding and decoding on any "external" interface such as file reading or ctypes.

* Having a bytes-based library means encoding and decoding on any Python 3 interface such as os.environ and additionally rewriting all string literals ("abc") to bytes (b"abc").

* It seems hard to predict when we will know the right encoding of the text. It seems that we will need it with any solution, since the garbage-in, garbage-out approach stops working as soon as you need to use some system interface function in Python 3 which requires unicode. Although, e.g., sys.stdout.write() has a (less generic) sys.stdout.buffer.write() alternative, os.environb does not work on MS Windows.
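
As a rough sketch of the ctypes point in the list above, this is the kind of conversion a unicode-based Python library would need at the C boundary. The helper names to_c_string and from_c_string and the UTF-8 default are assumptions made for illustration; they are not existing GRASS functions.

```python
import ctypes


def to_c_string(text, encoding="utf-8"):
    """Encode a Python 3 str into a ctypes char pointer (hypothetical helper)."""
    return ctypes.c_char_p(text.encode(encoding))


def from_c_string(pointer, encoding="utf-8"):
    """Decode the bytes behind a ctypes char pointer back into a str."""
    return pointer.value.decode(encoding)


c_name = to_c_string("elevation")
print(type(c_name.value))     # the C side sees bytes
print(from_c_string(c_name))  # the Python side gets a str back
```

The encoding has to be chosen at exactly this point, which is where the question of knowing the right encoding comes in.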

An example fix in r71849 is done using a (custom) decode function which creates unicode (the standard string in Python 3) when file content is read. An alternative to this change would be changing all the strings in the file to bytes (b'abc' as opposed to 'abc').
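
For illustration only, a decode function of this kind could look roughly like the sketch below under Python 3. This is not the code from r71849; the names decode and encode, the UTF-8 default, and the sample data are assumptions.

```python
def decode(data, encoding="utf-8"):
    """Return str: bytes are decoded, str is passed through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def encode(text, encoding="utf-8"):
    """Return bytes: str is encoded, bytes are passed through unchanged."""
    if isinstance(text, str):
        return text.encode(encoding)
    return text


raw = b"name=elevation\n"  # stands in for content read in binary mode
line = decode(raw)
# After decoding, ordinary str literals such as "=" work without a b prefix.
key, value = line.strip().split("=")
print(key, value)
```

With the bytes-based alternative, the split would have to be written as raw.strip().split(b"=") and every other literal in the file changed the same way.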

Please comment or link other related discussions.

Thanks,
Vaclav


python3 -c "import os; os.environ[b'abc'] = b'def'"
python3 -c "import os; os.environb[b'abc'] = b'def'"
python3 -c "import sys; sys.stdout.write(b'abc\n')"
python3 -c "import sys; sys.stdout.buffer.write(b'abc\n')"
python3 -c "import os; print(type(os.name))"

_______________________________________________
grass-dev mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/grass-dev

Re: Python 3 porting and unicode

Laurent C.
Hi Vaclav,

I think that it would make much more sense to have the GRASS Python
libraries use unicode, and to add an interface managing the
translation to/from bytes when dealing with C code.
Python programmers using the GRASS libraries will expect unicode strings.

Laurent


2017-11-26 21:21 GMT-06:00 Vaclav Petras <[hidden email]>:

> https://trac.osgeo.org/grass/changeset/71849
> https://trac.osgeo.org/grass/ticket/2708
> https://trac.osgeo.org/grass/ticket/3392
> https://trac.osgeo.org/grass/query?status=!closed&keywords=~python3
> https://trac.osgeo.org/grass/query?status=!closed&keywords=~encoding
> https://trac.osgeo.org/grass/query?status=!closed&keywords=~unicode

Re: Python 3 porting and unicode

Glynn Clements
In reply to this post by wenzeslaus

Vaclav Petras wrote:

> * There is no way around the unicode when using Python 3. Unicode is
> an inherent part of the language; even things such as os.environ or
> sys.stdout.write() work only with unicode. I'm not sure what exactly the
> rule is here, but it seems to be everywhere.

Python 3 has os.environb on Unix. You can use the .detach() method on
text streams to get the underlying binary stream.
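
A small sketch of both points, using an in-memory stream as a stand-in for the real stdout so that nothing is detached from the running interpreter (that substitution is an assumption made for the example):

```python
import io
import os

# os.environb exists only where the OS environment is natively bytes.
print(os.supports_bytes_environ)

# An in-memory binary stream wrapped in a text layer, like sys.stdout is.
binary = io.BytesIO()
text_stream = io.TextIOWrapper(binary, encoding="utf-8", newline="\n")

text_stream.write("abc\n")    # the text layer accepts only str
raw = text_stream.detach()    # flushes and returns the binary stream
raw.write(b"\xff raw\n")      # bytes are written without any encoding step

print(raw.getvalue())
```

After detach() the text wrapper is no longer usable, so this is a one-way switch for the stream in question.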

> * In relation to the previous point, one of the reasons why unicode is used
> is that things like text[:10] actually return 10 characters to display.

Although some of those characters may be combining characters or
control codes. Unicode characters don't necessarily map 1:1 with
glyphs.
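
A short illustration with the standard unicodedata module; the sample string is chosen for the example:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one glyph.
text = "e\u0301"

print(len(text))                       # number of code points, not glyphs
print(unicodedata.combining(text[1]))  # non-zero for a combining character
print(len(text[:1]))                   # slicing here drops the accent

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", text)
print(len(composed))
```

Normalization helps for this pair, but not every combining sequence has a precomposed form, so slicing by code points still does not guarantee whole glyphs.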

> * Users of the Python API who are using Python 3 will expect unicode
> strings to work, i.e. expect run_command('g.region', flags='p') to work
> (not just run_command(b'g.region', flags=b'p')).

Even if you automatically encode unicode strings, there's no guarantee
that it will work (e.g. if the string is a filename, then the encoded
string must produce the correct sequence of bytes).
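
For example, the same name yields different byte sequences depending on the encoding picked. The file name below is hypothetical:

```python
name = "caf\u00e9.tif"  # hypothetical file name as a Python 3 str

as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("latin-1")

print(as_utf8)
print(as_latin1)

# If the file was created under a Latin-1 locale, the UTF-8 bytes refer to
# a different (most likely nonexistent) file.
print(as_utf8 == as_latin1)
```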

I can't think of any significant cases where it's likely to be
necessary to pass "binary" data via arguments, although it should be
trivial to simply accept data which is already a byte string.

The bigger issue is with output: the output from GRASS commands isn't
guaranteed to be in the locale's encoding (if it's extracted from a
file, it's going to be in whatever encoding the file uses). Returning
bytes allows the user to deal with this; automatically decoding the
data will either raise an exception or return mojibake if the encoding
doesn't match.
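
Both failure modes can be reproduced with a few lines. The byte strings stand in for command output and are made up for the example:

```python
# Pretend command output that was written in Latin-1.
latin1_output = "caf\u00e9\n".encode("latin-1")

# Case 1: decoding with the wrong encoding raises an exception.
try:
    latin1_output.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"
print(strict_result)

# Case 2: the bytes happen to be decodable, so the result is mojibake.
utf8_output = "caf\u00e9\n".encode("utf-8")
mojibake = utf8_output.decode("latin-1")
print(repr(mojibake))

# Returning bytes lets a caller who knows the encoding decode correctly.
print(repr(latin1_output.decode("latin-1")))
```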

> * It seems hard to predict when we will know the right encoding of the
> text.

Which is why byte-oriented interfaces still exist and still matter,
and will do so for the foreseeable future.

Python's solution is to accelerate standardisation on Unicode by
making the alternatives as painful as possible. Yet legacy encodings
remain widespread.

--
Glynn Clements <[hidden email]>