Missing Pieces in Python 3 Unicode

12 thoughts
last posted March 3, 2015, 6:21 a.m.

However, for the status quo, there are still a few pieces missing. For a "sorta decoded" surrogate-escaped string, the dance to turn it back into a properly decoded string with no surrogates looks like this:

sorta_decoded_str.encode(assumed_encoding, errors="surrogateescape").decode(correct_encoding)

The case where the assumed encoding is latin-1 is just a special case of this one: the surrogate escape error handler will never fire in that situation, because latin-1 maps every byte value directly to one of the first 256 Unicode code points.
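
To make the dance concrete, here is a minimal sketch of both cases. The helper name `redecode`, the sample text, and the choice of ASCII as the wrongly assumed encoding are all illustrative assumptions, not part of any standard API:

```python
def redecode(sorta_decoded_str, assumed_encoding, correct_encoding):
    """Hypothetical helper: undo a wrong decode, then decode correctly."""
    # Encoding with the same codec and error handler that produced the
    # string turns the escaped surrogates back into the original bytes.
    raw = sorta_decoded_str.encode(assumed_encoding, errors="surrogateescape")
    return raw.decode(correct_encoding)


# Bytes that are really UTF-8 (sample data chosen for illustration).
original = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Case 1: wrongly assumed ASCII. The two non-ASCII bytes are smuggled
# through as lone surrogates: 'caf\udcc3\udca9'.
sorta_decoded = original.decode("ascii", errors="surrogateescape")
fixed = redecode(sorta_decoded, "ascii", "utf-8")

# Case 2: wrongly assumed latin-1. Every byte maps to a code point, so
# there are no surrogates, only mojibake: 'cafÃ©'.
mojibake = original.decode("latin-1")
fixed_latin1 = redecode(mojibake, "latin-1", "utf-8")

print(fixed, fixed_latin1)
```

In both cases the same call recovers the text, which is why the latin-1 situation needs no separate handling.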