Why can't I swap Unicode characters in Python?
Why can't I swap the Unicode characters in this code?
# -*- coding: utf-8 -*-
character_swap = {'ą': 'a', 'ż': 'z', 'ó': 'o'}
text = 'idzie wąż wąską dróżką'
print text
print ''.join(character_swap.get(ch, ch) for ch in text)
output: idzie wąż wąską dróżką
expected output: idzie waz waska drozka
You need to decode the text first, then encode each character again for the dictionary lookup:
>>> print ''.join(character_swap.get(ch.encode('utf8'), ch) for ch in text.decode('utf8'))
idzie waz waska drozka
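A variation on the same idea, offered only as a minimal sketch: if both the dictionary keys and the text are Unicode strings, the lookup needs no re-encoding step. The helper name swap_characters is hypothetical, and the u'' prefixes are an assumption chosen so the snippet should run on Python 2 as well as Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Sketch: keep the mapping and the text as Unicode strings on both sides.
character_swap = {u'ą': u'a', u'ż': u'z', u'ó': u'o'}

def swap_characters(text, table=character_swap):
    # swap_characters is a hypothetical helper name, not part of the question.
    # Iterating over a Unicode string yields whole characters, so each
    # character can be looked up directly in the table.
    return u''.join(table.get(ch, ch) for ch in text)

result = swap_characters(u'idzie wąż wąską dróżką')
print(result)
```

Characters that are missing from the table are passed through unchanged, just as in the original `get(ch, ch)` call.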
This is because Python 2 does not decode a byte string by default: iterating over text in the comprehension walks through its bytes, not its characters. This is what it is actually doing:
>>> [i for i in text]
['i', 'd', 'z', 'i', 'e', ' ', 'w', '\xc4', '\x85', '\xc5', '\xbc', ' ', 'w', '\xc4', '\x85', 's', 'k', '\xc4', '\x85', ' ', 'd', 'r', '\xc3', '\xb3', '\xc5', '\xbc', 'k', '\xc4', '\x85']
The character ą is stored as two bytes:
>>> 'ą'
'\xc4\x85'
As you can see, inside the list comprehension Python splits it into two parts, \xc4 and \x85, and neither one matches a key in the dictionary. To get rid of that, you can first decode the text using the utf8 encoding:
>>> [i for i in text.decode('utf8')]
[u'i', u'd', u'z', u'i', u'e', u' ', u'w', u'\u0105', u'\u017c', u' ', u'w', u'\u0105', u's', u'k', u'\u0105', u' ', u'd', u'r', u'\xf3', u'\u017c', u'k', u'\u0105']
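The difference between bytes and characters can be checked on a single word. This is a small sketch, not part of the original answer. The variable names are illustrative, and the word is written as a Unicode literal and then encoded, on the assumption that this mirrors how Python 2 stores a plain str literal in a UTF-8 source file.

```python
# -*- coding: utf-8 -*-
text = u'wąż'                       # three characters
encoded = text.encode('utf-8')      # the same word as UTF-8 bytes

# bytearray yields integer byte values on both Python 2 and Python 3.
byte_values = list(bytearray(encoded))

char_count = len(text)              # counts characters (code points)
byte_count = len(encoded)           # counts bytes
decoded = encoded.decode('utf-8')   # decoding restores the characters

print(char_count)
print(byte_count)
print(byte_values)
```

Here ą and ż each occupy two bytes, so the encoded form is longer than the decoded one, which is why the byte-by-byte iteration never produces a key that is in the dictionary.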