Reputation: 1050
I have to check strings in Japanese that are encoded in double-byte characters (naturally the files aren't in Unicode and I have to keep them in Shift-JIS). Many of these strings contain digits that are also double-byte characters (１２３４５６７８９) instead of standard single-byte digits (0-9). As such, the usual methods of searching for digits won't work (using [0-9] in a regex, or \d, for example).
The only way I've found to make it work is to create a tuple of the double-byte digits and check each one against the string, but is there a more efficient way of doing this?
This is an example of the output I get when searching for double byte numbers:
>>> s = "２34" # "２" is a double-byte integer
>>> if u"2" in s:
    print "y"
>>> if u"2" in s:
    print "y"
y
>>> print s[0]
>>> print s[:2]
２
>>> print s[:3]
２3
Any advice would be greatly appreciated!
Upvotes: 2
Views: 4024
Reputation: 707
I had a similar problem with Japanese two-byte characters. One relatively easy way I found to deal with them is to convert them via their Unicode code points (at least for processing, if you want to keep the document itself as it is):
ord("２")
will return
65298
which is 65248 code points away from the single-byte character 2 (code point 50). So converting back can be done using:
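def convert_two_byte_numbers(character: str) -> str:
    # Full-width digits ０-９ occupy code points 65296-65305.
    if ord(character) in range(65296, 65306):
        # Shift down by 65248 to reach the matching ASCII digit.
        return chr(ord(character) - 65248)
    else:
        return character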
If, like me, you also need to convert two-byte letters, apply the same offset to the ranges (65313, 65339) for uppercase and (65345, 65371) for lowercase.
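For whole strings rather than single characters, the same offset arithmetic can be folded into a translation table. This is only a minimal sketch, assuming Python 3; the names OFFSET, FULLWIDTH_TO_ASCII and to_halfwidth are made up for illustration:

```python
# Hypothetical helper: map full-width digits and Latin letters to ASCII.
OFFSET = 65248  # distance between a full-width character and its ASCII form

# Build one table covering digits, uppercase and lowercase letters.
FULLWIDTH_TO_ASCII = {
    code: code - OFFSET
    for start, stop in ((65296, 65306), (65313, 65339), (65345, 65371))
    for code in range(start, stop)
}


def to_halfwidth(text: str) -> str:
    """Return text with full-width digits and letters replaced by ASCII."""
    return text.translate(FULLWIDTH_TO_ASCII)


print(to_halfwidth("２34ＡＢｃ"))
```

unicodedata.normalize("NFKC", text) performs a similar folding, but it also rewrites other compatibility characters (half-width katakana, for example), so it may change more than intended.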
Upvotes: 0
Reputation: 59158
First of all, the comments are right: for the sake of your sanity, you should only ever work with Unicode inside your Python code, decoding Shift-JIS input as it comes in, and encoding back to Shift-JIS if that's what you need to output:
text = incoming_bytes.decode("shift_jis")
# ... do stuff ...
outgoing_bytes = text.encode("shift_jis")
See: Convert text at the border.
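To see why the border conversion matters, here is a rough Python 3 sketch. The function name find_decimal_chars and the sample bytes are assumptions for illustration, standing in for data read from a real Shift-JIS file:

```python
def find_decimal_chars(incoming_bytes: bytes) -> list:
    """Decode Shift-JIS bytes and return every decimal digit character."""
    text = incoming_bytes.decode("shift_jis")  # bytes -> str at the input border
    # str.isdecimal() is true for full-width digits as well as ASCII ones.
    return [ch for ch in text if ch.isdecimal()]


# Stand-in for file contents: one full-width digit followed by two ASCII digits.
raw = "２34".encode("shift_jis")
text = raw.decode("shift_jis")

print(len(raw), len(text))        # byte count vs. character count
print(find_decimal_chars(raw))
```

On the byte string the full-width digit takes two bytes, so slicing can cut a character in half; on the decoded string every digit is a single character.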
Now that you're doing it right re: unicode and encoded bytestrings, it's straightforward to get either "any digit" or "any double width digit" with a regex:
>>> import re
>>> s = u"２34"
>>> digit = re.compile(r"\d", re.U)
>>> for d in re.findall(digit, s):
...     print d,
...
２ 3 4
>>> wdigit = re.compile(u"[０-９]+")
>>> for wd in re.findall(wdigit, s):
...     print wd,
...
２
In case the re.U flag is unfamiliar to you, it's documented here.
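For anyone on Python 3, a rough equivalent is sketched below. It assumes str patterns, where Unicode matching is already the default, so re.U is not needed and re.ASCII is the flag that narrows \d back to 0-9. The variable names are illustrative only:

```python
import re

s = "２34"  # first digit is full-width (U+FF12), the other two are ASCII

# str patterns are Unicode-aware by default, so \d matches both kinds of digit.
any_digit = re.findall(r"\d", s)

# re.ASCII restricts \d to the ASCII digits 0-9.
ascii_only = re.findall(r"\d", s, re.ASCII)

# An explicit character range matches only the full-width digits.
wide_only = re.findall("[０-９]", s)

print(any_digit, ascii_only, wide_only)
```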
Upvotes: 4