Tensigh

Reputation: 1050

How to detect double-byte numbers

I have to check strings in Japanese that are encoded in double-byte characters (naturally the files aren't in Unicode and I have to keep them in Shift-JIS). Many of these strings contain digits that are also double-byte characters (１２３４５６７８９) instead of standard single-byte digits (0-9). As such, the usual methods of searching for digits won't work (using [0-9] in regex, or \d for example).

The only way I've found to make it work is to create a tuple of the double-byte digits and iterate over it, checking whether each one occurs in the string. Is there a more efficient way of doing this?

This is an example of the output I get when searching for double byte numbers:

>>> s = "２34"  # "２" is a double-byte integer
>>> if u"2" in s:
      print "y"

>>> if u"２" in s:
      print "y"

    y
>>> print s[0]

>>> print s[:2]
    ２
>>> print s[:3]
    ２3

Any advice would be greatly appreciated!

Upvotes: 2

Views: 4024

Answers (2)

Jérôme Bau

Reputation: 707

I had a similar problem with Japanese two-byte characters. One relatively easy way I found to deal with them is to work with their Unicode code points (at least for processing, if you want to keep the document itself as it is):

ord("２")

will return

65298

which is 65248 code points away from the one-byte character 2 (code point 50). So converting back can be done using:

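def convert_two_byte_numbers(character: str) -> str:
    """Map a full-width digit (U+FF10 to U+FF19) to its ASCII counterpart."""
    if ord(character) in range(65296, 65306):
        # Full-width digits sit exactly 65248 code points above "0" to "9".
        return chr(ord(character) - 65248)
    return character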

If, like me, you also need to convert two-byte letters, add the same check for range(65313, 65339) (Ａ to Ｚ) and range(65345, 65371) (ａ to ｚ).
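As a rough sketch of how the digit and letter ranges could be handled together, the offsets can be collected into one translation table and applied with str.translate, so a whole string is converted in one call rather than character by character. The names FULLWIDTH_OFFSET, FULLWIDTH_TO_ASCII and to_halfwidth are made up for this illustration, and the code assumes Python 3 with text that is already decoded to str:

```python
# Hypothetical helper names; only str.translate and range are standard here.
FULLWIDTH_OFFSET = 65248  # distance between a full-width character and its ASCII twin

# Half-open code point ranges: digits, uppercase letters, lowercase letters.
FULLWIDTH_RANGES = ((65296, 65306), (65313, 65339), (65345, 65371))

# Map each full-width code point to the matching ASCII code point.
FULLWIDTH_TO_ASCII = {
    code_point: code_point - FULLWIDTH_OFFSET
    for start, stop in FULLWIDTH_RANGES
    for code_point in range(start, stop)
}


def to_halfwidth(text: str) -> str:
    """Replace full-width digits and Latin letters; leave everything else alone."""
    return text.translate(FULLWIDTH_TO_ASCII)


print(to_halfwidth("２34ＡＢｃ"))  # prints 234ABc
```

Characters outside the three ranges, such as kana and kanji, are not in the table and pass through unchanged.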

Upvotes: 0

schesis

Reputation: 59158

First of all, the comments are right: for the sake of your sanity, you should only ever work with unicode inside your Python code, decoding the Shift-JIS that comes in and encoding back to Shift-JIS if that's what you need to output:

text = incoming_bytes.decode("shift_jis")
# ... do stuff ...
outgoing_bytes = text.encode("shift_jis")

See: Convert text at the border.
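For files, the same idea might look like the following in Python 3, where open() accepts an encoding argument and does the decoding and encoding at the border for you. This is only a sketch: the helper names, the temporary file and the sample string are invented for the example.

```python
import os
import tempfile


def read_shift_jis(path):
    """Return the file's contents as text, decoding from Shift-JIS."""
    with open(path, encoding="shift_jis") as handle:
        return handle.read()


def write_shift_jis(path, text):
    """Write text back to disk, encoding it as Shift-JIS."""
    with open(path, "w", encoding="shift_jis") as handle:
        handle.write(text)


# Round trip through a throwaway file (hypothetical sample data).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_shift_jis(sample_path, "価格は２34円")
text = read_shift_jis(sample_path)

# Inside the program the full-width digit is a single character,
# so membership tests and slicing behave as expected.
print("２" in text, len(text))
```

On disk the file stays Shift-JIS, which is what the question requires; only the in-memory representation changes.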

Now that you're doing it right re: unicode and encoded bytestrings, it's straightforward to get either "any digit" or "any double-width digit" with a regex:

>>> import re
>>> s = u"２34"
>>> digit = re.compile(r"\d", re.U)
>>> for d in re.findall(digit, s):
...     print d,
... 
２ 3 4
>>> wdigit = re.compile(u"[０-９]+")
>>> for wd in re.findall(wdigit, s):
...     print wd,
... 
２

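In case the re.U flag is unfamiliar to you: it makes \d, \w and friends follow the Unicode character properties rather than ASCII only, and it's documented in the Python re module documentation.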
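The session above is Python 2. As a rough sketch of the same checks in Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag switches that off (the variable names are just for illustration):

```python
import re

s = "２34"  # first character is the full-width digit U+FF12

# \d on a str pattern matches any Unicode decimal digit by default.
any_digit = re.findall(r"\d", s)

# re.ASCII restricts \d to the single-byte digits 0-9.
ascii_only = re.findall(r"\d", s, re.ASCII)

# An explicit character class picks out only the full-width digits.
fullwidth_only = re.findall("[０-９]", s)

print(any_digit, ascii_only, fullwidth_only)
```

Here any_digit holds all three characters, ascii_only holds only 3 and 4, and fullwidth_only holds only the leading ２.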

Upvotes: 4
