Ildar Akhmetov

Reputation: 1431

Parsing a non-Unicode string with Flask-RESTful

I have a webhook built with Flask-RESTful that receives several parameters via POST. One of the parameters is a non-Unicode string, encoded in cp1251.

I can't find a way to parse this argument correctly using reqparse.

Here is the fragment of my code:

from flask_restful import reqparse

parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()

Then, I write msg to a text file, and it looks like this:

{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}

As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, such as ! or \n, are processed correctly.

Is there any way to tell RequestParser which encoding the string uses?

Here is my code for writing the text to disk:

 import json

 with open('log_msg.txt', 'w+') as f:
     f.write(json.dumps(msg))

I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.

Then, I tried

 with open('log_msg_ascii.txt', 'w+') as f:
     f.write(ascii(json.dumps(msg)))

Also, no difference.
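
A likely reason the file encoding makes no difference: by default json.dumps escapes every non-ASCII character, so the string handed to f.write() is pure ASCII and is stored identically under any of these encodings. The sketch below illustrates this with a hypothetical, already-damaged value (the dictionary is made up for the example, not taken from a real request):

```python
import json

# Hypothetical parsed value in which the Cyrillic text has already been lost.
msg = {"text": "\ufffd\ufffd !"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(msg)

# With ensure_ascii=False the characters are kept as-is in the output string.
readable = json.dumps(msg, ensure_ascii=False)

print(escaped)
print(escaped.isascii())
```

Since the escaped output still decodes back to U+FFFD characters, the damage happened before the value reached json.dumps, not while writing the file.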

So, I'm pretty sure that RequestParser() tries to be too smart and can't understand the non-Unicode input.

Thanks!

Upvotes: 0

Views: 614

Answers (1)

Ildar Akhmetov

Reputation: 1431

Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse assumes every string parameter arrives as UTF-8. So when it encounters a cp1251-encoded field (among other fields that are valid UTF-8), it tries to decode it as UTF-8 and fails. As a result, every Cyrillic character comes out as U+FFFD (the replacement character), while ASCII characters such as ! survive because they are encoded identically in both encodings.
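
The effect can be reproduced without Flask. This is a minimal sketch, assuming the field is decoded as UTF-8 with invalid bytes replaced, which is what the output in the question suggests; the sample string is made up for illustration:

```python
original = "Привет!"

# The bytes as a cp1251 client would send them.
wire_bytes = original.encode("cp1251")

# Decoding those bytes as UTF-8, replacing whatever cannot be decoded.
as_utf8 = wire_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the client actually used.
as_cp1251 = wire_bytes.decode("cp1251")

print(ascii(as_utf8))
print(as_cp1251)
```

Only the trailing ! survives the UTF-8 decode; the Cyrillic letters are gone for good at that point, which is why the raw bytes are needed.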

So, to access that non-Unicode field, I used the following trick.

I load the raw request body with get_data(), decode it as cp1251, and extract the field with a simple regexp.

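 import re
 from flask import request

 raw_data = request.get_data()  # raw request body as bytes
 contents = raw_data.decode('windows-1251')  # windows-1251 is the same codec as cp1251
 match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
 text = match.group(2) if match else None  # None if the field was not found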

Not the most beautiful solution, but it works.
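
The same idea can be tried outside a request context. The following is only a sketch: extract_field is a hypothetical helper, the sample body and its boundary are built by hand, and the pattern assumes each part has a single Content-Disposition header and that the value never contains a line starting with "--".

```python
import re


def extract_field(raw_body: bytes, name: str, encoding: str = "cp1251"):
    """Return the value of one multipart/form-data field, or None if absent."""
    contents = raw_body.decode(encoding)
    pattern = (
        r'Content-Disposition: form-data; name="' + re.escape(name) + r'"'
        r'\r?\n\r?\n'  # blank line separating the part header from the value
        r'(.*?)'       # the field value, matched non-greedily
        r'\r?\n--'     # line break before the next boundary marker
    )
    match = re.search(pattern, contents, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical request body, built by hand for illustration only.
boundary = "boundary123"
sample = (
    f"--{boundary}\r\n"
    'Content-Disposition: form-data; name="subject"\r\n\r\n'
    "hello\r\n"
    f"--{boundary}\r\n"
    'Content-Disposition: form-data; name="text"\r\n\r\n'
    "Привет!\r\n"
    f"--{boundary}--\r\n"
).encode("cp1251")

print(extract_field(sample, "text"))
```

Inside the webhook, the call would presumably be extract_field(request.get_data(), 'text'). Returning None for a missing field avoids the AttributeError that match.group() raises when nothing matches.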

Upvotes: 1
