Reputation: 95
I'm using Windows 7 64-bit, Python 3, MongoDB, and PyMongo. I know that in Python 3, all strings are Unicode. I also know that MongoDB stores all strings as Unicode. So I don't understand why, when I pull a document from my database where the value of a particular field is "C:\Some Folder\E=mc².xyz", Python treats that string as "C:\Some Folder\E=mcÂ².xyz". It doesn't just print that way; os.path.exists() returns False. Now, as if that weren't confusing enough, if I save the string to a text file and then open it with the encoding explicitly set to "utf-8", the string appears correctly, and os.path.exists() returns True. What's going wrong, and how do I fix it?
Edit: Here's some code I just wrote to demonstrate my problem:
from pymongo import MongoClient
db = MongoClient().test_db
orig_doc = {'string': 'E=mc²'}
_id = db.test_col.insert_one(orig_doc).inserted_id  # insert() was removed in PyMongo 4
new_doc = db.test_col.find_one(_id)
print(new_doc['string'])
>>> E=mc²
As you can see, it works exactly as it should! Thus I now realize that I must've messed up when I migrated from PostgreSQL. Now I just need to fix the strings. I know that it's possible, but there's got to be a better way than writing the strings to a text file and then reading them back. I could do that, just as I did in my previous testing, but it just doesn't seem like the right way.
Upvotes: 0
Views: 2713
Reputation: 177674
You can't store Unicode. It is a concept. MongoDB has to be using an encoding of Unicode, and it looks like UTF-8. Python 3 Unicode strings are stored internally in one of several encodings, depending on the content of the string. What you have is a string decoded to Unicode with the wrong encoding:
>>> s = r'"C:\Some Folder\E=mcÂ².xyz"'  # The invalid decoding.
>>> print(s)
"C:\Some Folder\E=mcÂ².xyz"
>>> print(s.encode('latin1').decode('utf8')) # Undo the wrong decoding, and apply the right one.
"C:\Some Folder\E=mc².xyz"
There's not enough information to tell you how to read MongoDB correctly, but this should help you along.
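If the goal is to repair the stored values without a round trip through a text file, the same encode/decode trick can be applied to every string in a document. The following is only a sketch under assumptions: `fix_mojibake` and `fix_document` are hypothetical helper names, and it assumes the bad data came from UTF-8 bytes being decoded as Latin-1 (on Windows the culprit may instead be cp1252, which gives the same result for this particular character). Strings that fail the round trip are left unchanged, on the theory that they were stored correctly.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 decoding; return text unchanged if it does not apply."""
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the resulting
        # bytes are not valid UTF-8: assume the string was already correct.
        return text


def fix_document(value):
    """Recursively repair every string value in a nested dict/list structure."""
    if isinstance(value, str):
        return fix_mojibake(value)
    if isinstance(value, dict):
        return {key: fix_document(item) for key, item in value.items()}
    if isinstance(value, list):
        return [fix_document(item) for item in value]
    return value  # numbers, dates, ObjectIds, etc. pass through untouched


# Reproduce the damage (assumption: UTF-8 bytes read back as Latin-1).
garbled = 'C:\\Some Folder\\E=mc\u00b2.xyz'.encode('utf8').decode('latin1')

doc = {'string': garbled, 'tags': ['plain', 'E=mc\u00b2'], 'size': 1}
fixed = fix_document(doc)
print(fixed['string'])
```

With PyMongo, each repaired document could then be written back while iterating over `find()`, for example with `replace_one({'_id': doc['_id']}, fixed)`. A string that legitimately contains a sequence such as "Â²" would also be altered, so it is worth trying this on a copy of the collection first.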
Upvotes: 1