disorderdev

Reputation: 1468

Saving Chinese to MongoDB 2.4.8 causes unreadable strings

Everything worked fine when I was using MongoDB 2.0.6. I recently started using MongoDB 2.4.8 with the Java Play framework, and I found that when I save Chinese text, MongoDB actually stores an unreadable string such as &#21457;&#29983;. The same string is what shows up on the web page. Does anyone know why?

What should I do? How can I convert it back to readable Chinese?

Upvotes: 5

Views: 3085

Answers (4)

bitinn

Reputation: 9358

While I have no experience with the Play framework specifically, the general approach to resolving your issue is to log or dump the string right before it is passed to your MongoDB driver. Then:

  1. If the string is still plain UTF-8 text and not entities (&#...), check whether your MongoDB driver for 2.4 has been updated with new options that convert UTF-8 text into entities.

  2. If the string has already been converted to entities, you have at least ruled out the MongoDB driver and should track down the conversion within the Play framework instead.

As others have mentioned, MongoDB itself does not care whether your input contains entities, as long as it is UTF-8 encoded. It is more likely that the Play framework or the MongoDB driver is to blame.

PS: I assume "unreadable" means the characters were converted to entities (&#...), not encoded incorrectly.
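The check described above can be sketched in Java roughly as follows. This is only an illustration: `EntityCheck` and its method names are hypothetical helpers, not part of Play or the MongoDB driver, and the sketch assumes Java 8 or later for `String.codePoints()`.

```java
import java.util.regex.Pattern;

// Hypothetical helper for inspecting a value right before it reaches the driver.
final class EntityCheck {
    // Matches decimal numeric character references such as &#21457;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#\\d+;");

    /** Returns true if the string contains at least one decimal reference. */
    static boolean containsEntities(String s) {
        return DECIMAL_REF.matcher(s).find();
    }

    /** Lists the code points of the string as U+XXXX, suitable for a log line. */
    static String dumpCodePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }
}
```

Logging `EntityCheck.containsEntities(value)` and `EntityCheck.dumpCodePoints(value)` just before the insert call would show which of the two cases applies: real Chinese text appears as code points in the CJK ranges, while an entity-encoded string appears as the ASCII code points for `&`, `#`, digits and `;`.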

Upvotes: 1

daveh

Reputation: 3706

From what you have posted I suspect that this may be an artefact of the Play Framework, as both these characters can be stored directly in MongoDB.

> db.test1.insert({x:"𡑗 and 𩦃"})
> db.test1.find();
{ "_id" : ObjectId("52a12237e7c9d6190f6feb95"), "x" : "𡑗 and 𩦃" }

Assuming that the characters you posted as &#21457 and &#29983 above are really meant to be 𡑗 and 𩦃, I would suspect that the Play Framework is converting them into a representation of their extended Unicode values. In that case, those two characters would be from the "CJK Unified Ideographs Extension B" block.

You can view the whole set of characters here: http://codepoints.net/cjk_unified_ideographs_extension_b

This looks similar to an issue discussed in the play-framework Google group.
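One way to test the assumption about which characters the references stand for is to decode them and look at the resulting Unicode block. In HTML, references of the form `&#NNNN;` are decimal, so 21457 and 29983 correspond to U+53D1 and U+751F (发 and 生) in the basic "CJK Unified Ideographs" block. Only if the same digits are read as hexadecimal (U+21457 and U+29983) do they land in Extension B. The sketch below is a minimal illustration; `EntityDecoder` and its methods are hypothetical names, not an API of Play or the MongoDB driver.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Play or the MongoDB driver.
final class EntityDecoder {
    // Decimal numeric character references such as &#21457; (at most 7 digits).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    /** Replaces each valid decimal reference with the character it names. */
    static String decodeDecimal(String s) {
        Matcher m = DECIMAL_REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave references to invalid code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Formats a valid code point together with its Unicode block, e.g. for logging. */
    static String describe(int codePoint) {
        return String.format("U+%04X (%s)", codePoint, Character.UnicodeBlock.of(codePoint));
    }
}
```

Calling `EntityDecoder.decodeDecimal("&#21457;&#29983;")` returns the two-character string 发生, which suggests the stored value may be ordinary Chinese text written as entities. Comparing `EntityDecoder.describe(21457)` with `EntityDecoder.describe(0x21457)` shows the difference between the decimal and hexadecimal readings.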

Upvotes: 3

evanchooly

Reputation: 6243

I just wrote a quick test and this works just fine.

package com.mongodb;

import com.mongodb.util.TestCase;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest extends TestCase {
    String chinese = "你好";

    @Test
    public void saveChinese() {
        DBCollection collection = getDatabase().getCollection("chinese");
        collection.insert(new BasicDBObject().append("message", chinese));
        DBObject object = collection.findOne();
        Assert.assertEquals(chinese, object.get("message"));
    }
}

That text saves and loads without error. It would help to see what code you're using to test.

Upvotes: 2

deepakmodak

Reputation: 1339

I think your string gets converted to the unreadable form somewhere in between, because I tested this in the mongo shell and it works fine for me.

 $ mongo test
 MongoDB shell version: 2.4.8
 connecting to: test
 > var doc = { "message" :"你好" }
 > db.ChineseWord.save(doc)
 > db.ChineseWord.find().pretty()
 { "_id" : ObjectId("529da2018170273efa43e181"), "message" : "你好" }

Upvotes: 6
