Amarsh

Reputation: 11804

Sample parser code for the CEDICT

Does anyone have sample code for parsing the CEDICT file? CEDICT is a Chinese-English dictionary. For instance, if I open it in a text editor, a line in the CEDICT file currently looks like:

不 不 [bu4] /(negative prefix)/not/no/

I would like to see it as:

不 不 [bu4] /(negative prefix)/not/no/

I found that TextWrangler, as a text editor, does this for me. What I now need is sample code that achieves the same.

Upvotes: 0

Views: 548

Answers (1)

dda

Reputation: 6213

The thing is, it's just an encoding problem. If the line looks like

不 不 [bu4] /(negative prefix)/not/no/

It's because the text editor doesn't know/realize that the text is encoded as UTF-8. TextWrangler and its big brother BBEdit are very good at guessing encodings, and can even be asked to display text in a specific encoding.
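To make the encoding point concrete, here is a minimal Python sketch (Python is my choice for illustration; the question doesn't name a language). It decodes the same UTF-8 bytes two ways, using Latin-1 as an assumed stand-in for whatever wrong encoding the editor picked, and then reads a file with the encoding stated explicitly. The function name `read_utf8_lines` is hypothetical.

```python
# The character as it is stored on disk in a UTF-8 file: three bytes.
raw = "不".encode("utf-8")

# Decoding those bytes with the wrong encoding (Latin-1 is an assumption,
# used only for illustration) yields three unrelated characters.
garbled = raw.decode("latin-1")

# Decoding them as UTF-8 gives the original character back.
correct = raw.decode("utf-8")


def read_utf8_lines(path):
    """Read a text file as UTF-8 instead of relying on the platform default."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

The data in the file is fine either way; only the interpretation of the bytes differs, so telling the program the encoding up front is usually all that's needed.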

Since we don't know what you ultimately want to achieve, it's hard to tell you exactly what has to be done. What I can say is that your app (which language are you using anyway?) needs to be Unicode aware, and able to read and manipulate UTF-8 strings.

I wrote a couple of apps based on the CEDICT, one for Mac OS X, one for Android. Parsing and indexing the CEDICT is not very hard.

UPDATE

As for parsing the CEDICT itself, it's nothing complicated. I don't do Objective-C, never have, never will, but the process would be the same in any language:

  • Read a line. Say your own example: 不 不 [bu4] /(negative prefix)/not/no/
  • You have four fields: Trad. Ch., Simp. Ch., Reading, Meaning(s). These fields are space separated, but be careful: the reading is wrapped in square brackets and contains spaces for multi-syllable entries, and the 4th field, a list of definitions delimited by slashes, may contain spaces too.
  • Store the 4 fields in a database (I used an SQLite db). You might want to remove the slashes from the definition field, or replace them with something else.
  • Loop
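The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code from my apps: it assumes every entry has the layout `TRAD SIMP [reading] /def 1/def 2/`, that lines starting with `#` are comments, and the names `parse_line`, `load_into_db` and the `entries` table are all hypothetical.

```python
import re
import sqlite3

# Assumed entry layout: TRAD SIMP [reading] /def 1/def 2/
# The reading and the definitions may contain spaces, so the line is matched
# with a pattern instead of being split blindly on whitespace.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_line(line):
    """Return (trad, simp, reading, definitions), or None if the line is not an entry."""
    line = line.strip()
    if not line or line.startswith("#"):  # blank line or comment
        return None
    match = LINE_RE.match(line)
    if match is None:  # malformed line: skip rather than guess
        return None
    trad, simp, reading, meanings = match.groups()
    return trad, simp, reading, meanings.split("/")


def load_into_db(lines, conn):
    """Parse each line, insert the entries into SQLite, and return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(trad TEXT, simp TEXT, reading TEXT, meanings TEXT)"
    )
    rows = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        trad, simp, reading, definitions = parsed
        # Replace the slash delimiters with "; " for storage.
        rows.append((trad, simp, reading, "; ".join(definitions)))
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

To use it, open the dictionary file with `encoding="utf-8"` and pass the file object as `lines`, together with a connection from `sqlite3.connect(...)`. Joining the definitions with "; " is just one way to get rid of the slashes; a separate table with one row per definition would work as well.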

You have now converted the CEDICT to a database. That's the easy part. As for tokenizing Chinese, good luck with that, mate. Better minds than mine are still banging their heads on this one.

Upvotes: 2
