Reputation: 1637
I have to use a word2vec model containing tons of Chinese characters. The model was trained by my coworkers using Java and is saved as a bin file.
I installed gensim and tried to load the model, but the following error occurred:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data
I tried to load the model in both Python 2.7 and 3.5, and it failed the same way in each. So how can I load the model in gensim? Thanks.
Upvotes: 5
Views: 8569
Reputation: 1
You can reformat the word2vec file using the following Java code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
class ReformatW2V {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: ReformatW2V inputFileName outputFileName");
            return;
        }
        String inputFileName = args[0];
        String outputFileName = args[1];
        try (
            // Read the input and write the output explicitly as UTF-8.
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(inputFileName)), "utf-8"));
            PrintWriter pw = new PrintWriter(new OutputStreamWriter(new FileOutputStream(new File(outputFileName)), "utf-8"))
        ) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split each line on spaces and re-emit it, so every line
                // is rewritten with a clean UTF-8 encoding.
                String[] segs = line.split(" ");
                pw.println(String.join(" ", segs));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
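For readers without a Java toolchain, the same pass can be sketched in Python. The function name and paths are illustrative, and, like the Java code above, this assumes the vectors are stored in the *text* word2vec format, since it reads the file as UTF-8 text:

```python
def reformat_w2v(input_path, output_path):
    """Rewrite a word2vec text file line by line, dropping undecodable
    bytes and collapsing whitespace runs to single spaces."""
    with open(input_path, encoding='utf-8', errors='ignore') as fin, \
         open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(' '.join(line.split()) + '\n')
```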
Upvotes: 0
Reputation: 386
I tried the flag
unicode_errors='ignore'
but on its own it did not solve the unicode problem.
I noticed that I got the unicode error after I renamed the file from filename.bin.gz to filename.gz.
My solution was to extract the compressed file instead of renaming it.
Then I used the extracted file with the flag above, and there was no unicode error.
Note that I am on a Mac (Sierra) with Python 2.7.
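A minimal sketch of the extract-instead-of-rename step (the paths and the stand-in payload are illustrative; a real .bin.gz holds the compressed word2vec binary file):

```python
import gzip
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, 'vectors.bin.gz')
bin_path = os.path.join(tmpdir, 'vectors.bin')

# Stand-in payload for this sketch only.
payload = b'2 3\nplaceholder bytes'
with gzip.open(gz_path, 'wb') as f:
    f.write(payload)

# Renaming vectors.bin.gz to vectors.gz leaves the bytes gzip-compressed,
# which is what the loader then chokes on. Decompress instead:
with gzip.open(gz_path, 'rb') as f_in, open(bin_path, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# Then load the extracted file, e.g.:
# model = gensim.models.Word2Vec.load_word2vec_format(
#     bin_path, binary=True, unicode_errors='ignore')
```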
Upvotes: 2
Reputation: 1637
The model contains tons of Chinese characters and was trained with Java; I cannot figure out the encoding of the original corpus. The error can be solved as described in the gensim FAQ:
call load_word2vec_format with a flag that ignores character decoding errors:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')
But I have no idea whether ignoring the encoding errors matters for the resulting vectors.
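To see what the flag is doing under the hood: as far as I can tell, gensim forwards unicode_errors to Python's bytes.decode, so a malformed byte sequence is dropped from the affected vocabulary word rather than aborting the whole load. A minimal sketch:

```python
# A truncated UTF-8 sequence raises by default but is silently dropped
# with errors='ignore'.
raw = '中文'.encode('utf-8')[:-1]  # cut mid-character to simulate corruption

try:
    raw.decode('utf-8')  # strict decoding raises UnicodeDecodeError
except UnicodeDecodeError:
    pass

word = raw.decode('utf-8', errors='ignore')
print(word)  # the truncated final character is gone, leaving '中'
```

So the vectors themselves are read intact; only the damaged word strings come out shortened or mangled.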
Upvotes: 6