Reputation: 3466
I have extracted Cyrillic content from an HTML page into a text file. The Cyrillic looks fine in this file. I then use this file to create an RDF file with Jena. Here is my code:
private void createRDFFile(String webContentFilePath) throws IOException {
    // TODO Auto-generated method stub
    Model model = ModelFactory.createDefaultModel();
    RDFWriter writer = model.getWriter("RDF/XML");
    writer.setProperty("showXmlDeclaration", "true");
    writer.setProperty("showDoctypeDeclaration", "true");
    writer.setProperty("tab", "8");
    Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(rdfFilePath), "UTF8"));
    Resource resDest = null;
    Property hasTimeStart = model.createProperty(ns + "#hasTimeStart");
    Property distrName = model.createProperty(ns + "#distrName");
    Property moneyOneDir = model.createProperty(ns + "#moneyOneDir");
    Property moneyTwoDir = model.createProperty(ns + "#moneyTwoDir");
    Property hasTimeStop = model.createProperty(ns + "#hasTimeStop");
    BufferedReader br = new BufferedReader(new FileReader(
            webContentFilePath));
    String line = "";
    while ((line = br.readLine()) != null) {
        String[] arrayLine = line.split("\\|");
        resDest = model.createResource(ns + arrayLine[5]);
        resDest.addProperty(hasTimeStart, arrayLine[0]);
        resDest.addProperty(distrName, arrayLine[1]);
        resDest.addProperty(moneyOneDir, arrayLine[2]);
        resDest.addProperty(moneyTwoDir, arrayLine[3]);
        resDest.addProperty(hasTimeStop, arrayLine[4]);
    }
    br.close();
    model.write(System.out, "RDF/XML");
    writer.write(model, out, null);
}
When I open the RDF file the Cyrillic is like РўР РђРќРЎРљРћРџ-Р‘Р?ТОЛА. Could somebody help me?
Upvotes: 0
Views: 166
Reputation: 16630
It could be that the output is correct, but you're not seeing it correctly.
new FileReader(...) opens the file with the platform-default character set, which is not UTF-8 on Windows. So if the input looks right to you, you may be viewing it in something other than UTF-8.
Jena writes UTF-8 by default, and your code also requests UTF-8 explicitly.
So you cannot view the file you write the same way you viewed the input. You need to open it with a UTF-8-aware viewer.
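To see what a mismatched encoding looks like, here is a minimal sketch that encodes Cyrillic text as UTF-8 and then decodes the same bytes as windows-1251. The class and method names are made up for illustration, and windows-1251 is only an assumption about your platform default; it also assumes your JDK ships that charset, which standard distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text as UTF-8, then decodes those bytes with another charset,
    // which is what happens when a viewer or reader assumes the wrong encoding.
    static String misreadUtf8As(String text, String charsetName) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Unicode escapes for the Cyrillic letters T, R, A, N, S, so the demo
        // does not depend on the encoding of this source file.
        String original = "\u0422\u0420\u0410\u041D\u0421";
        // What actually appears on screen depends on the console encoding.
        System.out.println(misreadUtf8As(original, "windows-1251"));
    }
}
```

For example, the letter Т (U+0422) is the two bytes D0 A2 in UTF-8, and windows-1251 maps those bytes to Р and ў. That is the same "Рў" pair your garbled text starts with, so the output is consistent with UTF-8 bytes being interpreted as windows-1251 somewhere along the way.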
Upvotes: 1
Reputation: 13305
The UTF-8 write encoding on the output writer looks correct, so that suggests that you're not reading webContentFilePath with the correct encoding. As a diagnostic, you could try just reading that file in and then writing it out to a plain UTF-8 file (no RDF). My guess is that you will have to be explicit about setting the file encoding on br, or ensure that the scraped web page is written out in UTF-8 to begin with.
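A rough sketch of that diagnostic, with no Jena involved: read the file with an explicitly named charset and write it straight back out as UTF-8. The class name, method name, and command-line arguments are invented for this example, and which source charset is correct for your file is something you would have to find out by trying.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EncodingCheck {
    // Copies a text file line by line, decoding it with sourceCharset
    // and re-encoding every line as UTF-8.
    static void copyAsUtf8(Path source, Charset sourceCharset, Path target)
            throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: EncodingCheck <input file> <input charset> <output file>
        copyAsUtf8(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

If the copy looks right in a UTF-8 viewer when you pass UTF-8 as the input charset, the same fix should carry over to your method by replacing the FileReader with something like new BufferedReader(new InputStreamReader(new FileInputStream(webContentFilePath), "UTF-8")). Files.newBufferedReader is also stricter than FileReader: it throws a MalformedInputException on bytes that are invalid for the chosen charset, which can itself tell you the guess was wrong.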
Upvotes: 2