vikifor

Reputation: 3466

Write Cyrillic to an RDF file using the Jena library

I have extracted Cyrillic content from an HTML page into a text file. The Cyrillic looks fine in that file. I then use the file to create an RDF file with Jena. Here is my code:

private void createRDFFile(String webContentFilePath) throws IOException {
    // TODO Auto-generated method stub
    Model model = ModelFactory.createDefaultModel();

    RDFWriter writer = model.getWriter("RDF/XML");
    writer.setProperty("showXmlDeclaration", "true");
    writer.setProperty("showDoctypeDeclaration", "true");
    writer.setProperty("tab", "8");
    Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(rdfFilePath), "UTF8"));
    Resource resDest = null;
    Property hasTimeStart = model.createProperty(ns + "#hasTimeStart");
    Property distrName = model.createProperty(ns + "#distrName");
    Property moneyOneDir = model.createProperty(ns + "#moneyOneDir");
    Property moneyTwoDir = model.createProperty(ns + "#moneyTwoDir");
    Property hasTimeStop = model.createProperty(ns + "#hasTimeStop");

    BufferedReader br = new BufferedReader(new FileReader(
            webContentFilePath));
    String line = "";
    while ((line = br.readLine()) != null) {
        String[] arrayLine = line.split("\\|");
        resDest = model.createResource(ns + arrayLine[5]);
        resDest.addProperty(hasTimeStart, arrayLine[0]);
        resDest.addProperty(distrName, arrayLine[1]);
        resDest.addProperty(moneyOneDir, arrayLine[2]);
        resDest.addProperty(moneyTwoDir, arrayLine[3]);
        resDest.addProperty(hasTimeStop, arrayLine[4]);
    }
    br.close();
    model.write(System.out, "RDF/XML");
    writer.write(model, out, null);

}

When I open the RDF file the Cyrillic is like РўР РђРќРЎРљРћРџ-Р‘Р?ТОЛА. Could somebody help me?

Upvotes: 0

Views: 166

Answers (2)

AndyS

Reputation: 16630

It could be that the output is correct, but you're not seeing it correctly.

new FileReader(...) opens the file with the platform-default character set. On Windows this is typically not UTF-8, so if the input looks right to you, you may be viewing it in something other than UTF-8.

Jena writes UTF-8 by default, and it does so in this case as well.

So once the file is written, you cannot view it the same way you viewed the input. You need to open it with a UTF-8 aware viewer.
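
To illustrate the point, here is a minimal sketch of what happens when correct UTF-8 bytes are decoded by a viewer that assumes a different charset. The class name MojibakeDemo and the helper viewAs are made up for this example, and windows-1251 is only an assumption about what the viewer uses (it is available in standard JDK distributions, but may be missing from a stripped-down runtime). The sample word is written with Unicode escapes so the source compiles regardless of the compiler's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: encode the text as UTF-8 (as Jena does on output),
    // then decode those same bytes with the charset a viewer might assume.
    static String viewAs(String text, Charset viewerCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, viewerCharset);
    }

    public static void main(String[] args) {
        // An eight-letter Cyrillic word, spelled with Unicode escapes.
        String original = "\u0422\u0420\u0410\u041D\u0421\u041A\u041E\u041F";

        // Assumption: the viewer decodes with windows-1251.
        Charset cp1251 = Charset.forName("windows-1251");

        // Each Cyrillic letter is two bytes in UTF-8, so a single-byte
        // viewer shows two characters per letter.
        String garbled = viewAs(original, cp1251);
        System.out.println("letters in original: " + original.length());
        System.out.println("chars seen by viewer: " + garbled.length());

        // Decoding the same bytes as UTF-8 gives the original text back.
        System.out.println("round trip ok: "
                + viewAs(original, StandardCharsets.UTF_8).equals(original));
    }
}
```

If the doubled characters produced this way resemble what you see in the RDF file, the file itself is probably fine and only the viewer's encoding setting needs changing.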

Upvotes: 1

Ian Dickinson

Reputation: 13305

The UTF-8 write encoding on the output writer looks correct, so that suggests that you're not reading webContentFilePath with the correct encoding. As a diagnostic, you could try just reading that file in and then writing it out to a plain UTF-8 file (no RDF). My guess is that you will have to be explicit about setting the file encoding on br, or ensure that the scraped web page is written out in UTF-8 to begin with.
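
A rough sketch of that diagnostic, with no RDF involved: read the file with an explicitly named charset and write it straight back out as UTF-8. The class name EncodingCheck, the method copyAsUtf8, and the command-line arguments are invented for this example, and the input charset is something you have to supply yourself (UTF-8 and windows-1251 are the likely candidates for Cyrillic text, but that is a guess).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EncodingCheck {

    // Hypothetical helper: copy a text file line by line, decoding the input
    // with an explicit charset instead of the platform default that
    // FileReader would use, and encoding the output as UTF-8.
    static void copyAsUtf8(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(in, inCharset);
             BufferedWriter bw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: EncodingCheck <input> <input-charset> <output>");
            return;
        }
        // Example arguments (assumed): content.txt UTF-8 check.txt
        copyAsUtf8(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

Files.newBufferedReader throws a MalformedInputException when the bytes are not valid in the named charset, which is itself a useful hint that the guess was wrong. Once the output file shows the Cyrillic correctly in a UTF-8 viewer, the same explicit charset can be used when constructing br in createRDFFile.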

Upvotes: 2
