Reputation: 187
I have created a number of ttl files from publicly available edge-list graph data, using my metadata specification. I am not able to upload some of these ttl files to Fuseki. This is what they look like (the structure):
[] <authorID> <1399> ;
<authorName> "Dimitris Samaras";.
<1399> <authorIDof> "Dimitris Samaras" . //line 363
<1399> <nodetype> <AUTHOR> .
[] <authorID> <1407> ;
<authorName> "Haojun Wang";.
<1407> <authorIDof> "Haojun Wang" .
<1407> <nodetype> <AUTHOR> .
[] <authorID> <1450> ;
<authorName> "Zhigang Zhu";.
<1450> <authorIDof> "Zhigang Zhu" .
<1450> <nodetype> <AUTHOR> .
and so on....
Fuseki gives me the following error when I try uploading the file:
14:32:33 INFO [80] POST http://localhost:3030/ds/upload
14:32:33 INFO [80] Upload: Filename: dblp1111.ttl, Content-Type=application/octet-stream, Charset=null => Turtle
14:32:33 ERROR [line: 363, col: 11] Bad character encoding
14:32:33 INFO [80] 400 Parse error: [line: 363, col: 11] Bad character encoding (25 ms)
Where am I going wrong?
Upvotes: 0
Views: 1874
Reputation: 16630
(corrected answer)
This is the one case where the line number is wrong. It merely indicates where the parser was at the time of the error (bad encoding in UTF-8), but the parser reads ahead and uses Java's built-in bytes-to-chars UTF-8 conversion in large blocks (128K) for efficiency.
Java does not report where the bad encoding is in the byte stream, only that there is an error, so you'll have to "divide-and-conquer".
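As an alternative to manual divide-and-conquer, you can let a strict UTF-8 decode tell you the byte offset of the first invalid sequence. A minimal sketch (the helper name and the sample line are illustrative, not from your file):

```python
def find_bad_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
        return None  # the whole input is valid UTF-8
    except UnicodeDecodeError as e:
        return e.start  # byte offset where decoding first failed

# Simulated line where a name was saved as ISO-8859-1 instead of UTF-8:
# 0xE9 is 'é' in Latin-1 but is not a valid standalone byte in UTF-8.
sample = b'<1399> <authorIDof> "Andr\xe9 Dupont" .\n'
print(find_bad_utf8_offset(sample))  # -> 25
```

Run the same check over the bytes of dblp1111.ttl (e.g. `open(path, "rb").read()`) and print a few bytes of context around the returned offset to see which triple is corrupted.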
You might try the program "arq.utf8" in Jena, which reads UTF-8 and does its own conversion in such a way as to report the place where the bad encoding is situated (to within a few character positions).
[Wrong answer]
Turtle is UTF-8 - there is no choice. I suspect the file actually contains accented characters, which are encoded differently in ISO-8859-1 and UTF-8.
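Even though this diagnosis was wrong here, the underlying point is worth seeing: an accented character takes one byte in ISO-8859-1 but two in UTF-8, and the single Latin-1 byte is rejected by a strict UTF-8 parser such as Fuseki's Turtle reader. A quick illustration (the name "José" is just an example with an accent):

```python
name = "José"  # hypothetical accented value; "Dimitris Samaras" is plain ASCII

latin1 = name.encode("iso-8859-1")  # b'Jos\xe9'     - one byte for 'é'
utf8 = name.encode("utf-8")         # b'Jos\xc3\xa9' - two bytes for 'é'
print(latin1, utf8)

# The lone Latin-1 byte 0xE9 is not a valid UTF-8 sequence,
# so decoding the Latin-1 bytes as UTF-8 fails:
try:
    latin1.decode("utf-8")
except UnicodeDecodeError as e:
    print("bad byte at offset", e.start)  # -> bad byte at offset 3
```

This is exactly the class of error Fuseki's "Bad character encoding" message points at: the file was written in one encoding but parsed as UTF-8.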
Upvotes: 4