Reputation: 189
I am trying to load the latest-truthy.nt dataset from Wikidata into Apache Jena Fuseki (TDB2), but the import fails with the error below. I was inspired by a success story from BITPlan, where the same kind of import worked.
Error log:
14:36:16 INFO loader :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot :: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
at tdb2.tdbloader.loadQuads(tdbloader.java:175)
at tdb2.tdbloader.exec(tdbloader.java:136)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at tdb2.tdbloader.main(tdbloader.java:64)
Script to import:
@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%
tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log
echo finish import on %DATE% %TIME%
pause
File structure:
- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat
- F:/
-- latest-truthy.nt
Is this an issue with Fuseki? I can't open the .nt file myself to remove the offending line. Are there any flags I can use to make tdbloader skip validation for this import?
I am also asking this in the IRC channel of Wikidata to see if they might be able to help me.
UPDATE: I got an answer from someone on IRC, who told me that a whole lot of errors exist in the dataset (see Errors in Wikidata). So I now need to find a way to skip the lines with errors and continue loading. But the Fuseki TDB2 Commands page doesn't show anything helpful.
Running --help prints the following, which suggests that an option for skipping doesn't exist?
c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader--loader= [--desc DATASET | --loc DIR] FILE ...
Location
--loc=DIR Location (a directory)
--tdb= Assembler description file
--graph=IRI Act on a named graph
--loader= Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
--syntax=LANG Syntax of data from stdin
Symbol definition
--set Set a configuration symbol to a value
--mem=FILE Execute on an in-memory TDB database (for testing)
--desc= Assembler description file
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
--strict Operate in strict SPARQL mode (no extensions of any kind)
Upvotes: 1
Views: 528
Reputation: 15604
@NLxDoDge - thanks for pointing to my BITPlan success story. Indeed, Wikidata .nt dumps may contain triples that are incompatible with an import into Jena 4.1. I ran into a similar problem today with the human settlements dump at https://wdumps.toolforge.org/dump/1607.
A single triple:
<http://www.wikidata.org/entity/Q992883> <http://www.wikidata.org/prop/direct/P856> <http://www.sonora.gob.mx/portal/Runscript.asp?p=ASP\\pg239.asp> .
would spoil the show giving the error:
10:45:06 ERROR riot :: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
org.apache.jena.riot.RiotException: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
I simply edited the 1.2 GB wdump-1607.nt file in vim, where you can jump to the line number with
:6964090
and then save the file
:wq!
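If the file is too large to open in an editor at all (as the question mentions), the same single-line fix can probably be done with sed instead of vim. This is only a sketch: the helper names show_line and drop_line and the output file name are made up for illustration, and the line number is the one taken from the error message above.

```shell
#!/bin/bash
# show_line FILE LINENO: print one line without opening the file in an editor
show_line() {
  local file="$1" lineno="$2"
  # print line N, then quit so sed does not scan the rest of the file
  sed -n "${lineno}{p;q;}" "$file"
}

# drop_line FILE LINENO OUT: copy FILE to OUT, leaving out line LINENO
drop_line() {
  local file="$1" lineno="$2" out="$3"
  # the original file is left untouched; OUT needs as much disk space as FILE
  sed "${lineno}d" "$file" > "$out"
}

# example (file names are assumptions):
#   show_line wdump-1607.nt 6964090
#   drop_line wdump-1607.nt 6964090 wdump-1607-fixed.nt
```

Note that dropping the line removes that triple from the data instead of repairing it, so it is worth looking at the line with show_line first.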
You might want to try out your environment with this small dump file (a 100 MB download that expands to 1.2 GB) before attempting the full Wikidata import, which in the end will need more than 2 TB of SSD disk space to work.
Please find below the scripts I used for importing the dump and starting the Fuseki server.
You should get a result like this:
tail -f tdb2--err.log
10:57:24 INFO loader :: Loader = LoaderPhased
10:57:24 INFO loader :: Start: wdump-1607.nt
10:57:27 INFO loader :: Add: 500,000 wdump-1607.nt (Batch: 170,706 / Avg: 170,706)
10:57:29 INFO loader :: Add: 1,000,000 wdump-1607.nt (Batch: 255,102 / Avg: 204,540)
10:57:31 INFO loader :: Add: 1,500,000 wdump-1607.nt (Batch: 229,885 / Avg: 212,344)
10:57:33 INFO loader :: Add: 2,000,000 wdump-1607.nt (Batch: 245,579 / Avg: 219,780)
10:57:36 INFO loader :: Add: 2,500,000 wdump-1607.nt (Batch: 185,804 / Avg: 212,026)
10:57:39 INFO loader :: Add: 3,000,000 wdump-1607.nt (Batch: 146,627 / Avg: 197,355)
10:57:43 INFO loader :: Add: 3,500,000 wdump-1607.nt (Batch: 140,567 / Avg: 186,587)
10:57:46 INFO loader :: Add: 4,000,000 wdump-1607.nt (Batch: 142,166 / Avg: 179,573)
10:57:50 INFO loader :: Add: 4,500,000 wdump-1607.nt (Batch: 134,444 / Avg: 173,116)
10:57:53 WARN riot :: [line: 4869426, col: 86] Bad IRI: <http://:goku.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
10:57:54 INFO loader :: Add: 5,000,000 wdump-1607.nt (Batch: 143,307 / Avg: 169,589)
10:57:54 INFO loader :: Elapsed: 29.48 seconds [2021/08/14 10:57:54 CEST]
10:57:57 INFO loader :: Add: 5,500,000 wdump-1607.nt (Batch: 139,314 / Avg: 166,303)
10:58:01 INFO loader :: Add: 6,000,000 wdump-1607.nt (Batch: 146,842 / Avg: 164,487)
10:58:04 INFO loader :: Add: 6,500,000 wdump-1607.nt (Batch: 143,678 / Avg: 162,674)
10:58:08 INFO loader :: Add: 7,000,000 wdump-1607.nt (Batch: 142,085 / Avg: 161,008)
10:58:11 INFO loader :: Add: 7,500,000 wdump-1607.nt (Batch: 144,300 / Avg: 159,775)
10:58:15 INFO loader :: Add: 8,000,000 wdump-1607.nt (Batch: 141,362 / Avg: 158,484)
10:58:18 INFO loader :: Add: 8,500,000 wdump-1607.nt (Batch: 141,083 / Avg: 157,343)
10:58:22 INFO loader :: Add: 9,000,000 wdump-1607.nt (Batch: 147,492 / Avg: 156,761)
10:58:23 INFO loader :: Finished: wdump-1607.nt: 9,179,041 tuples in 58.91s (Avg: 155,812)
10:58:32 INFO loader :: Finish - index SPO
10:58:32 INFO loader :: Start replay index SPO
10:58:32 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP
10:58:32 INFO loader :: Add: 1,000,000 Index (Batch: 8,928,571 / Avg: 8,928,571)
10:58:34 INFO loader :: Add: 2,000,000 Index (Batch: 508,130 / Avg: 961,538)
10:58:36 INFO loader :: Add: 3,000,000 Index (Batch: 381,388 / Avg: 638,026)
10:58:39 INFO loader :: Add: 4,000,000 Index (Batch: 370,233 / Avg: 540,321)
10:58:42 INFO loader :: Add: 5,000,000 Index (Batch: 362,450 / Avg: 492,029)
10:58:45 INFO loader :: Add: 6,000,000 Index (Batch: 370,644 / Avg: 466,562)
10:58:47 INFO loader :: Add: 7,000,000 Index (Batch: 367,647 / Avg: 449,293)
10:58:50 INFO loader :: Add: 8,000,000 Index (Batch: 366,166 / Avg: 436,895)
10:58:53 INFO loader :: Add: 9,000,000 Index (Batch: 380,952 / Avg: 429,881)
10:58:54 INFO loader :: Index set: SPO => SPO->POS, SPO->OSP [9,174,324 items, 22.0 seconds]
10:58:58 INFO loader :: Finish - index OSP
10:58:59 INFO loader :: Finish - index POS
10:58:59 INFO loader :: Time = 94.428 seconds : Triples = 9,179,041 : Rate = 97,207 /s
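In the log above, the bad IRI at line 4869426 was only reported as a WARN and the load carried on, whereas the illegal escape sequence was an ERROR that stopped it. A small sketch for pulling the affected input line numbers out of a loader log, assuming the "[line: N, col: M]" message format shown in these logs; riot_lines is a hypothetical helper name:

```shell
#!/bin/bash
# riot_lines LOGFILE [LEVEL]: print the input line numbers riot complained about
# LEVEL defaults to ERROR; pass WARN to list the warnings instead
riot_lines() {
  local log="$1" level="${2:-ERROR}"
  # keep only riot messages of the requested level (stack trace lines do not match),
  # then extract N from "[line: N, col: M]" and de-duplicate
  grep -E "[[:space:]]${level}[[:space:]]+riot[[:space:]]" "$log" \
    | sed -n 's/.*\[line: \([0-9][0-9]*\), col: [0-9][0-9]*\].*/\1/p' \
    | sort -n -u
}

# example (log file name as used with tail above):
#   riot_lines tdb2--err.log        # lines that aborted a load
#   riot_lines tdb2--err.log WARN   # lines that were only warned about
```

Since the loader stops at the first ERROR, each run reports at most one such line, so this is mainly useful for collecting the numbers across repeated attempts.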
Script to run Fuseki:
#!/bin/bash
# WF 2020-06-25
# WF 2021-08-14
# Jena Fuseki server installation
# see https://jena.apache.org/documentation/fuseki2/fuseki-run.html
version=4.1.0
fuseki=apache-jena-fuseki-$version
if [ ! -d "$fuseki" ]
then
  if [ ! -f "$fuseki.tar.gz" ]
  then
    wget "http://archive.apache.org/dist/jena/binaries/$fuseki.tar.gz"
  else
    echo "$fuseki.tar.gz already downloaded"
  fi
  echo "unpacking $fuseki.tar.gz"
  tar xvfz "$fuseki.tar.gz"
else
  echo "$fuseki already downloaded and unpacked"
fi
cd "$fuseki" || exit 1
# serve the TDB2 database in ../data under the dataset name /wdhs
java -jar fuseki-server.jar --tdb2 --loc=../data /wdhs
Script to load the data:
#!/bin/bash
# WF 2020-10-05
# WF 2021-08-14
# global settings
jena=apache-jena-4.1.0
tgz=$jena.tar.gz
mirror=https://downloads.apache.org/jena/binaries
jenaurl=$mirror/$tgz
base=$(pwd)
#base=/hd/luxio/gnd
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader
getjena() {
  # download
  if [ ! -f "$tgz" ]
  then
    echo "downloading $tgz from $jenaurl"
    wget "$jenaurl"
  else
    echo "$tgz already downloaded"
  fi
  # unpack
  if [ ! -d "$jena" ]
  then
    echo "unpacking $jena from $tgz"
    tar xvzf "$tgz"
  else
    echo "$jena already unpacked"
  fi
  # create data directory
  if [ ! -d "$data" ]
  then
    echo "creating $data directory"
    mkdir -p "$data"
  else
    echo "$data directory already created"
  fi
}
#
# show the given timestamp
#
timestamp() {
  local msg="$1"
  local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "$msg at $ts"
}
#
# load data for the given data dir and input
#
loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  # note: $phase is never set in this script, so the logs end up as
  # tdb2--out.log and tdb2--err.log
  "$tdbloader" --loc "$data" "$input" > tdb2-$phase-out.log 2> tdb2-$phase-err.log
  timestamp "finished loading $input to $data"
}
getjena
export TMPDIR=$base/tmp
if [ ! -d "$TMPDIR" ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir -p "$TMPDIR"
else
  echo "using temporary directory $TMPDIR"
fi
# load every N-Triples file in the current directory
for file in *.nt
do
  loaddata "$data" "$file"
done
Upvotes: 2