NLxDoDge

Reputation: 189

Trying to load Wikidata latest-truthy.nt with tdb2.tdbloader results in Code: 58/PROHIBITED_COMPONENT_PRESENT in USER

I am trying to load the latest-truthy.nt dataset from Wikidata into Apache Jena Fuseki, but I get the following error during the import. I was inspired by the success story from BITPlan, where the import did work.

Error log:

14:36:16 INFO  loader          :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot            :: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
    at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
    at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
    at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
    at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
    at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
    at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
    at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
    at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
    at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
    at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
    at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
    at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
    at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
    at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
    at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
    at tdb2.tdbloader.loadQuads(tdbloader.java:175)
    at tdb2.tdbloader.exec(tdbloader.java:136)
    at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
    at tdb2.tdbloader.main(tdbloader.java:64)

Script to import:

@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%

tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log

echo finish import on %DATE% %TIME%
pause

File structure:

- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat

- F:/
-- latest-truthy.nt

Is this an issue with Fuseki? I can't open the .nt file myself to remove the offending line. Are there any flags I can use so that tdbloader skips validation for this import?

I am also asking this in the IRC channel of Wikidata to see if they might be able to help me.

UPDATE: I got an answer from someone on IRC, who told me that a whole lot of errors exist in the dataset (see "Errors in Wikidata"). So I now need to find a way to skip the lines with errors and continue loading, but the Fuseki TDB2 Commands documentation doesn't show anything helpful.

Also, the help output is as follows, which suggests there is no option for skipping bad lines?

c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader--loader= [--desc DATASET | --loc DIR] FILE ...
  Location
      --loc=DIR              Location (a directory)
      --tdb=                 Assembler description file
      --graph=IRI            Act on a named graph
      --loader=              Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
      --syntax=LANG          Syntax of data from stdin
  Symbol definition
      --set                  Set a configuration symbol to a value
      --mem=FILE             Execute on an in-memory TDB database (for testing)
      --desc=                Assembler description file
  General
      -v   --verbose         Verbose
      -q   --quiet           Run with minimal output
      --debug                Output information for debugging
      --help
      --version              Version information
      --strict               Operate in strict SPARQL mode (no extensions of any kind)

Upvotes: 1

Views: 528

Answers (1)

Wolfgang Fahl

Reputation: 15604

@NLxDoDge - thanks for pointing to my BITPlan success story. Indeed, Wikidata N-Triples dumps may contain triples that Jena 4.1 refuses to import. I ran into a similar problem today with the human settlements dump at https://wdumps.toolforge.org/dump/1607.

A single triple:

<http://www.wikidata.org/entity/Q992883> <http://www.wikidata.org/prop/direct/P856> <http://www.sonora.gob.mx/portal/Runscript.asp?p=ASP\\pg239.asp> .

would spoil the show giving the error:

10:45:06 ERROR riot            :: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
org.apache.jena.riot.RiotException: [line: 6964090, col: 139] Illegal unicode escape sequence value: \\ (0x5C)
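As far as the N-Triples grammar goes, a backslash inside an IRI is only valid as the start of a \uXXXX or \UXXXXXXXX escape. A rough way to look for other lines of this kind before loading could be the following sketch. The helper name find_bad_backslashes is hypothetical, and the pattern is only an approximation: it may also flag a "<" that appears inside a string literal.

```shell
#!/bin/bash
# Sketch: list lines of an N-Triples file that contain a backslash inside
# an IRI which does not start a \u or \U escape.
# find_bad_backslashes is a hypothetical helper name, not a Jena tool.
find_bad_backslashes() {
  local file="$1"
  # '<' opens the IRI; [^<>" ]* stays inside it; \\ is a literal backslash;
  # [^uU] rejects the two escape forms that are allowed.
  grep -nE '<[^<>" ]*\\[^uU]' "$file"
}
```

Calling find_bad_backslashes wdump-1607.nt prints each suspicious line prefixed with its line number, so the numbers can be compared with the ones the loader reports.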

I simply edited the 1.2 GB wdump-1607.nt file in vim, where you can jump to the line number with

:6964090

and then, once the line is fixed, save the file with

:wq!
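For files that are too large to open comfortably in an editor, the same kind of fix can be sketched non-interactively with sed. The helper names show_line and drop_line are hypothetical, and dropping the whole triple (rather than repairing it) is an assumption about what is acceptable for your data.

```shell
#!/bin/bash
# Sketch: inspect and drop a single line of a large file without an editor.
# show_line and drop_line are hypothetical helper names.

# show_line FILE N: print line N of FILE, then stop reading
show_line() {
  local file="$1" line="$2"
  sed -n "${line}{p;q;}" "$file"
}

# drop_line FILE N OUT: write FILE without line N to OUT
drop_line() {
  local file="$1" line="$2" out="$3"
  sed "${line}d" "$file" > "$out"
}
```

For example, show_line wdump-1607.nt 6964090 prints the offending triple, and drop_line wdump-1607.nt 6964090 wdump-1607-clean.nt writes a copy without it. Writing to a new file leaves the original dump untouched, at the cost of extra disk space.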

You might want to try out your environment with this small dump file first: it is a 100 MB download that expands to 1.2 GB. The full Wikidata import will in the end need more than 2 TB of SSD disk space to work.

Please find below the scripts I used for importing the dump and starting the fuseki server.

The load should produce a log like this:

tail -f tdb2--err.log 
10:57:24 INFO  loader          :: Loader = LoaderPhased
10:57:24 INFO  loader          :: Start: wdump-1607.nt
10:57:27 INFO  loader          :: Add: 500,000 wdump-1607.nt (Batch: 170,706 / Avg: 170,706)
10:57:29 INFO  loader          :: Add: 1,000,000 wdump-1607.nt (Batch: 255,102 / Avg: 204,540)
10:57:31 INFO  loader          :: Add: 1,500,000 wdump-1607.nt (Batch: 229,885 / Avg: 212,344)
10:57:33 INFO  loader          :: Add: 2,000,000 wdump-1607.nt (Batch: 245,579 / Avg: 219,780)
10:57:36 INFO  loader          :: Add: 2,500,000 wdump-1607.nt (Batch: 185,804 / Avg: 212,026)
10:57:39 INFO  loader          :: Add: 3,000,000 wdump-1607.nt (Batch: 146,627 / Avg: 197,355)
10:57:43 INFO  loader          :: Add: 3,500,000 wdump-1607.nt (Batch: 140,567 / Avg: 186,587)
10:57:46 INFO  loader          :: Add: 4,000,000 wdump-1607.nt (Batch: 142,166 / Avg: 179,573)
10:57:50 INFO  loader          :: Add: 4,500,000 wdump-1607.nt (Batch: 134,444 / Avg: 173,116)
10:57:53 WARN  riot            :: [line: 4869426, col: 86] Bad IRI: <http://:goku.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.
10:57:54 INFO  loader          :: Add: 5,000,000 wdump-1607.nt (Batch: 143,307 / Avg: 169,589)
10:57:54 INFO  loader          ::   Elapsed: 29.48 seconds [2021/08/14 10:57:54 CEST]
10:57:57 INFO  loader          :: Add: 5,500,000 wdump-1607.nt (Batch: 139,314 / Avg: 166,303)
10:58:01 INFO  loader          :: Add: 6,000,000 wdump-1607.nt (Batch: 146,842 / Avg: 164,487)
10:58:04 INFO  loader          :: Add: 6,500,000 wdump-1607.nt (Batch: 143,678 / Avg: 162,674)
10:58:08 INFO  loader          :: Add: 7,000,000 wdump-1607.nt (Batch: 142,085 / Avg: 161,008)
10:58:11 INFO  loader          :: Add: 7,500,000 wdump-1607.nt (Batch: 144,300 / Avg: 159,775)
10:58:15 INFO  loader          :: Add: 8,000,000 wdump-1607.nt (Batch: 141,362 / Avg: 158,484)
10:58:18 INFO  loader          :: Add: 8,500,000 wdump-1607.nt (Batch: 141,083 / Avg: 157,343)
10:58:22 INFO  loader          :: Add: 9,000,000 wdump-1607.nt (Batch: 147,492 / Avg: 156,761)
10:58:23 INFO  loader          :: Finished: wdump-1607.nt: 9,179,041 tuples in 58.91s (Avg: 155,812)
10:58:32 INFO  loader          :: Finish - index SPO
10:58:32 INFO  loader          :: Start replay index SPO
10:58:32 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP
10:58:32 INFO  loader          :: Add: 1,000,000 Index (Batch: 8,928,571 / Avg: 8,928,571)
10:58:34 INFO  loader          :: Add: 2,000,000 Index (Batch: 508,130 / Avg: 961,538)
10:58:36 INFO  loader          :: Add: 3,000,000 Index (Batch: 381,388 / Avg: 638,026)
10:58:39 INFO  loader          :: Add: 4,000,000 Index (Batch: 370,233 / Avg: 540,321)
10:58:42 INFO  loader          :: Add: 5,000,000 Index (Batch: 362,450 / Avg: 492,029)
10:58:45 INFO  loader          :: Add: 6,000,000 Index (Batch: 370,644 / Avg: 466,562)
10:58:47 INFO  loader          :: Add: 7,000,000 Index (Batch: 367,647 / Avg: 449,293)
10:58:50 INFO  loader          :: Add: 8,000,000 Index (Batch: 366,166 / Avg: 436,895)
10:58:53 INFO  loader          :: Add: 9,000,000 Index (Batch: 380,952 / Avg: 429,881)
10:58:54 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP [9,174,324 items, 22.0 seconds]
10:58:58 INFO  loader          :: Finish - index OSP
10:58:59 INFO  loader          :: Finish - index POS
10:58:59 INFO  loader          :: Time = 94.428 seconds : Triples = 9,179,041 : Rate = 97,207 /s
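The riot messages in these logs carry the offending position as [line: N, col: M]. As a sketch, the line numbers of ERROR entries can be collected from the error log and the corresponding lines filtered out of the dump before the next attempt. The helper names bad_lines and skip_lines are hypothetical. Judging by the log in the question, the load stops at the first ERROR, so this may have to be repeated once per bad line.

```shell
#!/bin/bash
# Sketch: drop the lines that riot reported as ERROR from an N-Triples file.
# bad_lines and skip_lines are hypothetical helper names.

# bad_lines LOG: print the distinct line numbers of riot ERROR entries
bad_lines() {
  local log="$1"
  grep 'ERROR riot' "$log" \
    | sed -nE 's/.*\[line: ([0-9]+), col: [0-9]+\].*/\1/p' \
    | sort -un
}

# skip_lines NUMBERS INPUT: print INPUT without the lines listed in NUMBERS
skip_lines() {
  local numbers="$1" input="$2"
  awk -v list="$numbers" '
    BEGIN { while ((getline n < list) > 0) bad[n] }  # load the numbers to skip
    !(NR in bad)                                      # print every other line
  ' "$input"
}
```

A possible use is bad_lines tdb2--err.log > bad.txt followed by skip_lines bad.txt wdump-1607.nt > wdump-1607-clean.nt. WARN entries are deliberately ignored here, since the log above shows the load continuing past them.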

Script to run Fuseki:

#!/bin/bash
# WF 2020-06-25
# WF 2021-08-14
# Jena Fuseki server installation
# see https://jena.apache.org/documentation/fuseki2/fuseki-run.html
version=4.1.0
fuseki=apache-jena-fuseki-$version
if [ ! -d $fuseki ]
then
  if [ ! -f $fuseki.tar.gz ]
  then
    wget http://archive.apache.org/dist/jena/binaries/$fuseki.tar.gz
  else
    echo $fuseki.tar.gz already downloaded
  fi
  echo "unpacking $fuseki.tar.gz"
  tar xvfz $fuseki.tar.gz
else
  echo $fuseki already downloaded and unpacked
fi
cd $fuseki
java -jar fuseki-server.jar --tdb2 --loc=../data /wdhs

Script to load data:

#!/bin/bash
# WF 2020-10-05
# WF 2021-08-14

# global settings
jena=apache-jena-4.1.0
tgz=$jena.tar.gz
# older releases such as 4.1.0 are served from the archive
mirror=https://archive.apache.org/dist/jena/binaries
jenaurl=$mirror/$tgz
base=$(pwd)
#base=/hd/luxio/gnd
data=$base/data
tdbloader=$jena/bin/tdb2.tdbloader

getjena() {
# download
if [ ! -f $tgz ]
then
  echo "downloading $tgz from $jenaurl"
    wget $jenaurl
else
  echo "$tgz already downloaded"
fi
# unpack
if [ ! -d $jena ]
then
  echo "unpacking $jena from $tgz"
    tar xvzf $tgz
else
  echo "$jena already unpacked"
fi
# create data directory
if [ ! -d $data ]
then
  echo "creating $data directory"
  mkdir -p $data
else
  echo "$data directory already created"
fi
}

#
# show the given timestamp
#
timestamp() {
 local msg="$1"
 local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 echo "$msg at $ts"
}

#
# load data for the given data dir and input
#
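loaddata() {
  local data="$1"
  local input="$2"
  timestamp "start loading $input to $data"
  # note: $phase is never set in this script, so it expands to nothing
  # and the logs are named tdb2--out.log and tdb2--err.log
  $tdbloader --loc "$data" "$input" > tdb2-$phase-out.log 2> tdb2-$phase-err.log
  timestamp "finished loading $input to $data"
}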

getjena
export TMPDIR=$base/tmp
if [ ! -d $TMPDIR ]
then
  echo "creating temporary directory $TMPDIR"
  mkdir $TMPDIR
else
  echo "using temporary directory $TMPDIR"
fi
# load every N-Triples file in the current directory
for file in *.nt
do
  loaddata "$data" "$file"
done

Upvotes: 2
