anupam mishra

Reputation: 75

DBpedia extraction framework failure during extraction of a DBpedia dump

While working with the DBpedia extraction framework, I am facing issues with the CSV files from the Core Dataset. I want to extract data (in my case, the abstract of every company's Wikipedia page) from the DBpedia dumps (RDF format). I am following the instructions from the DBpedia Abstract Extraction Step-by-step Guide.

Commands used:

$ git clone git://github.com/dbpedia/extraction-framework.git 
$ cd extraction-framework 
$ mvn clean install 
$ cd dump 
$ ../run download config=download.minimal.properties 
$ ../run extraction extraction.default.properties

I get the error below when executing the last command ("../run extraction extraction.default.properties"). Can anyone point out what mistake I am making? Is there a specific CSV file I need to process, or is this a configuration issue? I have the full "mediawiki-1.24.1".

Also, please note that I only downloaded pages-articles.xml.bz2 partially, up to 256 MB. Please help.
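Since the dump was only partially downloaded, a quick sanity check is to test whether the file is a complete bzip2 stream; a file truncated mid-download will fail this test. This is a minimal sketch using the standard `bzip2 -t` integrity check (the file name matches the one mentioned below; adjust the path to your actual download directory):

```shell
# check_dump FILE -- report whether FILE is a complete, valid bzip2 stream.
# A download that stopped partway (e.g. at 256 MB) will be reported CORRUPT.
check_dump() {
    if bzip2 -t "$1" 2>/dev/null; then
        echo "OK: $1 is a complete bzip2 stream"
    else
        echo "CORRUPT: $1 is truncated or damaged -- re-download it"
    fi
}

# Example (path is an assumption; point it at your dump directory):
check_dump enwiki-20150205-pages-articles.xml.bz2
```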

parsing /opt/extraction-framework-master/DumpsData/wikidatawiki/20150113/wikipedias.csv
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
    at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.Exception: expected [15] fields, found [1] in line [%21%21%21 http://www.w3.org/2000/01/rdf-schema#label !!! l]
    at org.dbpedia.extraction.util.WikiInfo$.fromLine(WikiInfo.scala:60)
    at org.dbpedia.extraction.util.WikiInfo$$anonfun$fromLines$1.apply(WikiInfo.scala:49)
    at org.dbpedia.extraction.util.WikiInfo$$anonfun$fromLines$1.apply(WikiInfo.scala:49)
    at scala.collection.Iterator$class.foreach(Iterator.scala:743)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1195)
    at org.dbpedia.extraction.util.WikiInfo$.fromLines(WikiInfo.scala:49)
    at org.dbpedia.extraction.util.WikiInfo$.fromSource(WikiInfo.scala:36)
    at org.dbpedia.extraction.util.WikiInfo$.fromFile(WikiInfo.scala:27)
    at org.dbpedia.extraction.util.ConfigUtils$.parseLanguages(ConfigUtils.scala:83)
    at org.dbpedia.extraction.dump.sql.Import$.main(Import.scala:29)
    at org.dbpedia.extraction.dump.sql.Import.main(Import.scala)
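The exception says the parser expected 15 fields but found only 1, and the offending line looks like RDF data rather than a CSV row, which points to a corrupted or incompletely written wikipedias.csv. A quick way to locate such rows is to scan the file for lines that do not have 15 fields; this is a sketch, and the comma field separator is an assumption inferred from the "expected [15] fields" message, not confirmed from the framework's source:

```shell
# bad_rows FILE -- print every line of FILE that does not have exactly
# 15 comma-separated fields (the count the DBpedia parser complained about).
bad_rows() {
    awk -F',' 'NF != 15 { printf "line %d: %d field(s)\n", NR, NF }' "$1"
}

# The path below is the one from the log; adjust it to your setup.
CSV=/opt/extraction-framework-master/DumpsData/wikidatawiki/20150113/wikipedias.csv
if [ -f "$CSV" ]; then
    bad_rows "$CSV"
else
    echo "$CSV not found here; adjust the path"
fi
```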

Upvotes: 1

Views: 377

Answers (1)

anupam mishra

Reputation: 75

I was facing the above issue because of an incomplete download of the enwiki-20150205-pages-articles.xml.bz2 file via

$ ../run download config=download.minimal.properties

However, I am still failing to resolve the abstract extraction issue, as I expect long abstracts from the DBpedia dump.

$ ../run extraction extraction.abstracts.properties

It builds completely and performs extraction over more than 1 crore (10 million) pages, but no data shows up in long_abstracts_en.nt.

I followed the instructions to set up MediaWiki, PHP, MySQL, etc.
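To check whether the abstract extraction actually wrote anything, you can count the triples in the output file. This is a minimal sketch; depending on the extraction configuration the file may be written compressed (long_abstracts_en.nt.bz2, an assumption), so both cases are handled:

```shell
# count_triples FILE -- count N-Triples lines (subject URIs start with '<')
# in a plain or bzip2-compressed .nt file; comment lines start with '#'.
count_triples() {
    case "$1" in
        *.bz2) bzcat "$1" | grep -c '^<' ;;
        *)     grep -c '^<' "$1" ;;
    esac
}
```

A count of 0 would confirm that the extractor ran but emitted no abstracts.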

Upvotes: 0
