Reputation: 45
I have a problem importing a big XML file (1.3 GB) into MongoDB in order to search for the most frequent words in a map-reduce manner.
http://dumps.wikimedia.org/plwiki/20141228/plwiki-20141228-pages-articles-multistream.xml.bz2
Here I enclose an XML cut (the first 10,000 lines) from this big file:
http://www.filedropper.com/text2
I know that I can't import XML directly into MongoDB. I used some tools to do so, as well as some Python scripts, and all have failed.
Which tool or script should I use? What should the key and value be? I think the best schema for finding the most frequent word would be:
(_id : id, value: word )
Then I would sum all the elements, as in the example from the docs:
http://docs.mongodb.org/manual/core/map-reduce/
Any clues would be greatly appreciated, but the main question remains: how do I import this file into MongoDB so that I have a collection like:
(_id : id, value: word )
If you have any ideas, please share.
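The summing step that the map-reduce would perform can be sketched in plain Python. This is only an illustration of the counting logic, assuming documents shaped like the proposal above; the sample documents here are made up:

```python
from collections import Counter

# Hypothetical documents of the proposed form { _id: <id>, value: <word> }
docs = [
    {"_id": 1, "value": "wiki"},
    {"_id": 2, "value": "xml"},
    {"_id": 3, "value": "wiki"},
]

# map: emit (word, 1) per document; reduce: sum the 1s per word.
# Counter does both steps in one pass.
counts = Counter(d["value"] for d in docs)
print(counts.most_common(1))  # [('wiki', 2)]
```

MongoDB's mapReduce (or an aggregation with $group and $sum) computes the same thing server-side once the words are in a collection.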
Edit: After some research, I would use Python or JS to complete this task. I would extract only the words in the <text></text> section, which sits under <page><revision>, exclude <, > etc., then split the text into words and upload them to MongoDB with PyMongo or JS.
So there are several pages, each with a revision and a text.
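The extraction step described above can be sketched with the standard library's streaming parser, which never loads the whole 1.3 GB file into memory. This is a minimal sketch on a made-up snippet; the real dump puts its tags in a MediaWiki XML namespace, which the endswith check papers over:

```python
import re
from io import BytesIO
from xml.etree.ElementTree import iterparse

# Tiny stand-in for the Wikipedia dump structure (<page><revision><text>).
sample = b"""<mediawiki>
  <page>
    <revision>
      <text>Ala ma kota, a kot ma Ale.</text>
    </revision>
  </page>
</mediawiki>"""

words = []
for event, elem in iterparse(BytesIO(sample), events=("end",)):
    # Real dump tags look like '{http://www.mediawiki.org/...}text',
    # so match on the suffix rather than the full tag name.
    if elem.tag.endswith("text") and elem.text:
        words.extend(re.findall(r"\w+", elem.text.lower()))
    elem.clear()  # free processed elements so a huge file can stream through

print(words)
# The words could then be uploaded with PyMongo, e.g.
# collection.insert_many({"value": w} for w in words)
```

For the real file, replace BytesIO(sample) with open('plwiki-...-pages-articles-multistream.xml', 'rb').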
Upvotes: 2
Views: 12805
Reputation: 1167
The XML file I'm using goes this way:
<labels>
<label>
<name>Bobby Nice</name>
<urls>
<url>www.examplex.com</url>
<url>www.exampley.com</url>
<url>www.examplez.com</url>
</urls>
</label>
...
</labels>
and I can import it into MongoDB using xml-stream.
Code:
var fs = require('fs');
var XmlStream = require('xml-stream');

// `db` is assumed to be an already-open MongoDB connection
// (e.g. obtained from MongoClient.connect).

// Pass a ReadStream to xml-stream so the file is parsed incrementally
var stream = fs.createReadStream('20080309_labels.xml');
var xml = new XmlStream(stream);

var i = 1;
// Fired once per complete <label> element
xml.on('endElement: label', function(label) {
  db.collection('labels').update(label, label, { upsert: true }, (err, doc) => {
    if (err) {
      process.stdout.write(err + "\r");
    } else {
      process.stdout.write(`Saved ${i} entries..\r`);
      i++;
    }
  });
});

xml.on('end', function() {
  console.log('end event received, done');
});
Upvotes: 0
Reputation: 7559
To save all this data, store it in GridFS.
The easiest way to convert the XML is to use xmltodict to convert it to JSON and save that:
https://stackoverflow.com/a/10201405/861487
import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

doc['mydocument']['@has']
# u'an attribute'
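As a side note, xmltodict returns plain dicts, so the result can be passed straight to PyMongo or serialized as JSON. A dependency-free sketch using a hand-written dict of roughly the shape xmltodict produces for the snippet above (attributes get an '@' prefix, repeated elements collapse into lists, mixed text lands under '#text'):

```python
import json

# Hand-written approximation of xmltodict.parse()'s output for the
# <mydocument> example above.
doc = {
    "mydocument": {
        "@has": "an attribute",
        "and": {"many": ["elements", "more elements"]},
        "plus": {"@a": "complex", "#text": "element as well"},
    }
}

# A plain dict can go straight into MongoDB (collection.insert_one(doc))
# or be round-tripped through JSON first:
as_json = json.dumps(doc)
print(doc["mydocument"]["@has"])  # an attribute
```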
Upvotes: 1