JMnedict.xml: how can I purge entries by XML tag?

JMnedict is a Japanese name dictionary that is freely available online as XML; I have not found it distributed in any other format. It is free to use as long as credit is given. It is documented here: https://www.edrdg.org/enamdict/enamdict_doc.html (I did not write the code below, I only copy-pasted it).

I understand the principles of XML and why it exists, but nothing beyond that. I am not familiar with programs that handle it, apart from text editors and maybe OpenOffice.

I downloaded it for the Japanese names. I want to keep only the entries tagged fem, masc, given, surname, and unclass.

I hope to make a Japanese name generator from the data and this is the best source material for it. (I made a Korean one already).

The problem is that the full list of name types in the file is much longer:

character, company name, creature, deity, document, event, female given name or forename, fiction, given name, group, legend, male given name or forename, mythology, object, organization name, other, person, place, product name, religion, service, ship name, railway station, family or surname, unclassified name, work.

And the xml file is 152.3 MB.

I'd like to drop all of the entries that do not fall into the categories I want.

I'm looking for an efficient way to trim the file, so I can upload it to the online database.

Structure of the entries looks like this (Not sure if it helps):

<entry>
<ent_seq>5000000</ent_seq>
<k_ele>
<keb>ゝ泉</keb>
</k_ele>
<r_ele>
<reb>ちゅせん</reb>
</r_ele>
<trans>
<name_type>&given;</name_type>
<trans_det>Chusen</trans_det>
</trans>
</entry>

(as an example)

and trim out entries such as this:

<entry>
<ent_seq>5000198</ent_seq>
<k_ele>
<keb>あかり博物館</keb>
</k_ele>
<r_ele>
<reb>あかりはくぶつかん</reb>
</r_ele>
<trans>
<name_type>&place;</name_type>
<trans_det>Akari Museum</trans_det>
</trans>
</entry>

So is there an efficient way to purge all of the entries I don't want, in BBEdit or something similar? I need to trim the file so I can upload it to my database (and hopefully a smaller file will upload more easily).

I'm on a Mac if you're suggesting applications.

Or is there a different approach to purging the entries that I haven't thought of?

What I've Tried:

Upvotes: 0

Views: 39

Answers (1)

Graham Asher

Reputation: 1780

I use JMnedict.xml in my CartoType map rendering system to convert Japanese placenames from their Japanese-script form to a Roman transcription. I load the whole file in using the open-source RapidXml XML parser, using this C++ code, having #include'd <rapidxml.hpp>:

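// Also needs <cstdio>, <cstdlib>, <fstream>, <string> and <vector>.
std::string filename { aFileName };
std::ifstream file { filename.c_str(), std::ifstream::binary };
if (!file)
    {
    printf("cannot open Japanese dictionary file '%s'\n",filename.c_str());
    exit(1);
    }

// Read the whole file into a zero-terminated buffer; RapidXml parses it in place.
file.seekg(0,file.end);
std::streampos length = file.tellg();
file.seekg(0,file.beg);
std::vector<char> file_text(length);
file.read(file_text.data(),length);
file_text.push_back(0);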

rapidxml::xml_document<> doc;
try
    {
    doc.parse<0>(&file_text[0]);
    }
catch (const rapidxml::parse_error& e)
    {
    const char* start = &file_text[0];
    const char* w = e.where<char>();
    std::ptrdiff_t byte_index = w - start;
    printf("error parsing Japanese dictionary file '%s' at byte %lld\n",filename.c_str(),(long long)byte_index);
    exit(1);
    }

It is easy to traverse every entry in the file and write out a new file with only the entries you need. Here is some code to do the traversal:

// Check that the top-level object is a <JMnedict> element.
auto top_node = doc.first_node();
if (top_node == nullptr || strcmp(top_node->name(),"JMnedict"))
    {
    printf("Japanese dictionary has no <JMnedict> element\n");
    exit(1);
    }

// Traverse the <entry> elements.
for (auto entry_node = top_node->first_node("entry"); entry_node; entry_node = entry_node->next_sibling("entry",5))
    {
    // YOUR FILTER CODE HERE
    }

Upvotes: 0
