Reputation: 1
JMnedict is a Japanese name file that's freely available online as XML; I have not found it stored in any other format. It is free to use as long as credit is given. Found here: https://www.edrdg.org/enamdict/enamdict_doc.html (I did not write the code below, I only copy-pasted it).
I get the principles of XML and why it exists, but nothing beyond that. I am not savvy with specific programs that handle XML besides text editors and maybe OpenOffice.
I downloaded it for the Japanese names. I want to filter the entries down to only the fem, masc, given, surname, and unclass types.
I hope to make a Japanese name generator from the data and this is the best source material for it. (I made a Korean one already).
The thing is that the list of entities is:
character, company name, creature, deity, document, event, female given name or forename, fiction, given name, group, legend, male given name or forename, mythology, object, organization name, other, person, place, product name, religion, service, ship name, railway station, family or surname, unclassified name, work.
And the XML file is 152.3 MB.
I'd like to drop all of the entries that do not fall into the categories I want.
I'm looking for an efficient way to trim the file so I can upload it to the online database.
The structure of an entry looks like this (not sure if it helps):
<entry>
<ent_seq>5000000</ent_seq>
<k_ele>
<keb>ゝ泉</keb>
</k_ele>
<r_ele>
<reb>ちゅせん</reb>
</r_ele>
<trans>
<name_type>&given;</name_type>
<trans_det>Chusen</trans_det>
</trans>
</entry>
(as an example)
and trim out entries such as this:
<entry>
<ent_seq>5000198</ent_seq>
<k_ele>
<keb>あかり博物館</keb>
</k_ele>
<r_ele>
<reb>あかりはくぶつかん</reb>
</r_ele>
<trans>
<name_type>&place;</name_type>
<trans_det>Akari Museum</trans_det>
</trans>
</entry>
So is there an efficient way to purge all of the entries I don't want in BBEdit, something similar, or a method I'm not familiar with? I need to trim the file so I can upload it to my database (and hopefully a shorter file will upload more easily).
I'm on a Mac if you're suggesting applications.
OR is there some other method I've not thought of to purge the entries?
What I've Tried:
I tried to upload the XML file directly to my online database; the plan was then to run MySQL DROP statements to remove the offending entries. But the file was too big, even when zipped, and phpMyAdmin gave up and spat out an error that the file was incomplete.
I did try trimming the file down to a few entries for phpMyAdmin, and that uploaded successfully, but uploading the whole thing in sections seems inefficient, especially with the file so long and containing entries I don't want anyway.
I tried to convert it to a CSV file, but it is too big and the automatic online converters can't handle it. (Converting it to JSON or similar hits the same problem.)
I tried OpenOffice, but again it can't handle a file that large, and the application crashed on me. (I thought I could sort, then delete.)
I can open it in BBEdit and manually trim the file, but the file is very long and not organized, so that is super inefficient. I'd rather be able to sort the file by name_type and delete the entries that don't belong.
Upvotes: 0
Views: 39
Reputation: 1780
I use JMnedict.xml in my CartoType map rendering system to convert Japanese placenames from their Japanese-script form to a Roman transcription. I load the whole file in using the open-source RapidXml XML parser, using this C++ code, having #include'd <rapidxml.hpp>:
std::string filename { aFileName };
std::ifstream file { filename.c_str(), std::ifstream::binary };
if (!file)
{
printf("cannot open Japanese dictionary file '%s'\n",filename.c_str());
exit(1);
}
// Find the file length, then seek back to the start.
file.seekg(0,file.end);
std::streampos length = file.tellg();
file.seekg(0,file.beg);
// Read the whole file into memory; RapidXml needs a null-terminated buffer.
std::vector<char> file_text(length);
file.read(file_text.data(),length);
file_text.push_back(0);
rapidxml::xml_document<> doc;
try
{
doc.parse<0>(&file_text[0]);
}
catch (const rapidxml::parse_error& e)
{
// Report the byte offset at which parsing failed.
const char* start = &file_text[0];
const char* w = e.where<char>();
std::ptrdiff_t byte_index = w - start;
printf("error parsing Japanese dictionary file '%s' at byte %lld\n",filename.c_str(),(long long)byte_index);
exit(1);
}
It is easy to traverse every entry in the file and write out a new file with only the entries you need. Here is some code to do the traversal:
// Check that the top-level object is a <JMnedict> element.
// (strcmp needs <cstring>; printf needs <cstdio>.)
auto top_node = doc.first_node();
if (top_node == nullptr || strcmp(top_node->name(),"JMnedict"))
{
printf("Japanese dictionary has no <JMnedict> element\n");
exit(1);
}
// Traverse the <entry> elements; the 5 passed to next_sibling is the length of "entry".
for (auto entry_node = top_node->first_node("entry"); entry_node; entry_node = entry_node->next_sibling("entry",5))
{
// YOUR FILTER CODE HERE
}
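As a sketch of what the filter code could look like (my addition, not part of the original answer): JMnedict stores the type in `<name_type>` as a DTD entity reference such as `&given;` or `&place;`, and as far as I know RapidXml leaves entity references it doesn't recognize untranslated, so inside the loop you can read the text of `<trans>`/`<name_type>` and compare it against the types you want to keep. The `wants_name_type` helper below is hypothetical; adjust the strings if your parser expands the entities.

```cpp
#include <set>
#include <string>

// Hypothetical helper: returns true when a raw <name_type> value (with the
// entity reference left untranslated, e.g. "&given;") is one of the types
// the question wants to keep: fem, masc, given, surname, unclass.
bool wants_name_type(const std::string& name_type)
{
static const std::set<std::string> wanted {
    "&fem;", "&masc;", "&given;", "&surname;", "&unclass;"
};
return wanted.count(name_type) > 0;
}
```

Inside the loop you would navigate `entry_node->first_node("trans")` and then its `name_type` child, and when `wants_name_type` returns true, write the entry out (RapidXml's companion header `rapidxml_print.hpp` can serialize a node back to text).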
Upvotes: 0