Mike
Mike

Reputation: 377

BaseX: Inserting nodes performance problems

I am experiencing some performance problems when inserting XML nodes to existing nodes in a BaseX database.

Usecase

I have one big XML file (about 2GB) from which I created a BaseX database. The XML looks like this (simplified). It has about 350.000 <record>s:

<collection>

  <record>
    <id>ABC007</id>
    <title>The title of the record</title>
    <author>Joe Lastname</author>
    ... [other information]
  </record>

  <record>
    <id>ABC555</id>
    <relation_id>ABC007</relation_id>
    <title>Another title</title>
    <author>Sue Lastname</author>
    ... [other information]
  </record>

  ... [many other <record>s]

</collection>

The <record>s are related to each other. The <relation_id> in one record points to an <id> in another record (see example above).

What I am doing in BaseX is inserting information from one related record to the other one and vice versa. So, the result looks like this:

<collection>

  <record>
    <id>ABC007</id>
    <title>The title of the record</title>
    <author>Joe Lastname</author>
    ... [other information]
    <related_record> <!-- Insert this information -->
        <title>Another title</title>
        <author>Sue Lastname</author>
    </related_record>
  </record>

  <record>
    <id>ABC555</id>
    <relation_id>ABC007</relation_id>
    <title>Another title</title>
    <author>Sue Lastname</author>
    ... [other information]
    <related_record> <!-- Insert this information -->
        <title>The title of the record</title>
        <author>Joe Lastname</author>
    </related_record>
  </record>

  ... [many other <record>s that should be enriched with other records data]

</collection>

I am doing that with the following Java code:

// Setting some options and variables
Context context = new Context();
new Set(MainOptions.AUTOFLUSH, false).execute(context);
new Set(MainOptions.AUTOOPTIMIZE, false).execute(context);
new Set(MainOptions.UPDINDEX, true).execute(context);

// Opening the database
new Open('database_name').execute(context);

// Get all records with <relation_id> tags. These are the "child" records and they contain the "parent" record ID.
String queryParentIdsInChild = "for $childRecord in doc('xmlfile.xml')//record[relation_id]
                                return db:node-id($childRecord)"

// Iterate over the child records and get the parent record ID
QueryProcessor parentIdsInChildProc = new QueryProcessor(queryParentIdsInChild, context);
Iter iter = parentIdsInChildProc.iter();
parentIdsInChildProc.close();

for(Item childRecord; (childRecord = iter.next()) != null;) {
    // Create a pointer to the child record in BaseX for convenience
    String childNodeId = childRecord.toString();
    String childNode = "db:open-id('database_name', " + childNodeId + ")";

    // Get some details from the child record. They should be added to the parent record.
    String queryChildDetails = "let $title := data("+childNode+"/title)"
        + " let $author := data("+childNode+"/author)"
        + " return "
        + "<related_record>"
        + "  <title>{$title}</title>"
        + "  <author>{$author}</author>"
        + "</related_record>";
    String childDetails = new XQuery(queryChildDetails).execute(context);   

    // Create a pointer to the parent record in BaseX for convenience
    parentNode = (... similar procedure like getting the child node, therefore skiping that code here)

    // PERFORMANCE ISSUE HERE!!!
    // Insert the child record details to the parent node
    String parentUpdate = "insert node " + childDetails + " into " + parentNode;
    new XQuery(parentUpdate).execute(context);
}

... flushing and optimizing code here

Problem

The problem is that I experience massive performance problems when inserting the new nodes to a <record>. In a smaller test database with about 10.000 <record>s, the inserts are executed quite fast - in about 7 seconds. When I run the same code in my production database with about 350.000 <record>s, a single insert operation takes several seconds, some even minutes! And there would be thousands of these inserts, so it definitely takes too long.

Questions

I'm very new to BaseX and I am for sure not the most experienced Java programmer. Maybe I'm just overlooking something or making some stupid mistake. So I'm asking if someone has a hint for me. What could be the problem? Is it the Java code? Or is the BaseX database with 350.000 <record>s just too big for insert operations? If yes: Is there a workaround? Or is BaseX (or XML databases in general) not the right tool for this usecase?

Further Information

I am using BaseX 9.0.2 in stand-alone mode on Ubuntu 18.04. I have done an "Optimize All" before running the above mentioned code.

Upvotes: 1

Views: 384

Answers (1)

Mike
Mike

Reputation: 377

I think I didn't run the optimize correctly. After I optimized again the insert commands ran very fast. Now, about 10000 inserts are executing within under a second. Maybe it also helped that I deactivated UPDINDEX and AUTOOPTIMIZE.

Upvotes: 1

Related Questions