Reputation: 579
For partitioning RDF triples by subject, I use String.hashCode() of the subject and put each triple in the corresponding partition. The goal is to be able to process the partitioned files in memory (processing the whole large file at once may not be possible).
Now, in order to have a restricted number of partitions, I do the following. Assuming that we want 10 partitions out of a large RDF file:
String subject;
int partition = subject.hashCode() / (Integer.MAX_VALUE / 10);
This way, all triples with the same subject end up in the same partition, and overall we have 10 partitions.
The problem is that when subjects are unevenly distributed, this can produce very large or very small partitions, which is undesirable.
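For reference, a minimal runnable sketch of the scheme described above (class and method names are illustrative). Note two pitfalls the one-liner glosses over: hashCode() can be negative in Java, so the sign bit is masked off first, and the division can yield 10 for hashes very close to Integer.MAX_VALUE, so the result is clamped to the last bucket:

```java
public class SubjectPartitioner {
    static final int NUM_PARTITIONS = 10;

    // Bucket index derived from the subject's hash, as in the question.
    static int partitionOf(String subject) {
        int h = subject.hashCode() & 0x7fffffff;           // force non-negative
        int bucket = h / (Integer.MAX_VALUE / NUM_PARTITIONS);
        return Math.min(bucket, NUM_PARTITIONS - 1);       // clamp edge case
    }

    public static void main(String[] args) {
        // All triples sharing a subject map to the same partition index.
        System.out.println(partitionOf("http://example.org/alice"));
        System.out.println(partitionOf("http://example.org/bob"));
    }
}
```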
Does anybody have any suggestion?
Thank you in advance.
Upvotes: 0
Views: 184
Reputation: 6112
If you don't care about keeping same-subject triples within a single partition, then just create ten buckets and fill them round-robin. O(n) and as balanced as possible.
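The round-robin idea can be sketched as follows (method names are illustrative). Bucket sizes differ by at most one, but same-subject triples are scattered across buckets:

```java
public class RoundRobinPartitioner {
    // Triple number i goes to bucket i % numBuckets; the returned array
    // holds how many triples each bucket ends up with.
    static int[] bucketSizes(int numTriples, int numBuckets) {
        int[] sizes = new int[numBuckets];
        for (int i = 0; i < numTriples; i++) {
            sizes[i % numBuckets]++;
        }
        return sizes;
    }
}
```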
Upvotes: 3
Reputation: 5552
You can simply use the modulo of the hash to pick a partition:
Math.floorMod(subject.hashCode(), 10)
Note that Java's % operator returns a negative result when hashCode() is negative, so use Math.floorMod (or an equivalent sign fix) to get an index in [0, 10). This will spread subjects more or less evenly across the ten partitions.
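A quick sketch to see how evenly this spreads subjects (the IRIs here are hypothetical). Keep in mind it balances distinct subjects, not triples: one subject with many triples still inflates its partition:

```java
import java.util.Arrays;

public class ModuloDemo {
    // Count how many of n hypothetical subject IRIs land in each of 10 buckets.
    static int[] bucketCounts(int n) {
        int[] counts = new int[10];
        for (int i = 0; i < n; i++) {
            String subject = "http://example.org/resource/" + i; // hypothetical IRI
            counts[Math.floorMod(subject.hashCode(), 10)]++;     // non-negative index
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bucketCounts(1000)));
    }
}
```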
Upvotes: 0