Reputation: 579
For partitioning RDF triples by subject, I use String.hashCode() of the subject and put each triple in the corresponding partition. The goal is to be able to process the partitioned files in memory (processing the whole large file at once may not be possible).
Now, in order to have a restricted number of partitions, I do the following. Assuming that we want 10 partitions out of a large RDF file:
String subject;
int partition = subject.hashCode() / (Integer.MAX_VALUE / 10);
This way, all triples with the same subject end up in the same partition, and overall we have 10 partitions.
The problem is that when subjects are unevenly distributed, this can produce very large or very small partitions, which is undesirable.
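For reference, a minimal runnable sketch of the scheme described above (class and method names are illustrative). Note two pitfalls the one-liner glosses over: hashCode() can be negative in Java, so the sign bit is masked off first, and the division can yield 10 for hashes very close to Integer.MAX_VALUE, so the result is clamped to the last bucket:

```java
public class SubjectPartitioner {
    static final int NUM_PARTITIONS = 10;

    // Bucket index derived from the subject's hash, as in the question.
    static int partitionOf(String subject) {
        int h = subject.hashCode() & 0x7fffffff;           // force non-negative
        int bucket = h / (Integer.MAX_VALUE / NUM_PARTITIONS);
        return Math.min(bucket, NUM_PARTITIONS - 1);       // clamp edge case
    }

    public static void main(String[] args) {
        // All triples sharing a subject map to the same partition index.
        System.out.println(partitionOf("http://example.org/alice"));
        System.out.println(partitionOf("http://example.org/bob"));
    }
}
```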
Does anybody have any suggestion?
Thank you in advance.
Upvotes: 0
Views: 184
Reputation: 6112
If you don't care about keeping same-subject triples within a single partition, then just create ten buckets and fill them round-robin. O(n) and as balanced as possible.
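The round-robin idea can be sketched as follows (method names are illustrative). Bucket sizes differ by at most one, but same-subject triples are scattered across buckets:

```java
public class RoundRobinPartitioner {
    // Triple number i goes to bucket i % numBuckets; the returned array
    // holds how many triples each bucket ends up with.
    static int[] bucketSizes(int numTriples, int numBuckets) {
        int[] sizes = new int[numBuckets];
        for (int i = 0; i < numTriples; i++) {
            sizes[i % numBuckets]++;
        }
        return sizes;
    }
}
```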
Upvotes: 3
Reputation: 5552
You can simply use the modulo of the hash to pick a partition:
Math.floorMod(subject.hashCode(), 10)
Note that Java's % operator returns a negative result when hashCode() is negative, so use Math.floorMod (or an equivalent sign fix) to get an index in [0, 10). This will spread subjects more or less evenly across the ten partitions.
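A quick sketch to see how evenly this spreads subjects (the IRIs here are hypothetical). Keep in mind it balances distinct subjects, not triples: one subject with many triples still inflates its partition:

```java
import java.util.Arrays;

public class ModuloDemo {
    // Count how many of n hypothetical subject IRIs land in each of 10 buckets.
    static int[] bucketCounts(int n) {
        int[] counts = new int[10];
        for (int i = 0; i < n; i++) {
            String subject = "http://example.org/resource/" + i; // hypothetical IRI
            counts[Math.floorMod(subject.hashCode(), 10)]++;     // non-negative index
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bucketCounts(1000)));
    }
}
```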
Upvotes: 0