JNMN
JNMN

Reputation: 165

Optimizing searching techniques for large data sets

I'm currently working on a project where I need to work with a .csv file that is around 3 million lines long and different .xlsx files that range in size between 10 lines long and over 1000 lines long. I'm trying to find commonalities between different cells in my .xlsx file and my .csv file. To do this. I've read in my .csv file and .xslx file and stored both in ArrayLists. I have what I want to work, working however the method I'm using is O(n^3) using a 3 nested for loop do search between each.

//This is our .xlsx file stored in an ArrayList
for(int i = 1; i<finalKnowledgeGraph.size(); i+=3) {
            //loop through our knowledgeGraph again
            for(int j = 1; j<finalKnowledgeGraph.size(); j+=3) {
                //loop through .csv file which is stored in an ArrayList
                for(int k=1; k<storeAsserions.size(); k++) {
                   if(finalKnowledgeGraph.get(i).equals(storeAsserions.get(k)) && finalKnowledgeGraph.get(j+1).equals(storeAsserions.get(k+1))){
                      System.out.println("Do Something");
                   } else if(finalKnowledgeGraph.get(i+1).equals(storeAsserions.get(k)) && finalKnowledgeGraph.get(j).equals(storeAsserions.get(k+1))) {
                       System.out.println("Do something else");
                   }
                }
            }
        }

At the moment in my actual code, my System.out.println("Do something") is just writing specific parts of each file to a new .csv file.

Now, with what I'm doing out of the way my problem is optimization. Obviously if im running a 3 nested for loop over millions of inputs, it won't be finished running in my life time so I'm wondering what ways I can optimize the code.

One of my friends suggested storing the files in memory and so read/writes will be several times quicker. Another friend suggested storing the files in hashtables instead of ArrayLists to help speed up the process but since I'm essentially searching EVERY element in said hashtable I don't see how that's going to speed up the process. It just seems like it's going to transfer the searching from one data structure to another. However I said i'd also post the question here and see if people had any tips/suggestions on how I'd go about optimizing this code. Thanks

Note: I have literally no knowledge of optimization etc. myself and I found other questions on S/O too specific for my knowledge on the field so if the question seems like a duplicate, I've probably seen the question you're talking about already and couldn't understand the content

Edit: Everything stored in both ArrayLists is verb:noun:noun pairs where I'm trying to compare nouns between each ArrayList. Since I'm not concerned with verbs, I start searching at index 1. (Just for some context)

Upvotes: 0

Views: 335

Answers (1)

maaartinus
maaartinus

Reputation: 46422

One possible solution would be using a database, which -- given a proper index -- could do the search pretty fast. Assuming the data fit in memory, you can be even faster.

The principle

For problems like

for (X x : xList) {
    for (Y y : yList) {
        if (x.someAttr() == y.someAttr()) doSomething(x, y);
    }
}

you simply partition one list into buckets according to the attribute like

Map<A, List<Y>> yBuckets = new HashMap<>();
yList.forEach(y -> yBuckets.compute(y.someAttr(), (k, v) ->
    (v==null ? new ArrayList<>() : v).add(y));

Now, you iterate the other list and only look at the elements in the proper bucket like

for (X x : xList) {
    List<Y> smallList = yBucket.get(x.someAttr());
    if (smallList != null) {
        for (Y y : smallList) {
            if (x.someAttr() == y.someAttr()) doSomething(x, y);
        }
    }
}

The comparison can be actually left out, as it's always true, but that's not the point. The speed comes from eliminating to looking at cases when equals would return false.

The complexity gets reduced from quadratic to linear plus the number of calls to doSomething.

Your case

Your data structure obviously does not fit. You're flattening your triplets into one list and this is wrong. You surely can work around it somehow, but creating a class Triplet {String verb, noun1, noun2} makes everything simpler. For storeAsserions, it looks like you're working with pairs. They seem to overlap, but that may be a typo, anyway it doesn't matter. Let's use Triplets and Pairs.

Let me also rename your lists, so that the code fits better in this tiny window:

for (Triplet x : fList) {
    for (Triplet y : fList) {
        for (Pair z : sList) {
            if (x.noun1.equals(z.noun1) && y.noun2.equals(z.noun2)) {
                doSomething();
            } else if (x.noun2.equals(z.noun1) && y.noun1.equals(z.noun2)) {
                doSomethingElse();
            }
        }
    }
}

Now, we need some loops over buckets, so that at least one of the equals tests is always true, so that we save us dealing with non-matching data. Let's concentrate on the first condition

x.noun1.equals(z.noun1) && y.noun2.equals(z.noun2)

I suggest a loop like

for (Pair z : sList) {
    for (Triplet x : smallListOfTripletsHavingNoun1SameAsZ) {
        for (Triplet y : smallListOfTripletsHavingNoun2SameAsZ) {
            doSomething();
        }
    }
}

where the small lists get computes like in the first section.

No non-matching entries get ever compared, so the complexity gets reduced from cubic to the number of matches (= to the number if lines you code would print).

Addendum -yBuckets

Let's assume xList looks like

[
  {id: 1, someAttr: "a"},
  {id: 2, someAttr: "a"},
  {id: 3, someAttr: "b"},
]

Then yBuckets should be

{
  "a": [
    {id: 1, someAttr: "a"},
    {id: 2, someAttr: "a"},
  ],
  :b": [
    {id: 3, someAttr: "b"},
  ],
}

One simple way, how to create such a Map is

yList.forEach(y -> yBuckets.compute(y.someAttr(), (k, v) ->
   (v==null ? new ArrayList<>() : v).add(y));

In plaintext:

  • For each y from yList,
  • get a corresponding map entry in the form of (k, v),
  • when v is null, then create a new List
  • otherwise work with the List v
  • In any case, add y to it
  • and store it back to the Map (which is a no-op unless when a new List was created in the third step).

Upvotes: 3

Related Questions