banncee

Reputation: 969

Suggestions needed for optimizing O(n^2) algorithm

I am looking to optimize a fairly simple algorithm that is currently O(n^2). I have a file of records, where each one needs to be compared to every other record in the same file. If two records are the 'same' (the comparator function is pretty complicated), the matched records are output. Note that there may be several records that match each other, and there is no sense of order - only whether the match is True or False.

Pseudo code:


For (outRec in sourceFile) {
  Get new filePointer for targetFile //starting from the top of the file for inner loop
  For (inRec in targetFile) {
    if (compare(outRec, inRec) == TRUE ) {
      write outRec
      write inRec
    }
    increment some counters
  }
  increment some other counters
}

The data is not sorted in any way, and there is no preprocessing possible to order the data.

Any ideas on how this could become something less than O(n^2)? I am thinking of applying the MapReduce paradigm to the code, breaking up the outer AND inner loops, possibly using a chained Map function. I am pretty sure I have the code figured out on Hadoop, but wanted to check alternatives before I spent time coding it.

Suggestions appreciated!

Added: Record types. Basically, I need to match names/strings. The types of matching are shown in the example below.


1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster

Expected output:
Records 1,4,5 form a match set
End of output

Added: These files will be quite large. The largest file is expected to be around 200 million records.

Upvotes: 7

Views: 9104

Answers (10)

han

Reputation: 755

As noted by btilly, you don't actually need transitivity to classify the records. In the case of English people's names, you could represent each name by its two initials, and each record by a sorted list of those initial pairs. Then you only need to run the full O(N^2) comparison between records in the same class. There is the additional issue that the same pair of records can appear in multiple classes, but that is easy to detect by maintaining a separate collection of matched record pairs (identified by record indices).

In the example, you'd put record 1 in class "DF,JS", record 2 in class "DL,NJ", record 3 in class "JC,NJ", record 4 in classes "DJ,JS", "JF,JS" and "DF,JS", and record 5 in classes "DF,JM", "DF,JS" and "DF,MS". You get a total of 7 classes: "DF,JM", "DF,MS", "DF,JS", "DJ,JS", "DL,NJ", "JC,NJ", "JF,JS", of which only class "DF,JS" contains multiple records, namely records 1, 4 and 5. Thus in this example you only need to run the full comparison function twice.
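A minimal sketch of that blocking step (in Java, using the question's five example records; the pair-of-initials class key and the dedup set are the pieces described above, everything else is illustrative scaffolding):

import java.util.*;

public class InitialBlocking {

    // all two-initial codes a name can produce, e.g. "Daniel Jack Foster" -> DJ, DF, JF
    static Set<String> initialCodes(String name) {
        String[] parts = name.trim().split("\\s+");
        Set<String> codes = new HashSet<>();
        for (int i = 0; i < parts.length; i++)
            for (int j = i + 1; j < parts.length; j++)
                codes.add("" + Character.toUpperCase(parts[i].charAt(0))
                             + Character.toUpperCase(parts[j].charAt(0)));
        return codes;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Joe Smith", "Daniel Foster"},        // 1
            {"Nate Johnson", "Drew Logan"},        // 2
            {"Nate Johnson", "Jack Crank"},        // 3
            {"Joey Smyth", "Daniel Jack Foster"},  // 4
            {"Joe Morgan Smith", "Daniel Foster"}  // 5
        };

        // class key such as "DF,JS" -> ids of records that fall into that class
        Map<String, Set<Integer>> classes = new HashMap<>();
        for (int id = 0; id < records.length; id++)
            for (String a : initialCodes(records[id][0]))
                for (String b : initialCodes(records[id][1])) {
                    String key = a.compareTo(b) < 0 ? a + "," + b : b + "," + a;
                    classes.computeIfAbsent(key, k -> new TreeSet<>()).add(id + 1);
                }

        // only classes with more than one member need the expensive compare();
        // seenPairs keeps the same pair from being compared twice across classes
        Set<String> seenPairs = new HashSet<>();
        for (Set<Integer> members : classes.values()) {
            List<Integer> ids = new ArrayList<>(members);
            for (int i = 0; i < ids.size(); i++)
                for (int j = i + 1; j < ids.size(); j++)
                    if (seenPairs.add(ids.get(i) + ":" + ids.get(j)))
                        System.out.println("candidate pair: " + ids.get(i) + "," + ids.get(j));
        }
    }
}

On the example data this prints only the candidate pairs (1,4), (1,5) and (4,5), which is where the full comparison function would then be run.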

On the other hand, there is the problem that people have weird names. This blog entry on the subject is worth a look if you haven't seen it before. Whatever you do, you are going to miss some matches.

Upvotes: 1

banncee

Reputation: 969

Thanks to all the great suggestions.
After going through the options, it seems the best approach given my timeline is to use a MapReduce framework to parallelize the problem and throw more hardware at it. I realize this doesn't reduce the O(n^2) complexity.
The only possible solution I can think of is to run some type of minHash on the data, stripe the data into overlapping sections, and compare within the stripes and overlaps. This should reduce the number of comparisons, but I'm not sure how expensive the hashing run is going to be.
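For reference, a rough sketch of that minHash idea (character-trigram shingles, a short signature, and LSH-style banding so that only records landing in the same bucket get the full compare; the signature length, band size, and hash mixing below are untuned guesses, not a recommendation):

import java.util.*;

public class MinHashBlocking {
    static final int NUM_HASHES = 16;  // signature length (guess)
    static final int BAND_SIZE = 4;    // rows per LSH band (guess)

    // MinHash signature over character 3-gram shingles of the record text
    static int[] signature(String text) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            int shingle = s.substring(i, i + 3).hashCode();
            for (int h = 0; h < NUM_HASHES; h++) {
                // cheap hash-function family: mix the shingle with a per-row seed
                int v = Integer.rotateLeft(shingle * (2 * h + 1) + 0x9e3779b9 * h, h);
                if (v < sig[h]) sig[h] = v;
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "Joe Smith,Daniel Foster",
            "Nate Johnson,Drew Logan",
            "Joey Smyth,Daniel Jack Foster",
            "Joe Morgan Smith,Daniel Foster");

        // bucket key = band index + hash of that band's signature slice
        Map<String, List<Integer>> buckets = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            int[] sig = signature(records.get(id));
            for (int b = 0; b < NUM_HASHES / BAND_SIZE; b++) {
                int[] band = Arrays.copyOfRange(sig, b * BAND_SIZE, (b + 1) * BAND_SIZE);
                buckets.computeIfAbsent(b + ":" + Arrays.hashCode(band),
                                        k -> new ArrayList<>()).add(id);
            }
        }

        // only records sharing a bucket are candidates for the full compare()
        for (List<Integer> ids : buckets.values())
            if (ids.size() > 1)
                System.out.println("candidate group: " + ids);
    }
}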

Upvotes: 0

comingstorm

Reputation: 26117

If you really can't do any better than an opaque equivalence relation, then your worst case will always be O(n^2) -- e.g., for the case where there are no matches, you will need to compare every pair to make sure of that. (As people have mentioned, you can parallelize this, but that's still not going to be particularly tractable for hundreds of millions of records; it may not be worth the cost of the resources needed to do it that way.)

Usually, though, there is some better way to do it.

If you really do have an equivalence relation (that is, if you have some logical guarantee that if match(a,b)=match(b,c)=true, then match(a,c) is also true), there is probably some canonical form that you can convert your records to, which is amenable to hashing and/or ordering.

In your example, you seem to be matching on variants of "Joe Smith". If that is the case, you can probably augment your comparison criteria to choose one particular member of the equivalence class to represent the whole. E.g., choose "JOSEPH" to represent all names equivalent to "Joe", "SMITH" to represent all names equivalent to "Smythe", etc.

Once you make that transformation, you can use a hash table to reduce your operation to O(n), rather than O(n^2).
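A sketch of that, where canonicalize() and its variant table are purely hypothetical stand-ins for whatever nickname/spelling normalization your comparison criteria imply:

import java.util.*;

public class CanonicalGrouping {
    // Hypothetical canonicalizer: uppercase, drop middle names, map nicknames
    // and spelling variants to one representative form per equivalence class.
    static final Map<String, String> VARIANTS = Map.of(
            "JOE", "JOSEPH", "JOEY", "JOSEPH", "SMYTH", "SMITH");

    static String canonicalize(String fullName) {
        String[] parts = fullName.trim().toUpperCase().split("\\s+");
        String first = VARIANTS.getOrDefault(parts[0], parts[0]);
        String last = VARIANTS.getOrDefault(parts[parts.length - 1], parts[parts.length - 1]);
        return first + " " + last;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Joe Smith", "Joey Smyth", "Joe Morgan Smith", "Nate Johnson");

        // one hash lookup per record instead of comparing against every other record
        Map<String, List<String>> groups = new HashMap<>();
        for (String n : names)
            groups.computeIfAbsent(canonicalize(n), k -> new ArrayList<>()).add(n);

        groups.values().stream().filter(g -> g.size() > 1)
              .forEach(g -> System.out.println("match set: " + g));
    }
}

Here the three "Joe Smith" variants all map to the key "JOSEPH SMITH" and come out as one match set in a single O(n) pass.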

Upvotes: 0

Mark Ransom

Reputation: 308530

You haven't mentioned what percentage of the input is expected to match, or how often you might get an exact match versus an inexact one. If you can do a little pre-processing to reduce the problem size, it might be a big help.

If you simply sort the input and run your comparison function on adjacent entries, you might cull enough duplicates to make the n^2 second pass bearable.
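For example, a pre-pass along these lines (the case-insensitive sort and comparator are placeholders; the point is that the sort order should put records that compare equal next to each other):

import java.util.*;

public class DedupPrePass {
    // stand-in for the real (expensive) comparator
    static boolean compare(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Sort, then compare only adjacent entries; duplicates collapse into one
    // representative before the expensive all-pairs pass.
    static List<String> cullAdjacentDuplicates(List<String> records) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);     // O(n log n)
        List<String> culled = new ArrayList<>();
        for (String rec : sorted) {
            if (culled.isEmpty() || !compare(culled.get(culled.size() - 1), rec))
                culled.add(rec);                        // keep one representative per run
        }
        return culled;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("Joe Smith", "joe smith", "Nate Johnson", "Joe Smith");
        System.out.println(cullAdjacentDuplicates(in)); // [Joe Smith, Nate Johnson]
    }
}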

Upvotes: 0

Thomas Jungblut

Reputation: 20969

As you've already mentioned, you won't be lucky enough to get better than O(n^2), but you can parallelize this.

I have a working solution that runs against HDFS; you can extend it by using the distributed cache.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatchImporter extends Mapper<LongWritable, Text, Text, Text> {

    private FileSystem fs;
    private BufferedReader stream;

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        fs = FileSystem.get(context.getConfiguration());
    }

    // reopen the target file so every source record is compared against all target records
    private void resetFile() throws IOException {
        if (stream != null)
            stream.close();
        stream = new BufferedReader(new InputStreamReader(fs.open(new Path(
                "files/imp/in/target.txt"))));
    }

    private boolean compare(Text in, String target) {
        return target.contains(in.toString());
    }

    enum Counter {
        PROGRESS
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        resetFile();
        String line = null;
        while ((line = stream.readLine()) != null) {
            // increment a counter so the framework doesn't kill the task for inactivity
            context.getCounter(Counter.PROGRESS).increment(1);
            context.progress();
            if (compare(value, line)) {
                context.write(new Text(line), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);

        job.setMapperClass(MatchImporter.class);
        job.setReducerClass(Reducer.class);
        job.setJarByClass(MatchImporter.class);

        Path in = new Path("files/imp/in/source.txt");
        Path out = new Path("files/imp/out/");

        FileInputFormat.addInputPath(job, in);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out))
            fs.delete(out, true);

        SequenceFileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.waitForCompletion(true);
    }
}

Using the following input in source.txt:

thomas
phil
jen
james
christian
stefan
stephan
john

and target.txt

john meat
jay hardly

will result in reducer output of:

john meat   john

The trick is that you can split your source.txt and do the compare stuff in parallel. That will give you a speedup, but won't improve the big-O complexity.

One big note here: you have to report progress using a counter, because comparing against a whole file can take forever. This will prevent your task from failing in a distributed environment.

Little tip: try to split your source.txt into 64 MB chunks and turn target.txt into a SequenceFile. This will gain a lot of speedup, but you will have to rewrite the reading logic then.

Wish you the best of luck!

Upvotes: 1

btilly

Reputation: 46507

FYI MapReduce will not improve the algorithmic complexity of the solution. It adds some overhead, but then parallelizes the work so that you can use the necessary resources in less wall-clock time.

To improve your wall-clock time, the #1 thing to do is to find ways to avoid having to run the comparison. Any way of doing that will be a win. And even if your comparison logic is complex, you can still use sorting to help.

For instance, suppose that you have some dimension that the data is spread out in. Data that varies by too much in that dimension is guaranteed not to compare equal, though being close in that dimension does not guarantee equality. Then what you can do is sort your data by that dimension, and then only run comparisons between elements that are close in that dimension. Voila! Most of the O(n*n) comparisons are now gone.
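As a sketch of that one-dimensional version, suppose the dimension is record length and matching records can differ in length by at most a few characters (both assumptions are only for illustration):

import java.util.*;

public class SortedWindowCompare {
    static final int MAX_DELTA = 3;  // assumed: matches never differ by more than this in the dimension

    static boolean compare(String a, String b) {
        return a.equalsIgnoreCase(b);  // stand-in for the real comparator
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>(Arrays.asList(
                "Joe Smith", "joe smith", "Nathaniel Johnson", "Drew Logan"));

        // 1. sort by the dimension (here: string length)
        records.sort(Comparator.comparingInt(String::length));

        // 2. only compare pairs whose dimension values are within MAX_DELTA of each other
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size()
                    && records.get(j).length() - records.get(i).length() <= MAX_DELTA; j++) {
                if (compare(records.get(i), records.get(j)))
                    System.out.println(records.get(i) + " <-> " + records.get(j));
            }
        }
    }
}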

Let's make it more complex. Suppose you can identify two such dimensions that are independent from each other. Sort your data along the first such dimension. Divide the data in the first dimension into strips. (Make the strips overlap by the maximum amount by which two records can vary in that dimension and still compare equal.) Now take each strip and sort it by the second dimension. Then run comparisons between pairs of elements that are acceptably close in that dimension, and include a pair in your answer if it compares equal and this is the first strip it could appear in. (That dedup logic is needed because the overlap means a pair that compares equal can appear in multiple strips.) This is likely to be even better than the first approach because you've managed to narrow things down so that you're only comparing rows with a small number of "nearby" rows.

If you want to use less resources, you need to focus on ways to avoid having to actually make individual comparisons. Anything that you come up with on that path will help.

Upvotes: 2

Frank

Reputation: 10571

I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:

  1. Create a map for the input file and use the comparator function as the key comparator of the map. The map values are a sequence/list of rows (i.e. all rows that are 'same' get successively added to the same map entry). Takes O(n log n) time.
  2. Walk through the other file's rows and check if each row matches a key in the map. If it does, then due to the equivalence relation implied by your comparator you know that this row is the 'same' as all the rows in the value of that map entry. Takes O(n log n + C), where C depends on how many matches you have to output. (A sketch of both steps follows below.)
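A rough sketch of those two steps, under the strong assumption that your comparator can be extended to a total order that returns 0 exactly when two rows are 'same' (faked here with case-insensitive string ordering):

import java.util.*;

public class ComparatorMap {
    // Assumed: a total order consistent with the equivalence ("same" rows compare as 0).
    static final Comparator<String> SAME_ORDER = String.CASE_INSENSITIVE_ORDER;

    public static void main(String[] args) {
        List<String> sourceRows = Arrays.asList("Joe Smith", "JOE SMITH", "Nate Johnson");
        List<String> targetRows = Arrays.asList("joe smith", "Drew Logan");

        // 1. group the first file's rows: O(n log n)
        TreeMap<String, List<String>> groups = new TreeMap<>(SAME_ORDER);
        for (String row : sourceRows)
            groups.computeIfAbsent(row, k -> new ArrayList<>()).add(row);

        // 2. probe with the second file's rows: O(n log n + C), C = output size
        for (String row : targetRows) {
            List<String> same = groups.get(row);   // lookup uses SAME_ORDER, not equals()
            if (same != null)
                System.out.println(row + " matches " + same);
        }
    }
}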

Note that in the worst case, according to your problem description, you cannot get any better than O(n^2), simply because there may be O(n^2) results of matching records that you have to output!

Upvotes: 4

Dennis

Reputation: 2616

We'd need to know more about your comparison function. Is your comparison transitive? (That is, does A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)

If your comparison function is transitive and symmetric, and it is common for many records to be equal, then you could bin your records into groups by comparing them to one "representative sample" of the group. That could approach O(N) in the best case.
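A sketch of that binning pass; each record is compared against only one representative per existing group, so it approaches O(N) when the number of groups stays small:

import java.util.*;

public class RepresentativeBinning {
    static boolean compare(String a, String b) {
        return a.equalsIgnoreCase(b);  // stand-in for the real transitive/symmetric comparator
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("Joe Smith", "JOE SMITH", "Nate Johnson", "joe smith");

        List<List<String>> groups = new ArrayList<>();
        for (String rec : records) {
            boolean placed = false;
            for (List<String> group : groups) {
                // compare only against the group's first member (its "representative sample")
                if (compare(group.get(0), rec)) {
                    group.add(rec);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                groups.add(new ArrayList<>(List.of(rec)));  // start a new group
            }
        }

        groups.stream().filter(g -> g.size() > 1)
              .forEach(g -> System.out.println("match set: " + g));
    }
}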

Note that hashing the records requires that compare(A, B) == true implies hash(A) == hash(B), but if compare(A, B) can be true even when bytes(A) != bytes(B), it might be tricky to design an appropriate hashing algorithm.

Upvotes: 2

tskuzzy

Reputation: 36476

Just go through each record of your file and insert it into a hash table. At each step, check to see if the record is already in the hash table. If it is, then output it. This can be done in O(n).
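That only works if matching records reduce to an identical key (e.g. after the kind of canonicalization discussed in other answers), but then it really is a single pass:

import java.util.*;

public class HashTablePass {
    public static void main(String[] args) {
        // assumes records that should match have already been reduced to an identical key
        List<String> records = Arrays.asList("JOSEPH SMITH", "NATE JOHNSON", "JOSEPH SMITH");

        Map<String, String> seen = new HashMap<>();
        for (String rec : records) {
            String previous = seen.putIfAbsent(rec, rec);
            if (previous != null) {                 // already in the table: output the match
                System.out.println(previous + " == " + rec);
            }
        }
    }
}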

Upvotes: 1

Joe

Reputation: 42666

Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line number (or file pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
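A sketch of that, assuming a row hash for which matching rows produce the same hash value (here just String.hashCode, purely for illustration):

import java.util.*;

public class HashSortScan {
    public static void main(String[] args) {
        String[] rows = {"JOSEPH SMITH", "NATE JOHNSON", "JOSEPH SMITH", "DREW LOGAN"};

        // 1. one pass: record (hash, line number) pairs
        long[][] hashAndLine = new long[rows.length][2];
        for (int line = 0; line < rows.length; line++) {
            hashAndLine[line][0] = rows[line].hashCode();  // assumed: matching rows hash equal
            hashAndLine[line][1] = line;
        }

        // 2. sort by hash so duplicates become adjacent
        Arrays.sort(hashAndLine, Comparator.comparingLong(p -> p[0]));

        // 3. scan for runs of equal hashes and report the line numbers involved
        for (int i = 1; i < hashAndLine.length; i++) {
            if (hashAndLine[i][0] == hashAndLine[i - 1][0]) {
                System.out.println("lines " + hashAndLine[i - 1][1] + " and "
                        + hashAndLine[i][1] + " share a hash -> verify with compare()");
            }
        }
    }
}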

Upvotes: 2
