PS1
PS1

Reputation: 135

Efficient algorithm to find matching common words or phrases from a corpus

I am trying to find an efficient way to find common phrases. I think I can explain better with an example.

Input: Consider each line as sentence

B
B C
A B C B
D E
F D E

Output:

B
D E

Line 2 and 3 are eliminated as B (line 1) is common to them. Line 5 is eliminated as Line 4 is common.

I hope I have explained it!

I can run O(n^2) by doing a match. Appreciate anything better.

Update: Please consider the order (Eg D E should match sentence F D E. E D should not.)

Upvotes: 0

Views: 431

Answers (1)

Mike Elofson
Mike Elofson

Reputation: 2037

The fastest way I can think to do it would be something like:

public static void main(String[] args) throws Exception {

    List<String> toOutput = new ArrayList<String>();
    BufferedReader br = new BufferedReader(new FileReader("input.txt"));
    String line;
    while ((line = br.readLine()) != null) {
        boolean add = true;

        for (int i = 0; i < toOutput.size(); i++) {
            if (toOutput.get(i).contains(line)) {
                toOutput.remove(i);
            } else if (line.contains(toOutput.get(i))) {
                add = false;
                break;
            }
        }

        if (add) {
            toOutput.add(line);
        }
    }
    br.close();

    for (String s : toOutput) {
        System.out.println(s);
    }
}

input.txt:

B
B C
A B C B
F D E
D E

output:

B
D E

Verify whether or not the current sentence contains any of the Strings that we have currently found unique. I do not believe that there is a more efficient manner of doing it.

Upvotes: 1

Related Questions