Hari Ram

Reputation: 305

Find space separated names using Apache OpenNLP

I am using the NER (named entity recognition) component of Apache OpenNLP and have successfully trained a model on my custom data. When using the name finder, I split the input string on whitespace and pass the resulting string array, as shown below.

NameFinderME nameFinder = new NameFinderME(model);
String[] sentence = input.split(" "); // e.g. input = "Give me list of test case in project X"
Span[] nameSpans = nameFinder.find(sentence);

With this split, test and case become separate tokens and are never detected by the name finder. How can I overcome this? Is there a way to pass the complete string (without splitting it into an array) so that test case is treated as a single unit?

Upvotes: 4

Views: 450

Answers (1)

Iakovos

Reputation: 1982

You can do it using regular expressions. Try replacing the second line with this:

String []sentence = input.split("\\s(?<!(\\stest\\s(?=case\\s)))");

Maybe there is a better way to write the expression, but this works for me and the output is:

Give
me
list
of
test case
in
project
X

EDIT: If you are interested in the details, see this demo of where the split happens: https://regex101.com/r/6HLBnL/1

EDIT 2: If you have many word pairs that should not be separated, I wrote a method that generates the regex for you. This is what the regex should look like in this case (if you don't want to separate 'test case' and 'in project'):

\s(?<!(\stest\s(?=case\s))|(\sin\s(?=project\s)))

The following simple program demonstrates it. In this example, you just put the word pairs that must stay together in the array unseparated.

class NoSeparation {

private static String[][] unseparated = {{"test", "case"}, {"in", "project"}};

private static String getRegex() {
    String regex = "\\s(?<!";

    // Add one alternative per pair: (\sFIRST\s(?=SECOND\s))
    for (int i = 0; i < unseparated.length; i++)
        regex += "(\\s" + unseparated[i][0] + "\\s(?=" + unseparated[i][1] + "\\s))|";

    // Remove the last |
    regex = regex.substring(0, regex.length() - 1);

    return (regex + ")");
}

public static void main(String[] args) {
    String input = "Give me list of test case in project X";
    String []sentence = input.split(getRegex());

    for (String i: sentence)
        System.out.println(i);
}
}
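
One limitation of the generated expression is that each pair must have whitespace on both sides, so a pair at the very start or end of the input is still split. A minimal sketch of a variant that uses word boundaries (\b) instead, and escapes the words with Pattern.quote, could look like the following. The class name BoundaryRegexDemo, the PAIRS array and buildRegex are hypothetical names chosen for illustration; this is not part of the original program.

```java
import java.util.regex.Pattern;

class BoundaryRegexDemo {

    // Hypothetical pairs that must stay together (same pairs as the program above).
    private static final String[][] PAIRS = {{"test", "case"}, {"in", "project"}};

    // Builds: \s(?<!\bFIRST\s(?=SECOND\b)|\bFIRST2\s(?=SECOND2\b)|...)
    // A whitespace character is a split point unless it sits between the two words of a pair.
    static String buildRegex() {
        StringBuilder regex = new StringBuilder("\\s(?<!");
        for (int i = 0; i < PAIRS.length; i++) {
            if (i > 0) {
                regex.append('|');
            }
            regex.append("\\b").append(Pattern.quote(PAIRS[i][0]))
                 .append("\\s(?=").append(Pattern.quote(PAIRS[i][1])).append("\\b)");
        }
        return regex.append(')').toString();
    }

    public static void main(String[] args) {
        // The pair "test case" is at the start of the input here.
        for (String token : "test case in project X".split(buildRegex())) {
            System.out.println(token);
        }
    }
}
```

With this input the sketch should print test case, in project and X on separate lines. Like the original, it only handles pairs of words.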

EDIT 3: The following is a quick-and-dirty way to handle phrases of more than two words. It works, but I am fairly sure it can be done more efficiently: it should be fine for short inputs but will probably be slow on longer ones.

You have to put the words that should not be split in a 2D array, as in unseparated. You should also choose a different separator if %% does not suit you (e.g. if there is a chance your input contains it).

class NoSeparation {

private static final String SEPARATOR = "%%";
private static String[][] unseparated = {{"of", "test", "case"}, {"in", "project"}};

private static String[] splitString(String in) {
    String[] splitted;

    for (int i = 0; i < unseparated.length; i++) {
        String toReplace = "";
        String replaceWith = "";
        for (int j = 0; j < unseparated[i].length; j++) {
            toReplace += unseparated[i][j] + ((j < unseparated[i].length - 1)? " " : "");
            replaceWith += unseparated[i][j] + ((j < unseparated[i].length - 1)? SEPARATOR : "");
        }

        // Plain-text replace: the phrase is a literal, not a regular expression
        in = in.replace(toReplace, replaceWith);
    }

    splitted = in.split(" ");

    // Restore the spaces inside the protected phrases
    for (int i = 0; i < splitted.length; i++)
        splitted[i] = splitted[i].replace(SEPARATOR, " ");

    return splitted;
}

public static void main(String[] args) {
    String input = "Give me list of test case in project X";
    // Uncomment this if the input may contain multiple spaces/tabs (\s already covers tabs)
    // input = input.replaceAll("\\s+", " ");

    for (String str: splitString(input))
        System.out.println(str);
}
}
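
As one possible alternative that avoids the placeholder separator entirely, the input can be split on whitespace first and the known phrases merged back together afterwards. This is a different technique from the program above, shown only as a minimal sketch; the names PhraseMerger, PHRASES, tokenize and longestPhraseAt are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseMerger {

    // Hypothetical phrases that must be kept as single tokens.
    private static final String[][] PHRASES = {{"of", "test", "case"}, {"in", "project"}};

    // Splits on whitespace, then merges any run of words that equals a known phrase.
    static String[] tokenize(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = longestPhraseAt(words, i);
            if (matched > 0) {
                tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
                i += matched;
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens.toArray(new String[0]);
    }

    // Returns the length of the longest phrase starting at words[start], or 0 if none matches.
    private static int longestPhraseAt(String[] words, int start) {
        int best = 0;
        for (String[] phrase : PHRASES) {
            if (phrase.length <= best || start + phrase.length > words.length) {
                continue;
            }
            boolean same = true;
            for (int k = 0; k < phrase.length && same; k++) {
                same = phrase[k].equals(words[start + k]);
            }
            if (same) {
                best = phrase.length;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Give me list of test case in project X")) {
            System.out.println(token);
        }
    }
}
```

Because it compares whole words, this sketch does not depend on a separator that might occur in the input, and it tolerates repeated spaces or tabs between words.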

Upvotes: 2
