Sharon Torao

Reputation: 11

Search for text contained in one text file and remove it from another text file in Java

I have a text file that was output by a Java program. The program finds the frequency of people's names mentioned in multiple documents and writes them to a file (peopleNames.txt) like this:

article1location\article1 name1:countofname1# name2:countofname2# name3:countofname3# ...
article2location\article2 name1:countofname1# name2:countofname2# name3:countofname3# ...
article3location\article3 name1:countofname1# name2:countofname2# name3:countofname3# ...

The names are the people identified in each article, along with how often each one appears in that article. There are about 90,000 articles. I have another text file (titles.lst) that contains a list of about 40 titles and their abbreviations (such as Mr., Mrs., President, Sir, etc.). I would like to use this list to find these titles in peopleNames.txt and remove them. I am new to Java and not sure how to go about it; I need to modify the original Java code that produced peopleNames.txt to accommodate title removal.

My program treats a person such as Mr John Smith as different from John Smith, so removing the titles would give me a more accurate count of the names mentioned in the articles.

Thanks in advance for any help.

Upvotes: 1

Views: 114

Answers (2)

Philip Couling

Reputation: 14913

You can use regular expressions to remove all instances:

    public class Test {

        public static void main( String[] args ) throws Exception {
            String s = "Mr Tom and Ms Jane";
            // \b is a word boundary, so "Mr" does not match inside "Mrs".
            s = s.replaceAll("\\bMr\\b|\\bMs\\b", "");
            System.out.println(s);
        }
    }

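To illustrate the point raised in the comments, the regular expression can also be built from a list of titles (here taken from the command-line arguments):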

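    public static void main( String[] args ) throws Exception {
        // Titles are passed on the command line, e.g. "java Test Mr Ms".
        String [] titles = args;
        // Quote each title so characters such as '.' are matched literally.
        // Note: \b only matches next to a word character, so a title ending
        // in '.' (e.g. "Mr.") needs a different trailing boundary.
        String regex = "\\b" + java.util.regex.Pattern.quote(titles[0]) + "\\b";
        for (int i=1; i<titles.length; i++) {
            regex += "|\\b" + java.util.regex.Pattern.quote(titles[i]) + "\\b";
        }

        String s = "Mr Tom and Ms Jane";
        s = s.replaceAll(regex, "");
        System.out.println(s);
    }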

You can also call replaceAll once per title rather than building one combined regular expression. I don't actually know which is quicker; I would hazard a guess that it depends on the Java implementation.

    public static void main( String[] args ) throws Exception {
        String [] titles = args;
        String s = "Mr Tom and Ms Jane";
        // Start at index 0 and use titles[i] so that every title is removed.
        for (int i=0; i<titles.length; i++) {
            s = s.replaceAll("\\b"+titles[i]+"\\b", "");
        }
        System.out.println(s);
    }
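If the title list includes abbreviations with a trailing period (the question mentions Mr. and Mrs.), \b will not behave as expected after the dot. A rough sketch of one way around that, assuming the titles have already been read into a list: quote each title, try longer titles first, and use lookarounds instead of \b. The class and method names here (TitlePattern, compile, strip) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TitlePattern {

    // Builds one precompiled pattern from all titles (hypothetical helper).
    static Pattern compile(List<String> titles) {
        String alternation = titles.stream()
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                // Longest first, so "Mr." is tried before "Mr".
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                // Quote so that '.' and other metacharacters match literally.
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("no titles given");
        }
        // Lookarounds replace \b: a title must not touch a letter or digit
        // on either side. Whitespace after the title is removed as well.
        return Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])\\s*");
    }

    // Removes every title occurrence from the text.
    static String strip(String text, Pattern titlePattern) {
        return titlePattern.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("Mr", "Mr.", "Ms", "President"));
        System.out.println(strip("Mr. Tom and Ms Jane", p)); // prints: Tom and Jane
    }
}
```

Compiling the pattern once and reusing it for every line should also avoid rebuilding the regular expression 90,000 times, though I have not measured the difference.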

Upvotes: 3

DeadlyJesus

Reputation: 1533

This is what I would do:
1. Parse the titles.lst document and put every title in a Set.
2. Parse peopleNames.txt, and for every line check whether each name starts with a title from that Set.
3. If it does, remove the title.
4. Check for duplicate entries, since Mr. John Smith and John Smith will now be the same.
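The steps above could be sketched roughly as follows. This is only an outline under a few assumptions that the question does not confirm: titles.lst holds one title per line, the article location is the first space-delimited token on each line of peopleNames.txt, and every name:count entry ends with '#'. The names TitleRemover, loadTitles, stripTitles and cleanLine are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TitleRemover {

    // Step 1: read the titles into a Set (assumes one title per line).
    static Set<String> loadTitles(Path file) throws IOException {
        Set<String> titles = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.trim().isEmpty()) {
                titles.add(line.trim());
            }
        }
        return titles;
    }

    // Steps 2 and 3: drop leading words that are titles,
    // e.g. "Mr John Smith" -> "John Smith". The last word is always kept.
    static String stripTitles(String name, Set<String> titles) {
        String[] words = name.trim().split("\\s+");
        int start = 0;
        while (start < words.length - 1 && titles.contains(words[start])) {
            start++;
        }
        return String.join(" ", Arrays.copyOfRange(words, start, words.length));
    }

    // Step 4: rebuild one line, summing the counts of names that became equal.
    static String cleanLine(String line, Set<String> titles) {
        int firstSpace = line.indexOf(' ');
        if (firstSpace < 0) {
            return line; // no name entries on this line
        }
        String location = line.substring(0, firstSpace);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : line.substring(firstSpace + 1).split("#")) {
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                continue; // skip anything that is not name:count
            }
            String name = stripTitles(entry.substring(0, colon), titles);
            int count = Integer.parseInt(entry.substring(colon + 1).trim());
            counts.merge(name, count, Integer::sum);
        }
        StringBuilder out = new StringBuilder(location);
        counts.forEach((name, count) ->
                out.append(' ').append(name).append(':').append(count).append('#'));
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> titles = new HashSet<>(Arrays.asList("Mr", "Mr.", "Mrs", "Sir"));
        String line = "docs\\a1 Mr John Smith:2# John Smith:3# Mrs Jane Doe:1#";
        System.out.println(cleanLine(line, titles));
        // prints: docs\a1 John Smith:5# Jane Doe:1#
    }
}
```

A LinkedHashMap keeps the names in their original order while the counts are merged. Since the asker has the code that produces peopleNames.txt, it may be simpler to call something like stripTitles on each name before it is counted, so the duplicates never reach the file.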

Upvotes: 1
