Cael
Cael

Reputation: 556

Count the occurence of word in a list containing sentences

I am having some problem with Java programming which includes List. Basically, what I am trying to count the occurences of each word in a sentence from a list containing several sentences. The code for the list containing sentences is as below:

List<List<String>> sort = new ArrayList<>();
for (String sentence : complete.split("[.?!]\\s*"))
{
    sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
}

The output from the list is as follows:

[hurricane, gilbert, head, dominican, coast]
[hurricane, gilbert, sweep, dominican, republic, sunday, civil, defense, alert, heavily, populate, south, coast, prepare, high, wind]
[storm, approach, southeast, sustain, wind, mph, mph]
[there, alarm, civil, defense, director, a, television, alert, shortly]

The output desired should be as follows (only an example). It will output all the unique word in the list and calculate the occurences by sentences.

Word: hurricane
Sentence 1: 1 times
Sentence 2: 1 times
Sentence 3: 0 times
Sentence 4: 0 times

Word: gilbert
Sentence 1: 0 times
Sentence 2: 2 times
Sentence 3: 1 times
Sentence 4: 0 times 

Word: head
Sentence 1: 3 times
Sentence 2: 2 times
Sentence 3: 0 times
Sentence 4: 0 times 

and goes on....

With the example above, the word 'hurricane' occur 1 time in the first sentence, 1 time in second sentence, none in third sentence and none in forth sentence. How do I achieve the output? I was thinking of a 2D matrices for building them. Any help will be appreciated. Thanks!

Upvotes: 0

Views: 390

Answers (3)

birkett
birkett

Reputation: 10101

This is a working solution. I did not take care of the printing. The result is a Map -> Word, Array. Where Array contains the count of Word in each sentence indexed from 0. Runs in O(N) time. Play here: https://repl.it/Bg6D

    List<List<String>> sort = new ArrayList<>();
    Map<String, ArrayList<Integer>> res = new HashMap<>();

    // split by sentence
    for (String sentence : someText.split("[.?!]\\s*")) {
        sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
    }

    // put all word in a hashmap with 0 count initialized
    final int sentenceCount = sort.size();
    sort.stream().forEach(sentence -> sentence.stream().forEach(s -> res.put(s, new ArrayList<Integer>(Collections.nCopies(sentenceCount, 0)))));

    int index = 0;
    // count the occurrences of each word for each sentence.
    for (List<String> sentence: sort) {
        for (String s : sentence) {
            res.get(s).set(index, res.get(s).get(index) + 1);
        }
        index++;
    }

EDIT: In answer to your comment.

  List<Integer> getSentence(int sentence, Map<String, ArrayList<Integer>> map) {
     return map.entrySet().stream().map(e -> e.getValue().get(sentence)).collect(Collectors.toList());
  }

Then you can call

List<Integer> sentence0List = getSentence(0, res);

However be aware that this approach is not optimal since it runs in O(K) time with K being the number of sentences. For small K it is totally fine but it does not scale. You have to clarify yourself what will you do with the result. If you need to call getSentence many times, this is not the correct approach. In that case you will need the data structured differently. Something like

Sentences = [
         {'word1': N, 'word2': N},... // sentence 1 
         {'word1': N, 'word2': N},... // sentence 2

]

So you are able to easily access the word count per each sentence.

EDIT 2: Call this method:

  Map<String, Float> getFrequency(Map<String, ArrayList<Integer>> stringMap) {
    Map<String, Float> res = new HashMap<>();
    stringMap.entrySet().stream().forEach(e -> res.put(e.getKey()
                , e.getValue().stream().mapToInt(Integer::intValue).sum() / (float)e.getValue().size()));
    return res;
  }

Will return something like this:

{standard=0.25, but=0.25, industry's=0.25, been=0.25, 1500s=0.25, software=0.25, release=0.25, type=0.5, when=0.25, dummy=0.5, Aldus=0.25, only=0.25, passages=0.25, text=0.5, has=0.5, 1960s=0.25, Ipsum=1.0, five=0.25, publishing=0.25, took=0.25, centuries=0.25, including=0.25, in=0.25, like=0.25, containing=0.25, printer=0.25, is=0.25, t

Upvotes: 1

MAZDAK
MAZDAK

Reputation: 583

First you can segment your sentences and then tokenize them using a text segmentor such as NLTK or Stanford tokenizer. Splitting the string (containing sentences) around "[.?!]" is not a good idea. What happens to an "etc." or "e.g." that occurs in the middle of the sentence? Splitting a sentence around "[ ,;:]" is also not a good idea. You can have plenty of other symbols in a sentence such as quotation marks, dash and so on.

After segmentation and tokenization you can split your sentences around space and store them in a List<List<String>>:

List<List<String>> sentenceList = new ArraList();

Then for your index you can create a HashMap<String,List<Integer>>:

HashMap<String,List<Integer>> words = new HashMap();

Keys are all words in all sentences. Values you can update as follows:

for(int i = 0 ; i < sentenceList.size() ; i++){
    for(String w : words){
        if(sentence.contains(w)){
           List tmp = words.get(w);
           tmp.get(i)++; 
           words.put(w, tmp);
         }
    }
}

This solution has the time complexity of O(number_of_sentences*number_of_words) which is equivalent to O(n^2). An optimized solution is:

for(int i = 0 ; i < sentenceList.size() ; i++){
    for(String w : sentenceList.get(i)){
        List tmp = words.get(w);
        tmp.get(i)++; 
        words.put(w, tmp);
    }
}

This has the time complexity of O(number_of_sentences*average_length_of_sentences). Since average_length_of_sentences is usually small this is equivalent to O(n).

Upvotes: 0

Jankapunkt
Jankapunkt

Reputation: 8423

You could solve your problem by first creating an index for each word. You could use a Hashmap and put just put all the single words on it, which you find in your text (so you would have no need for checking double occurrences).

Then you can iterate the HashMap and check for every Word in every sentence. You can count occurrences by using the indexOf method of your list. As long as it returns a value greater than -1 you can count up the occurrence in the sentence. This method does only return the first occurrence so you

Some Pseudocode would be like:

Array sentences = text.split(sentence delimiter)

for each word in text
    put word on hashmap

for each entry in hashmap
   for each sentence
       int count = 0
       while subList(count, sentence.length) indexOf(entry) > -1
          count for entry ++

Note that this is very greedy and not performance oriented at all. Oh yea, and also note, that there are some java nlp libraries out there which may have already solved your problem in a performance oriented and reusable way.

Upvotes: 0

Related Questions