lkilgoretrout

Reputation: 97

Java: counting occurrences of words, program counts 'empty' words

I have a program that reads a text file, removes the punctuation, splits each line on whitespace, and tallies the words into a map. It works, but I am also getting an entry for an empty string in the map and I don't know why:

A Scanner reads the input line by line:

try
        {
            Scanner input = new Scanner(file);
            String nextLine;
            while (input.hasNextLine())
            {
                nextLine = input.nextLine().trim();
                processLine(nextLine, occurrenceMap);
            }
            input.close();
        }
        catch(Exception e) { System.out.println("Something has gone wrong!");}

The text file it reads from is a King James Version of the Bible. A separate function then processes each line:

//String[] words = line.replaceAll("[^a-zA-Z0-9 ]", " ").toLowerCase().split("\\s+"); // runtime for  bible.txt is ~1600ms

// changed to simple iteration and the program ran MUCH faster:

char[] letters = line.trim().toCharArray();
for (int i=0; i<letters.length; i++)
{
    if (Character.isLetterOrDigit(letters[i])) {continue;}
    else {letters[i] = ' ';}
}

String punctuationFree = new String(letters);
String[] words = punctuationFree.toLowerCase().split("\\W+");

// add each word to the frequency map:
for (int i=0; i<words.length; i++)
{
    if (! map.containsKey(words[i]))
    {
        map.put(words[i], 1);
    }
    else
    {
        int value = (int)map.get(words[i]);
        map.put(words[i], ++value);
    }
}

As you can see, I first did it with replaceAll, and then I came up with my own iterative method (which seems to run faster). In both cases, when I print the results using PrintWriter, I get a strange entry at the beginning:

num occurences/ (number /word)

25307 :     // what is up with this empty value ?
1 : 000     // the results continue in sorted order
2830 : 1
2122 : 10
6 : 100
9 : 101
29 : 102
23 : 103
36 : 104
46 : 105
49 : 106

I have tried changing String[] words = punctuationFree.toLowerCase().split("\\W+"); to .split("\\s+") and .split(" "), but I still get this empty value in the results.

I only want to count occurrences of words and numbers. Why am I getting this empty value?

UPDATE: following the suggestion that Character.isLetterOrDigit() might be accepting unwanted characters, I rewrote the checks as follows so that only the characters I want are kept. I am nonetheless still getting the mystery empty value:

for (int i=0; i<letters.length; i++)
    {
        if ((letters[i] >= 'a' && letters[i] <= 'z') || 
           (letters[i] >= 'A' && letters[i] <= 'Z'))
           {continue;}
        else if (letters[i] >= '0' && letters[i] <= '9')
           {continue;}
        else if ((letters[i] == ' ')||(letters[i] =='\n')||(letters[i] == '\t'))
           {continue;}
        else
            letters[i] = ' ';
    }

Upvotes: 0

Views: 65

Answers (1)

Jeff

Reputation: 654

Just guessing, but Character.isLetterOrDigit is defined to work on the whole Unicode range, not only ASCII. Per the documentation page I found (the wording appears to come from the .NET Char.IsLetterOrDigit page; Java's Character.isLetterOrDigit is similarly broad and accepts any character for which Character.isLetter or Character.isDigit returns true): "Valid letters and decimal digits are members of the following categories in UnicodeCategory: UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter, OtherLetter, or DecimalDigitNumber."

I think this method is keeping characters you do not want (ModifierLetter and/or OtherLetter in particular), and they are not included in your font, so you cannot see them.

Edit 1: I tested your algorithm. It turns out that a blank line slips past your tests: its character array is empty, so the for loop never runs, and split() on the resulting empty string returns a single empty element, which is then counted. You need to add a line-length check just after you read a line from the file, like this:
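To illustrate that guess, here is a small sketch comparing Character.isLetterOrDigit with an ASCII-only check. The class name and the isAsciiLetterOrDigit helper are made up for this example, and the sample characters are my own picks, not taken from the Bible text.

```java
final class LetterOrDigitDemo {
    // Hypothetical helper: true only for ASCII letters and digits.
    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // 'a', '7', e-acute (U+00E9), Arabic-Indic digit three (U+0663), hyphen
        char[] samples = {'a', '7', '\u00e9', '\u0663', '-'};
        for (char c : samples) {
            System.out.printf("U+%04X  isLetterOrDigit=%b  asciiOnly=%b%n",
                    (int) c,
                    Character.isLetterOrDigit(c),
                    isAsciiLetterOrDigit(c));
        }
    }
}
```

The two checks agree on plain ASCII and differ on U+00E9 and U+0663, which Character.isLetterOrDigit accepts. Whether such characters actually occur in your file is something you would have to check.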

if (nextLine.length() == 0) {continue;}
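A quick sketch of where the empty entry can come from. Besides blank lines, a line that begins with a separator (for example punctuation that was turned into a space) also makes split() return a leading empty string, so filtering empty tokens may be more robust than the length check alone. The class name and the tokenize helper are hypothetical, and the sample line is invented.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitEmptyDemo {
    // Hypothetical helper: splits on non-word characters and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Blank line: split returns one element, the empty string itself.
        System.out.println("".split("\\W+").length);
        // Leading separator: the first element is an empty string.
        System.out.println(" and".split("\\W+").length);
        // With the filter, neither case produces an empty word.
        System.out.println(tokenize(""));
        System.out.println(tokenize("(And God said,"));
    }
}
```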

Edit 2: Also, since you are already scanning every character to weed out the non-letters and non-digits, you could build the words in the same loop and add them to the collection directly. Something like this, maybe:

private static void WordSplitTest(String line) {
    char[] letters = line.trim().toCharArray();

    boolean gotWord = false;

    String word = "";

    for (int i = 0; i < letters.length; i++) {
        if (!Character.isLetterOrDigit(letters[i])) {
            // separator: flush the word collected so far, if any
            if (!gotWord) {continue;}

            gotWord = false;

            AddWord(word);
            word = "";
        } else {
            // letter or digit: extend the current word (lower-cased)
            gotWord = true;
            word += Character.toLowerCase(letters[i]);
        }
    }
    // flush the last word when the line ends with a letter or digit
    if (gotWord) {AddWord(word);}
}

private static void AddWord(String word) {
    if (!map.containsKey(word)) {
        map.put(word, 1);
    } else {
        int value = (int) map.get(word);
        map.put(word, ++value);
    }
}
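As a possible shorter variant of AddWord, assuming the map can be declared as Map<String, Integer> (the question does not show its declaration), Map.merge does the lookup and the increment in one call. The class name, the addWord signature, and the use of TreeMap are assumptions for this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class WordCountMerge {
    // Hypothetical helper: counts one occurrence of word, ignoring empty strings.
    static void addWord(Map<String, Integer> counts, String word) {
        if (word.isEmpty()) {
            return;
        }
        // Inserts 1 if the word is absent, otherwise adds 1 to the stored count.
        counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : new String[] {"and", "", "god", "and"}) {
            addWord(counts, w);
        }
        System.out.println(counts);
    }
}
```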

Upvotes: 1
