Reputation: 1456

Count the occurrence of any number of characters in a file?

I have found several ways to count the occurrence of a single character in a file in Java. My question is simply this: is there any way to count the occurrence of any of the characters in a list in a file simultaneously, or am I going to have to loop through each character?

To clarify, I'm wanting something equivalent to: For each character in file, if character in list "abcdefg" increment 1.

Background: I'm counting predicates in a file, and the best method I could think of was to search for occurrences of <, >, ==, etc.

Upvotes: 2

Answers (6)

Tobias Ritzau

Reputation: 3327

Use a Map<Character, Integer> and go through the file. For every character, test whether it is already in the map. If it is not, add it with value 1; otherwise get the current value, increment it and put it back. Test both TreeMap and HashMap to see which works best for you. You then have a complete histogram and can easily add up the counts you are interested in.
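A minimal sketch of the histogram idea, assuming Java 8 or later (for Map.merge and getOrDefault). The class name CharHistogram and the helper total are made up for illustration, and the sample text stands in for a real file when no path is given:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class CharHistogram {
    // Counts every character in the text.
    static Map<Character, Integer> histogram(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            // merge() stores 1 for a new key, otherwise adds 1 to the stored value.
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Adds up the counts of the characters of interest (each listed once in 'wanted').
    static int total(Map<Character, Integer> counts, String wanted) {
        int sum = 0;
        for (char c : wanted.toCharArray()) {
            sum += counts.getOrDefault(c, 0);
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        // Reads the file named on the command line, or falls back to a sample string.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "a bag of figs";
        System.out.println(total(histogram(text), "abcdefg"));
    }
}
```

Swapping TreeMap for HashMap only changes the iteration order of the result, so it is easy to try both.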

Update: I saw that you are interested in finding sequences. If you want to do that with good performance, I would use a tool like lex, but for Java. A quick Google search led me to this one: http://www.cs.princeton.edu/~appel/modern/java/JLex/ It should be straightforward to define the tokens you are interested in, and then it should be very simple to count them.

Update 2: I couldn't resist playing with it. Here is a sample that seems to work using the above-mentioned tool (disclaimer: I haven't used the tool before, so this could be completely wrong...):

import java.lang.System;
import java.util.Map;
import java.util.TreeMap;

class Sample {
  public static void main(String argv[]) throws java.io.IOException {
    Map<String,Integer> map = new TreeMap<>();

    Yylex yy = new Yylex(System.in);
    Yytoken t;
    while ((t = yy.yylex()) != null) {
      String text = t.mText;

      if (!text.isEmpty()) {
        Integer i = map.get(text);
        if (i == null) {
          map.put(text, 1);
        }
        else {
          map.put(text, map.get(text)+1);
        }
      }
    } 

    System.out.println(map);
  }
}

class Yytoken {
  public String mText;

  Yytoken(String text) {
   mText = text;
  }

  public String toString() {
    return "Token: " + mText;
  }
}

%%

OTHER=(.|[\r\n])

%% 

<YYINITIAL> "," { return (new Yytoken(yytext())); }
<YYINITIAL> ":" { return (new Yytoken(yytext())); }
<YYINITIAL> ";" { return (new Yytoken(yytext())); }
<YYINITIAL> "(" { return (new Yytoken(yytext())); }
<YYINITIAL> ")" { return (new Yytoken(yytext())); }
<YYINITIAL> "[" { return (new Yytoken(yytext())); }
<YYINITIAL> "]" { return (new Yytoken(yytext())); }
<YYINITIAL> "{" { return (new Yytoken(yytext())); }
<YYINITIAL> "}" { return (new Yytoken(yytext())); }
<YYINITIAL> "." { return (new Yytoken(yytext())); }
<YYINITIAL> "+" { return (new Yytoken(yytext())); }
<YYINITIAL> "-" { return (new Yytoken(yytext())); }
<YYINITIAL> "*" { return (new Yytoken(yytext())); }
<YYINITIAL> "/" { return (new Yytoken(yytext())); }
<YYINITIAL> "=" { return (new Yytoken(yytext())); }
<YYINITIAL> "<>" { return (new Yytoken(yytext())); }
<YYINITIAL> "<"  { return (new Yytoken(yytext())); }
<YYINITIAL> "<=" { return (new Yytoken(yytext())); }
<YYINITIAL> ">"  { return (new Yytoken(yytext())); }
<YYINITIAL> ">=" { return (new Yytoken(yytext())); }
<YYINITIAL> "&"  { return (new Yytoken(yytext())); }
<YYINITIAL> "|"  { return (new Yytoken(yytext())); }
<YYINITIAL> ":=" { return (new Yytoken(yytext())); }
<YYINITIAL> "#" { return (new Yytoken(yytext())); }
<YYINITIAL> {OTHER} { return (new Yytoken("")); }

Upvotes: 4

Natix

Reputation: 14247

Storing

If I understand correctly, you would like to find the number of occurrences not only of single characters, but of short sequences of characters (i.e. Strings), such as ==. In that case, a Map<Character, Integer> is insufficient; you need a Map<String, Integer> to store a count for each string.

Alternatively, you can use Guava's Multiset, which is basically a nice interface for a collection that knows how many times it contains each duplicate (equal) element.

I assume that the set of predicates/operators/whatever-short-strings you want to count is known in advance, so you can define an array or a list that holds all of the predicates you are interested in, such as:

List<String> operators = Arrays.asList("==", "<=", ">=", "<", ">");

Then you would "pour" all those operators as keys to the map and initialize their values to zero:

Map<String, Integer> counts = new HashMap<>();
for (String operator : operators)
    counts.put(operator, 0);

Parsing

As for the parsing, you can easily read the file line-by-line using a Scanner. And for each line, you can use a method like this to count the number of times it contains a given sub-string:

static int occurrences(String source, String subString) {
    int count = 0;
    int index = source.indexOf(subString);

    while (index != -1) {
        count++;
        index = source.indexOf(subString, index + 1);
    }
    return count;
}

And then using this method in a similar fashion to this:

// try-with-resources closes the Scanner (and the underlying file) when done
try (Scanner scanner = new Scanner(new File("input.txt"))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        for (String operator : operators) {
            int oldOccurrences = counts.get(operator);
            counts.put(operator, oldOccurrences + occurrences(line, operator));
        }
    }
}
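One caveat with searching for each operator separately: a line containing <= also contains <, so the shorter operator gets counted too. A possible way around this, sketched here with java.util.regex rather than indexOf (so this is a swapped-in technique, not the method above), is to match all operators in one pass and list the two-character ones first. The class name OperatorCounter is made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class OperatorCounter {
    // Alternatives are tried left to right, so "<=" is matched before "<".
    private static final Pattern OPERATORS = Pattern.compile("==|<=|>=|<|>");

    // Counts each operator found in the line; matches never overlap.
    static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = OPERATORS.matcher(line);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("if (a <= b && c < d) return e == f;"));
    }
}
```

Operators that never occur are simply absent from the returned map, so callers may want getOrDefault(operator, 0) when reading it.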

Upvotes: 2

Aravind Yarram

Reputation: 80176

Reading

Since you want to count predicates that are more than one character long (==, !=, <=, >=), you will need a PushbackReader so that you can peek at the next character to determine the actual predicate.
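A rough sketch of the peek-and-push-back idea, assuming the predicates of interest are ==, !=, <=, >=, < and >. The class name PredicateScanner is made up for illustration, and a StringReader stands in for a FileReader so the example is self-contained:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class PredicateScanner {
    // Counts ==, !=, <=, >=, < and > while reading one character at a time.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '=' && c != '!' && c != '<' && c != '>') {
                    continue; // cannot start a predicate
                }
                int next = in.read(); // peek at the following character
                if (next == '=') {
                    counts.merge("" + (char) c + '=', 1, Integer::sum);
                } else {
                    if (next != -1) {
                        in.unread(next); // not part of this operator, put it back
                    }
                    if (c == '<' || c == '>') {
                        counts.merge(String.valueOf((char) c), 1, Integer::sum);
                    }
                    // A lone '=' or '!' is assignment or negation, so it is skipped.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("if (a <= b && c == d || e < f) x = !y;"));
    }
}
```

This sketch does not know about shift operators, generics, comments or string literals, so it will miscount in source code that uses them.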

Frequency of occurrence

If you can afford an additional dependency, my suggestion is to use a Multiset, which is meant for counting frequencies. If you can't, you can use a Map or an array-based counter (I prefer the latter if your predicate set is finite, as this simplifies the code).

Parallelize?

The above approach is simple because you get the frequencies in a single pass. If your file is huge, or you have to count frequencies across many files, you can parallelize the work using Java Executors.
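As a loose illustration of the Executors suggestion, the sketch below submits one counting task per input and adds up the partial results. The class name ParallelCounter is made up, and each String in the list stands in for the contents of one file; in a real program each task would read its own file instead:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, not part of any library.
class ParallelCounter {
    // Counts the characters of 'text' that appear in 'wanted'.
    static long countIn(String text, String wanted) {
        long n = 0;
        for (char c : text.toCharArray()) {
            if (wanted.indexOf(c) >= 0) {
                n++;
            }
        }
        return n;
    }

    // Runs one task per text on a fixed thread pool and sums the results.
    static long countAll(List<String> texts, String wanted) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (String text : texts) {
                Callable<Long> task = () -> countIn(text, wanted);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // waits for that task to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll(Arrays.asList("a < b", "c >= d", "x == y"), "<>="));
    }
}
```

For a single small file the thread overhead will likely outweigh any gain, so this is only worth trying when the single-pass version is measurably too slow.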

Upvotes: 2

Marko Topolnik

Reputation: 200168

A trivial way to do it is with an array:

final int[] occurs = new int[65536];
// text holds the file's contents as a String
for (char c : text.toCharArray()) occurs[c]++;

If you know you won't encounter any exotic characters, you can reduce the size of the array.
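For instance, if only ASCII characters matter, a smaller array could look like the sketch below. The class name AsciiCounter and the bounds check that skips characters outside the array are assumptions made for illustration:

```java
// Hypothetical helper class, not part of any library.
class AsciiCounter {
    // Tallies characters below 128 and ignores everything else.
    static int[] tally(String text) {
        int[] occurs = new int[128];
        for (char c : text.toCharArray()) {
            if (c < occurs.length) {
                occurs[c]++;
            }
        }
        return occurs;
    }

    // Adds up the tallies of the characters of interest.
    static int total(int[] occurs, String wanted) {
        int sum = 0;
        for (char c : wanted.toCharArray()) {
            if (c < occurs.length) {
                sum += occurs[c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(total(tally("a bag of figs"), "abcdefg"));
    }
}
```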

Upvotes: 0

sampson-chen

Reputation: 47267

To "count the occurrence of any of the characters in a list in a file simultaneously":

You can use a hash table (in Java, for example, a HashMap) where the keys are the characters and the values are the number of times you've seen each character.
Each time you read a character, check whether it's in the table:
- If so, increment its value by 1
- If not, add the key-value pair to the table with the value initialized to 1

If the set of characters you care about is small (such as the "abcdefg" or "<, >, ==" in your example), a switch statement will suffice instead of a hash table.
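The switch variant might look like this for the "abcdefg" example; the class name SwitchCounter is made up for illustration, and only single characters are handled (a sequence such as == would need a look at the next character):

```java
// Hypothetical helper class, not part of any library.
class SwitchCounter {
    // Counts how many characters of the text are one of a..g.
    static int count(String text) {
        int hits = 0;
        for (char c : text.toCharArray()) {
            switch (c) {
                case 'a':
                case 'b':
                case 'c':
                case 'd':
                case 'e':
                case 'f':
                case 'g':
                    hits++;
                    break;
                default:
                    break; // not a character of interest
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(count("a bag of figs"));
    }
}
```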

Upvotes: 1

Sam I am says Reinstate Monica

Reputation: 31194

I believe that the Java List interface has a contains() method, so you can do something like

if (someList.contains('<'))
{
    x++;
}

It doesn't actually check for them all at once, but that detail is hidden from you anyway.

http://docs.oracle.com/javase/1.4.2/docs/api/java/util/List.html
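Applied to the question, one possible reading is to keep the characters of interest in a List and test each character of the input against it. The class name ContainsCounter is made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
class ContainsCounter {
    // Counts the characters of the text that are present in the list.
    static int count(String text, List<Character> wanted) {
        int x = 0;
        for (char c : text.toCharArray()) {
            if (wanted.contains(c)) { // c is boxed to Character for the lookup
                x++;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(count("a < b > c", Arrays.asList('<', '>')));
    }
}
```

List.contains scans the list on every call, which is fine for a handful of characters; a HashSet would be the usual choice for a larger set.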

Upvotes: 1
