Reputation: 1456
I have found several ways to count the occurrence of a single character in a file in Java. My question is simply this: is there any way to count the occurrence of any of the characters in a list in a file simultaneously, or am I going to have to loop through each character?
To clarify, I want something equivalent to: for each character in the file, if the character is in the list "abcdefg", increment a counter.
Background: I'm counting predicates in a file, and the best method I could think of was to search for occurrences of <, >, ==, etc.
Upvotes: 2
Views: 1442
Reputation: 3327
Use a Map<Character, Integer> and go through the file. For every character, test whether it is already in the map. If it is not, add it with value 1; otherwise get the current value, increment it, and put it back. Test both TreeMap and HashMap to see which works best for you. Now you have a complete histogram and can easily add up the counts you are interested in.
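A minimal sketch of that histogram idea, for illustration only. The class name CharHistogram and the helper sumOf are made up for this example, and a StringReader stands in for a reader on the real file:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class CharHistogram {
    // Counts every character read from the given reader.
    static Map<Character, Integer> histogram(Reader in) throws IOException {
        Map<Character, Integer> counts = new TreeMap<>();
        int c;
        while ((c = in.read()) != -1) {
            // merge() stores 1 for a new key, otherwise adds 1 to the stored count.
            counts.merge((char) c, 1, Integer::sum);
        }
        return counts;
    }

    // Adds up the counts of the characters of interest, e.g. "abcdefg".
    static int sumOf(Map<Character, Integer> counts, String wanted) {
        int total = 0;
        for (char ch : wanted.toCharArray()) {
            total += counts.getOrDefault(ch, 0);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use this would be a buffered reader on the file.
        Map<Character, Integer> counts = histogram(new StringReader("a < b && c > d"));
        System.out.println(sumOf(counts, "<>"));
    }
}
```

Swapping TreeMap for HashMap only changes the iteration order and the performance profile, so it is easy to try both.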
Update: I saw that you are interested in finding sequences. If you want to do that with good performance, I would use a tool like lex, but for Java. A quick Google search led me to this one: http://www.cs.princeton.edu/~appel/modern/java/JLex/ It should be straightforward to define the tokens you are interested in, and then very simple to count them.
Update 2: I couldn't resist playing with it. Here is a sample that seems to work using the above-mentioned tool (disclaimer: I haven't used the tool, so this could be completely wrong...):
import java.lang.System;
import java.util.Map;
import java.util.TreeMap;

class Sample {
    public static void main(String argv[]) throws java.io.IOException {
        Map<String,Integer> map = new TreeMap<>();
        Yylex yy = new Yylex(System.in);
        Yytoken t;
        while ((t = yy.yylex()) != null) {
            String text = t.mText;
            if (!text.isEmpty()) {
                Integer i = map.get(text);
                if (i == null) {
                    map.put(text, 1);
                }
                else {
                    map.put(text, map.get(text)+1);
                }
            }
        }
        System.out.println(map);
    }
}

class Yytoken {
    public String mText;

    Yytoken(String text) {
        mText = text;
    }

    public String toString() {
        return "Token: " + mText;
    }
}
%%
OTHER=(.|[\r\n])
%%
<YYINITIAL> "," { return (new Yytoken(yytext())); }
<YYINITIAL> ":" { return (new Yytoken(yytext())); }
<YYINITIAL> ";" { return (new Yytoken(yytext())); }
<YYINITIAL> "(" { return (new Yytoken(yytext())); }
<YYINITIAL> ")" { return (new Yytoken(yytext())); }
<YYINITIAL> "[" { return (new Yytoken(yytext())); }
<YYINITIAL> "]" { return (new Yytoken(yytext())); }
<YYINITIAL> "{" { return (new Yytoken(yytext())); }
<YYINITIAL> "}" { return (new Yytoken(yytext())); }
<YYINITIAL> "." { return (new Yytoken(yytext())); }
<YYINITIAL> "+" { return (new Yytoken(yytext())); }
<YYINITIAL> "-" { return (new Yytoken(yytext())); }
<YYINITIAL> "*" { return (new Yytoken(yytext())); }
<YYINITIAL> "/" { return (new Yytoken(yytext())); }
<YYINITIAL> "=" { return (new Yytoken(yytext())); }
<YYINITIAL> "<>" { return (new Yytoken(yytext())); }
<YYINITIAL> "<" { return (new Yytoken(yytext())); }
<YYINITIAL> "<=" { return (new Yytoken(yytext())); }
<YYINITIAL> ">" { return (new Yytoken(yytext())); }
<YYINITIAL> ">=" { return (new Yytoken(yytext())); }
<YYINITIAL> "&" { return (new Yytoken(yytext())); }
<YYINITIAL> "|" { return (new Yytoken(yytext())); }
<YYINITIAL> ":=" { return (new Yytoken(yytext())); }
<YYINITIAL> "#" { return (new Yytoken(yytext())); }
<YYINITIAL> {OTHER} { return (new Yytoken("")); }
Upvotes: 4
Reputation: 14247
If I understand correctly, you would like to find the number of occurrences not only of single characters, but also of short sequences of characters (i.e. Strings), such as ==. In that case, a Map<Character, Integer> is insufficient; you need a Map<String, Integer> to store a count for each string.
Alternatively, you can use Guava's Multiset, which is basically a nice interface for a collection that knows how many times it contains each element.
I assume the set of predicates, operators, or other short strings you want to count is fixed, so you can define an array or a list that stores all of the predicates you are interested in, such as:
List<String> operators = Arrays.asList("==", "<=", ">=", "<", ">");
Then you would "pour" all those operators as keys to the map and initialize their values to zero:
Map<String, Integer> counts = new HashMap<>();
for (String operator : operators) {
    counts.put(operator, 0);
}
As for the parsing, you can easily read the file line by line using a Scanner. For each line, you can use a method like this to count the number of times it contains a given substring:
static int occurrences(String source, String subString) {
    int count = 0;
    int index = source.indexOf(subString);
    while (index != -1) {
        count++;
        index = source.indexOf(subString, index + 1);
    }
    return count;
}
Then use this method along these lines:
// try-with-resources closes the Scanner (and the underlying file) when done
try (Scanner scanner = new Scanner(new File("input.txt"))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        for (String operator : operators) {
            int oldOccurrences = counts.get(operator);
            counts.put(operator, oldOccurrences + occurrences(line, operator));
        }
    }
}
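One thing to watch with the substring search above: "<" is also found inside "<=", so the counts for the short operators include the longer ones. A possible way around that, sketched here under the assumption that you want each position in the text attributed to at most one operator, is to scan once and try the longest operator first. The class name LongestMatchCounter is made up for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class LongestMatchCounter {
    static Map<String, Integer> count(String source, List<String> operators) {
        // Try longer operators first so "<=" wins over "<" at the same position.
        List<String> byLength = new ArrayList<>(operators);
        byLength.sort((a, b) -> Integer.compare(b.length(), a.length()));

        // Start every operator at zero, keeping the caller's order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String op : operators) {
            counts.put(op, 0);
        }

        int i = 0;
        while (i < source.length()) {
            boolean matched = false;
            for (String op : byLength) {
                if (!op.isEmpty() && source.startsWith(op, i)) {
                    counts.merge(op, 1, Integer::sum);
                    i += op.length(); // skip the whole operator
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                i++; // no operator starts here, move on one character
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> operators = Arrays.asList("==", "<=", ">=", "<", ">");
        System.out.println(count("if (a <= b && c < d || e == f)", operators));
    }
}
```

This still works line by line with the Scanner loop; you would add the per-line result into the running totals instead of calling occurrences once per operator.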
Upvotes: 2
Reputation: 80176
Since you want to count predicates that are more than one character long (==, !=, <=, >=), you would need a PushbackReader so that you can peek at the next character to determine the actual predicate.
If you can afford an additional dependency, my suggestion is to use Multiset, which was meant for counting frequencies. If you can't, you can use a Map or an array-based counter (I prefer the array if your predicate set is finite, as it simplifies the code).
The above approach is simpler because you get the frequencies in a single pass. If your file is huge, or you have to count frequencies across many files, you can parallelize the work using Java Executors.
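To make the peek idea concrete, here is a rough sketch using java.io.PushbackReader with a plain Map as the counter. The class name PredicateCounter and the choice of predicates (<, >, <=, >=, ==, !=) are assumptions for illustration; a lone = or ! is treated as assignment or negation and skipped:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class PredicateCounter {
    // Reads the source once and counts <, >, <=, >=, == and !=. Closes the reader when done.
    static Map<String, Integer> count(Reader source) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        // The default pushback buffer holds one character, enough for a one-character peek.
        try (PushbackReader in = new PushbackReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '<' && c != '>' && c != '=' && c != '!') {
                    continue; // cannot start a predicate we care about
                }
                int next = in.read(); // peek at the following character
                String token;
                if (next == '=') {
                    token = "" + (char) c + '='; // two-character predicate
                } else {
                    if (next != -1) {
                        in.unread(next); // give the peeked character back
                    }
                    if (c == '=' || c == '!') {
                        continue; // plain assignment or negation, not counted
                    }
                    token = String.valueOf((char) c);
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count(new StringReader("a <= b; c = d; e != f; g < h")));
    }
}
```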
Upvotes: 2
Reputation: 200168
A trivial way to do it is with an array:
final int[] occurs = new int[65536];
for (char c : file) occurs[c]++;
If you know you won't encounter exotic characters, you can reduce the size of the array.
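Since `file` in the loop above is only shorthand, here is one way the array idea might look when reading from a Reader. The class name ArrayCounter and the helper total are made up for this sketch:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, not part of any library.
class ArrayCounter {
    // Returns an array indexed by char value holding each character's count.
    static int[] countChars(Reader in) throws IOException {
        int[] occurs = new int[Character.MAX_VALUE + 1]; // 65536 slots, one per char value
        int c;
        while ((c = in.read()) != -1) {
            occurs[c]++; // read() returns 0..65535, so c is always a valid index
        }
        return occurs;
    }

    // Adds up the counts for the characters in the list of interest.
    static int total(int[] occurs, String wanted) {
        int sum = 0;
        for (char ch : wanted.toCharArray()) {
            sum += occurs[ch];
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a StringReader stands in for a reader on the real file.
        int[] occurs = countChars(new StringReader("a big cabbage"));
        System.out.println(total(occurs, "abcdefg"));
    }
}
```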
Upvotes: 0
Reputation: 47267
To "count the occurrence of any of the characters in a list in a file simultaneously":
If the set of characters you care about is small (such as the "abcdefg" or "<, >, ==" in your example), a switch statement will suffice instead of using a hash table to solve the problem.
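For the single-character case, the switch could look roughly like this. The class and method names are made up for the example, and the case labels are just the "abcdefg" list from the question:

```java
// Hypothetical helper class, not part of any library.
class SwitchCounter {
    // Counts how many characters of the text are in the fixed list a..g.
    static int countListed(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            switch (text.charAt(i)) {
                case 'a':
                case 'b':
                case 'c':
                case 'd':
                case 'e':
                case 'f':
                case 'g':
                    count++; // character is in the list
                    break;
                default:
                    break;   // anything else is ignored
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countListed("a big cabbage"));
    }
}
```

Multi-character tokens such as == would need an extra look at the next character, so this fits best when the list really is single characters.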
Upvotes: 1
Reputation: 31194
I believe the Java List interface has a contains() method, so you can do something like
if (someList.contains('<'))
{
    x++;
}
It doesn't actually check for them all at once, but that detail is hidden from you anyway.
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/List.html
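Applied to every character of the input, that could be sketched as follows. I am assuming a HashSet here instead of a List, since its contains() does not have to walk the whole collection; the class and method names are made up for the example:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class ContainsCounter {
    // Counts the characters of text that appear in the listed characters.
    static int countListed(String text, String listed) {
        Set<Character> wanted = new HashSet<>();
        for (char ch : listed.toCharArray()) {
            wanted.add(ch);
        }
        int x = 0;
        for (int i = 0; i < text.length(); i++) {
            if (wanted.contains(text.charAt(i))) {
                x++;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(countListed("if (a < b && c > d)", "<>"));
    }
}
```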
Upvotes: 1