Mandar Pande

Reputation: 12974

Regex error while parsing a .pdf file using CAM::PDF

Unmatched [ in regex; marked by <-- HERE in m/ <-- HERE / at ./pdf_parse.pl line 37.

I'm parsing a .pdf file word by word (in order to build a dictionary from it). Line 37 is:

if(grep(!/$word/,@line_rd)){
}

The word at which the parser script stops working is set in a different font inside the PDF I'm parsing. Is that the culprit here?
Does CAM::PDF recognize words in different fonts? What should I do to prevent this error?

Upvotes: 0

Views: 303

Answers (1)

Mat

Reputation: 206689

You need to quote $word in the regular expression if it can contain regex metacharacters (such as [ or even .); otherwise they are interpreted as part of the pattern rather than as literal text. Try with:

if (grep(!/\Q$word\E/, @line_rd)) {
  ...
}
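
To see the difference in isolation, here is a minimal sketch. The contents of @line_rd and $word are made-up sample data, not taken from the original script, and the grep uses a positive match instead of the negated one above to keep the result easy to read:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for what the PDF parser returns.
my @line_rd = ('alpha', '[beta', 'gamma.');
my $word    = '[beta';    # contains the regex metacharacter [

# Unquoted interpolation: the pattern fails to compile at run time,
# so wrap it in eval to trap the "Unmatched [" error.
my $ok = eval { my @m = grep { /$word/ } @line_rd; 1 };
print "unquoted pattern failed: $@" unless $ok;

# Quoted interpolation: \Q...\E escapes the metacharacters,
# so $word is matched as literal text.
my @hits = grep { /\Q$word\E/ } @line_rd;
print "matched: @hits\n";
```

The built-in quotemeta($word) does the same escaping if you would rather prepare the string before interpolating it.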

If you want to make a dictionary of all the words, use a hash:

my %allwords;
...
  # each time you have a new word incoming from the parser:
  $allwords{$word}++;

At the end, the %allwords hash will contain the distinct words as keys, and the word count as values. You could e.g. print it using:

for my $w (sort keys %allwords) {
  print "Word $w: count: ", $allwords{$w}, "\n";
}
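
Put together, the counting approach might look like the sketch below. The @words list is a hypothetical stand-in for the words coming out of the parser. Because a hash lookup compares keys as plain strings, words containing [ or . need no quoting here:

```perl
use strict;
use warnings;

# Hypothetical input; in the real script these come from the PDF parser.
my @words = qw(the cat [sat] on the mat the end.);

# Count every word; a new key is created the first time a word is seen.
my %allwords;
$allwords{$_}++ for @words;

# Print the distinct words in sorted order with their counts.
for my $w (sort keys %allwords) {
    print "Word $w: count: $allwords{$w}\n";
}
```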

Upvotes: 2
