Reputation: 39
I am trying to count every string that contains @twitter or @facebook in a PDF file with 1537 pages. I initialized a counter that should increment each time a page contains @twitter or @facebook, but the counter is just counting the number of pages instead of the number of emails that contain facebook or twitter. I am using Python 3 and importing pdftotext to read the file. Here is the code:
import pdftotext

count = 0
# 1 read the pdf
with open('Users.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

# loop thru pages
for page in pdf:
    if '@facebook' in page or '@twitter' in page:
        count += 1

print(count)
The output:
1537
which is the number of pages in the file.
Upvotes: 0
Views: 126
Reputation: 547
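As suggested by manny, you should use regex matching to achieve what you want to do. The in test is either true or false for a whole page, so your counter goes up by at most one per page; since every page contains at least one match, you end up with the page count. re.findall returns every occurrence on the page, so you can add up the lengths of its results: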
import pdftotext
import re

count = 0
# 1 read the pdf
with open('Users.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

# regex pattern
pattern = '@facebook|@twitter'

# loop thru pages
for page in pdf:
    count += len(re.findall(pattern, page))

print(count)
To check and try your regex pattern, I recommend Regex101.
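If you want a count of whole email addresses per domain rather than bare substring hits, something along these lines might work. It is only a sketch: the EMAIL_PATTERN regex, the count_addresses helper and the sample pages are my own assumptions for illustration, not anything taken from your PDF, and the pattern is deliberately loose rather than a full email validator.

```python
import re
from collections import Counter

# Hypothetical pattern: a local part, "@", then a domain that starts with
# facebook or twitter. Loose on purpose; adjust it to your data.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@(facebook|twitter)\.[\w.-]+", re.IGNORECASE)


def count_addresses(pages):
    """Count matching addresses per domain across an iterable of page strings."""
    totals = Counter()
    for page in pages:
        # finditer yields every match on the page, not just the first one
        for match in EMAIL_PATTERN.finditer(page):
            totals[match.group(1).lower()] += 1
    return totals


# Made-up stand-in for the pages you would get from pdftotext.PDF(f)
sample_pages = [
    "alice@facebook.com bob@twitter.com carol@gmail.com",
    "dave@twitter.com erin@Facebook.com",
]

totals = count_addresses(sample_pages)
print(totals["facebook"], totals["twitter"], sum(totals.values()))
```

With your file you would pass the pdf object in place of sample_pages, since iterating over it gives one string per page.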
Upvotes: 2