user14102083

Reputation: 39

Python Count each email on each page of PDF file

I am trying to count every string that contains @twitter or @facebook in a PDF file with 1537 pages. I initialized a counter that is incremented each time a page contains @twitter or @facebook, but the counter ends up counting the number of pages instead of the number of email addresses that contain facebook or twitter. I am using Python 3 and the pdftotext package to read the file. Here is the code:

import pdftotext
count = 0
# 1 read the pdf
with open('Users.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

# loop thru pages
for page in pdf:
    if '@facebook' in page or '@twitter' in page:
        count += 1


print(count)

The output:

1537

which is the number of pages in the file.

Upvotes: 0

Views: 126

Answers (1)

vidu.sh

Reputation: 547

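As suggested by manny, you should use regex matching to achieve what you want. The "in" test in your loop is true at most once per page, so the counter grows by one per page no matter how many addresses that page contains. re.findall returns every match on the page, so the length of its result is the per-page count.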

import pdftotext
import re

count = 0
# 1 read the pdf
with open('Users.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

# regex matching either marker; re.findall returns every occurrence on a page
pattern = r'@facebook|@twitter'

# loop thru pages
for page in pdf:
    count += len(re.findall(pattern, page))

print(count)

To check and try your regex pattern, I recommend Regex101.
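
To see the difference without opening the PDF, here is a minimal sketch that compares the two approaches on a few in-memory strings. The pages list and the addresses in it are made up for illustration; they only stand in for the page strings that pdftotext.PDF yields. The trailing \b in the pattern is my own addition, so that something like @twitterfan would not be counted.

```python
import re

# Hypothetical page texts standing in for the strings yielded by pdftotext.PDF.
pages = [
    "alice@facebook.com bob@twitter.com carol@gmail.com",
    "dave@twitter.com",
    "no addresses here",
]

# \b (an assumption, not in the code above) stops matches like "@twitterfan".
pattern = re.compile(r"@(?:facebook|twitter)\b")

# Membership test: adds at most 1 per page, so it counts pages, not addresses.
pages_with_match = sum(
    1 for page in pages if "@facebook" in page or "@twitter" in page
)

# findall returns one entry per occurrence, so len() is the per-page count.
per_page = [len(pattern.findall(page)) for page in pages]
total = sum(per_page)

print(pages_with_match)  # 2
print(per_page)          # [2, 1, 0]
print(total)             # 3
```

With the real file, replacing pages with the pdf object from the code above should give the total number of matches across all pages.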

Upvotes: 2
