Reputation: 101
I'm working on an exercise in the book Python for Informatics that asks me to write a program simulating the UNIX grep command. However, my code doesn't work. I have simplified it here so that it only counts how many lines start with the word 'From'. I'm quite confused and hope you can shed some light on it.
from urllib.request import urlopen
import re
fhand = urlopen('http://www.py4inf.com/code/mbox-short.txt')
sumFind = 0
for line in fhand:
    line = str(line) #convert from byte to string for re operation
    if re.search('^From',line) is not None:
        sumFind+=1
print(f'There are {sumFind} lines that match.')
The output of the script is
There are 0 lines that match.
The input text is the file at the URL used in the code above.
Thanks a lot for your time.
Upvotes: 1
Views: 431
Reputation: 1133
Your issue is that urlopen from the urllib module returns bytes instead of strings when reading the URL/text file.
One option is to use the requests module to download the file as a string and split it into lines:
import requests
txt = requests.get('http://www.py4inf.com/code/mbox-short.txt').text.split('\n')
for line in txt: ...
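As a rough sketch of what the loop body could look like once the lines are strings: the snippet below uses a small hard-coded sample in place of `requests.get(...).text` so that it runs without network access. The sample addresses and the variable name `text` are made up for illustration.

```python
import re

# Stand-in for requests.get(url).text, which is a str (nothing is fetched here).
text = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "Subject: hello\n"
    "From: alice@example.com\n"
)

sum_find = 0
for line in text.split('\n'):
    # The lines are str, so a str pattern works directly.
    if re.search('^From', line) is not None:
        sum_find += 1

print(f'There are {sum_find} lines that match.')
```

Note that splitting on '\n' leaves an empty string after the final newline; it simply never matches here.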
Upvotes: 0
Reputation: 140188
The mistake is converting bytes to a string with str, which returns the repr of the bytes object rather than its decoded text:
>>> str(b'foo')
"b'foo'"
You would have needed
line = line.decode()
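To see why the original check never matched, here is a small sketch comparing the two conversions on a single made-up line (the address is hypothetical, not taken from the real file):

```python
import re

# Hypothetical sample line, shaped like the bytes lines yielded by urlopen.
raw_line = b"From someone@example.com Sat Jan  5 09:14:16 2008\n"

as_repr = str(raw_line)      # the repr of the bytes object, which begins with b'
as_text = raw_line.decode()  # the decoded text (UTF-8 by default)

# '^From' cannot match the repr, because that string starts with "b'".
repr_matches = re.search('^From', as_repr) is not None
text_matches = re.search('^From', as_text) is not None

print(repr_matches, text_matches)
```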
But the best way is to pass a bytes pattern to re.search, which is supported:
for line in fhand:
    if re.search(b'^From',line) is not None:
        sumFind+=1
Now I get 54 matches.
Note that you could simplify the whole loop to:
sum_find = sum(bool(re.match(b'From',line)) for line in fhand)
re.match replaces the need to use ^ with search.
sum counts the times where re.match returns a truthy value (explicitly converted to bool so it can sum 0 or 1).
Or even simpler, without regex:
sum_find = sum(line.startswith(b"From") for line in fhand)
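Since the exercise is about simulating grep, the counting could also be wrapped in a small helper that takes any bytes pattern. This is only a sketch: `count_matches` is a hypothetical name, and `io.BytesIO` with made-up lines stands in for the object returned by urlopen so the example runs offline.

```python
import io
import re

def count_matches(lines, pattern):
    """Count the lines in which the bytes regex `pattern` matches anywhere."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))

# Stand-in for the urlopen response: iterating over it yields bytes lines.
sample = io.BytesIO(
    b"From alice@example.com Sat Jan  5 09:14:16 2008\n"
    b"Return-Path: <postmaster@example.com>\n"
    b"From: alice@example.com\n"
    b"Subject: not From here\n"
)

matches = count_matches(sample, rb"^From")
print(f"There are {matches} lines that match.")
```

With the real file, the same call would be `count_matches(fhand, rb"^From")`.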
Upvotes: 6