Reputation: 292
I am trying to read a file and match the words in it that have more than 6 characters, but I keep getting this error:
Traceback (most recent call last):
  File "dummy.py", line 9, in <module>
    matches = re.findall("\w{6,}", f.read().split())
  File "/usr/lib/python2.7/re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
I can't figure out why I am getting this error. The code is pasted below:
import re
with open('test.txt', 'r') as f:
    matches = re.findall("\w{6,}", f.read().split())
nr_long_words = len(matches)
print (matches)
Upvotes: 0
Views: 1316
Reputation: 82899
f.read().split() gives a list of strings, but re.findall expects a single string, hence the TypeError: expected string or buffer. You could apply the regex to each of the substrings in a loop or list comprehension, but you do not need to split() at all:
matches = re.findall(r"\w{6,}", f.read())
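For comparison, here is a minimal sketch of the per-substring alternative mentioned above, next to the single-call version. It is written for Python 3; the function names and the sample sentence are made up for illustration and are not from the question.

```python
import re

# Compile once so the same pattern is reused for every token.
LONG_WORD = re.compile(r"\w{6,}")

def long_words_per_token(text):
    """Split on whitespace, then apply the regex to each token separately."""
    return [m for token in text.split() for m in LONG_WORD.findall(token)]

def long_words_whole_text(text):
    """Apply the regex once to the whole string, without splitting."""
    return LONG_WORD.findall(text)

# Hypothetical sample input standing in for the file contents.
sample = "Regular expressions match strings, not lists of words"

print(long_words_per_token(sample))   # ['Regular', 'expressions', 'strings']
print(long_words_whole_text(sample))  # ['Regular', 'expressions', 'strings']
```

Both give the same result here because \w never matches whitespace, so a match cannot span two tokens; the split() step only adds work.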
Note that if the file is very large, then f.read() might not be a good idea (but for text files it's probably not an issue, as those are rarely larger than a few megabytes, if at all). In this case, you could read the file line by line and sum up the long words per line:
nr_long_words = sum(len(re.findall(r"\w{6,}", line)) for line in f)
Also, as noted in comments, \w{6,} might not be the best regex for "long words" to start with. \w will, e.g., also match digits or the underscore _. If you want to match exclusively (ASCII) letters, better use [A-Za-z], but this might cause problems with non-ASCII letters, such as umlauts, accented characters, Arabic, etc. Also, you might want to include the word boundary \b to make sure that the six letters are not part of a longer sequence that also contains digits or underscores, i.e. use a regex like r'\b[A-Za-z]{6,}\b'.
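To make the differences between these patterns concrete, the following sketch runs each of them over one made-up test string in Python 3, where str patterns are Unicode-aware by default. The last pattern, [^\W\d_], is an addition not mentioned above: it is a common idiom for "word characters that are neither digits nor underscore", which roughly means letters in any script, so treat it as one possible option rather than a definitive answer.

```python
import re

# Hypothetical test string: underscores, digits, a letter/digit mix,
# a word with a non-ASCII letter, and a plain ASCII word.
text = "foo_bar_baz 1234567 abcdef123 Straße wonderful"

patterns = {
    "word_chars": r"\w{6,}",
    "ascii_letters": r"[A-Za-z]{6,}",
    "ascii_bounded": r"\b[A-Za-z]{6,}\b",
    "unicode_bounded": r"\b[^\W\d_]{6,}\b",  # assumption: approximates "letters"
}

results = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, found in results.items():
    print(name, found)
# word_chars ['foo_bar_baz', '1234567', 'abcdef123', 'Straße', 'wonderful']
# ascii_letters ['abcdef', 'wonderful']
# ascii_bounded ['wonderful']
# unicode_bounded ['Straße', 'wonderful']
```

Without \b, the ASCII pattern picks "abcdef" out of "abcdef123"; with \b it is rejected because there is no word boundary between "f" and "1".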
Upvotes: 1
Reputation: 8978
Try:
import re

nr_long_words = 0
with open('input.txt', 'r') as f:
    for line in f:
        # Count the words of six or more characters on this line.
        matches = re.findall(r"\w{6,}", line)
        nr_long_words += len(matches)
print(nr_long_words)
It should print the number of words in the file that have six or more characters.
Upvotes: 0