Python: trying to match and count words with regex (expected string or buffer)

Question

I am trying to read a file and match the words in it that have over 6 characters, but I keep getting this error:

Traceback (most recent call last):
  File "dummy.py", line 9, in <module>
    matches = re.findall("\w{6,}", f.read().split())
  File "/usr/lib/python2.7/re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

I can't figure out why I am getting this error. The code is pasted below:

import re

with open('test.txt', 'r') as f:
    matches = re.findall("\w{6,}", f.read().split())
    nr_long_words = len(matches)
    print (matches)

tobias_k · Accepted Answer

f.read().split() gives a list of strings, but re.findall expects a single string, thus the TypeError: expected string or buffer. You could apply the regex to each of the substrings in a loop or list comprehension, but you do not need to split() at all:

matches = re.findall(r"\w{6,}", f.read())
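
To see the difference without needing a file, here is a minimal sketch. The `text` string is a made-up stand-in for the contents of test.txt, and `per_token` is just an illustrative name for the loop-based alternative mentioned above:

```python
import re

# Hypothetical sample text standing in for the result of f.read().
text = "The quick brown elephant wandered through gardens"

# Passing the whole string works: findall scans it in one pass.
matches = re.findall(r"\w{6,}", text)

# Alternative that keeps split(): apply the regex to each token separately.
per_token = [m for word in text.split() for m in re.findall(r"\w{6,}", word)]

print(matches)       # ['elephant', 'wandered', 'through', 'gardens']
print(len(matches))  # 4
```

Both variants give the same result here, so the split() only adds work.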

Note that if the file is very large, then f.read() might not be a good idea (but for text files it's probably not an issue, as those are rarely larger than a few megabytes, if at all). In that case, you could read the file line by line and sum up the long words per line:

nr_long_words = sum(len(re.findall(r"\w{6,}", line)) for line in f)
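
As a rough, self-contained illustration of the per-line approach, the sketch below uses a plain list of lines in place of the open file (an assumption for the sake of a runnable example; any iterable of lines, including a file object, should behave the same way). Compiling the pattern once as `LONG_WORD` is optional and only an illustrative choice:

```python
import re

# Hypothetical lines standing in for iterating over the open file f.
lines = [
    "short words only\n",
    "elephant wandered\n",
    "through the gardens\n",
]

# Compile once, then reuse the pattern for every line.
LONG_WORD = re.compile(r"\w{6,}")

# Only one line is held in memory at a time when this runs over a real file.
nr_long_words = sum(len(LONG_WORD.findall(line)) for line in lines)

print(nr_long_words)  # 4
```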

Also, as noted in comments, \w{6,} might not be the best regex for "long words" to start with. \w also matches digits and the underscore _. If you want to match only ASCII letters, use [A-Za-z] instead, but this will miss non-ASCII letters, such as umlauts, accented characters, Arabic, etc. You might also want to add word boundaries, i.e. \b, to make sure that the letters are not part of a longer sequence of word characters (for example, letters directly followed by digits), i.e. use a regex like r'\b[A-Za-z]{6,}\b'
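
The three patterns can be compared side by side on a small, invented sample string (the `sample` text and the variable names are only for illustration):

```python
import re

# Made-up input mixing identifiers, digits, and ordinary words.
sample = "id_1234567 elephant 12345678 gardens2 cat"

# \w also accepts digits and underscores, so non-words slip through.
word_chars = re.findall(r"\w{6,}", sample)

# Letters only, but "gardens" is still picked out of "gardens2".
letters = re.findall(r"[A-Za-z]{6,}", sample)

# Letters only, and the run must not touch other word characters.
bounded = re.findall(r"\b[A-Za-z]{6,}\b", sample)

print(word_chars)  # ['id_1234567', 'elephant', '12345678', 'gardens2']
print(letters)     # ['elephant', 'gardens']
print(bounded)     # ['elephant']
```

Which of these is right depends on what should count as a "word" in your file.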
