Jon

Reputation: 73

Attempting to extract numbers from text data, but re.findall() doesn't find anything

My objective is to write a program that uses a regex to read through a text file and pull out the numbers (as strings, which I then convert to integers), but I'm clearly missing some crucial element. Here's what I have so far:

import re

#read the file
name = input('Input file name:')
handle = open(name)

#look for integers using re.findall() / '[0-9]+'
y = re.findall('[0-9]+',handle)
print(y)

and it returns

Traceback (most recent call last):
  File "regexnumbers.py", line 8, in <module>
    y = re.findall('[0-9]+',handle)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

which to be honest doesn't make much sense to me as a beginner dev!

Upvotes: 1

Views: 211

Answers (2)

heemayl

Reputation: 42107

You're supposed to pass a string or buffer to re.findall, but you're passing a file object (handle), hence the error.

You can read the whole file at once by calling the read() method on the file object:

re.findall('[0-9]+',handle.read())
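To make the difference concrete, here is a minimal sketch that tries both calls and then does the integer conversion the question asks for. It assumes Python 3, and it uses io.StringIO with made-up sample text as a stand-in for the opened file so that it runs without a real file on disk.

```python
import io
import re

# io.StringIO is a stand-in for the opened file (an assumption for this sketch);
# the sample text is made up.
handle = io.StringIO("3 apples and 12 pears\nroom 404\n")

try:
    re.findall('[0-9]+', handle)  # passing the file-like object itself
except TypeError as exc:
    print('TypeError:', exc)

found = re.findall('[0-9]+', handle.read())  # passing its contents as a string
numbers = [int(s) for s in found]            # convert each match to an integer
print(found)
print(numbers)
```

The first call fails for the same reason as in the question; the second succeeds because read() returns a plain string.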

But if your file is large, a better approach would be to read the file line by line (the file object is an iterator over its lines) and use a generator expression (or list comprehension) to collect the results:

matches = (re.findall('[0-9]+', line) for line in handle)

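This gives you one list of matches per line. You can then flatten them into a single iterator using itertools.chain (this needs import itertools); either of the following works: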

itertools.chain.from_iterable(matches)
itertools.chain(*matches)

Calling list on it gets you the result as a list:

list(itertools.chain.from_iterable(matches))

If you need simple iteration over the results, no need to convert into a list.
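As a rough illustration of iterating without building a list, the sketch below sums the numbers as they are found. It assumes Python 3; io.StringIO and the sample text are stand-ins for a real open file, and total is a name introduced only for this example.

```python
import io
import itertools
import re

# io.StringIO stands in for an open text file (an assumption for this sketch);
# iterating over it yields one line at a time, as a file object does.
handle = io.StringIO("3 apples and 12 pears\nno digits here\nroom 404\n")

# One list of matches per line, produced lazily as the loop below advances.
matches = (re.findall('[0-9]+', line) for line in handle)

total = 0
for s in itertools.chain.from_iterable(matches):
    total += int(s)  # each s is a string of digits
print(total)
```

Keep in mind that a generator like matches can only be consumed once; to go over the results a second time, you would need to build a list or re-read the file.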

Now, after the operation you need to close the file object to make sure the file descriptor it refers to is closed properly and the resources are released:

handle.close()

But a better, more idiomatic way would be to use a context manager that does the job of closing automatically for you:

import itertools
import re
with open('file.txt') as handle:
    matches = list(itertools.chain.from_iterable(re.findall('[0-9]+', line) for line in handle))
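Putting the pieces together, a small helper could look like the sketch below. The function name numbers_in_file and the temporary sample file are hypothetical, introduced only so the example runs on its own; Python 3 is assumed.

```python
import itertools
import os
import re
import tempfile


def numbers_in_file(path):
    """Return every run of digits in the file at `path` as a list of ints."""
    with open(path) as handle:
        per_line = (re.findall('[0-9]+', line) for line in handle)
        # The list is built before the with block exits, so the file is
        # still open while the generator is being consumed.
        return [int(s) for s in itertools.chain.from_iterable(per_line)]


# Hypothetical sample file, created only for demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write("3 apples and 12 pears\nno digits here\nroom 404\n")

result = numbers_in_file(path)
print(result)
```

Note that the generator is consumed inside the with block; returning it unconsumed would fail later, because the file would already be closed.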

Upvotes: 1

Ventil

Reputation: 61

There are multiple things you're not doing the best way, but essentially it boils down to this:

For reading a file, the best construct would be:

with open('file') as filehandler:
    file_contents = filehandler.readlines()

This will read the contents of the file into a list (an array, if you will) with one item per line and, importantly, close the file as soon as the with block ends. Then you can iterate through the lines and apply the regex to each one.

The problem with your code is that you are passing an object (the file handle itself) to the re module, which expects a string. Hence the TypeError.
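A minimal sketch of that approach, assuming Python 3 and a made-up sample file written to a temporary directory so the example is self-contained (the pattern '[0-9]+' is the one from the question):

```python
import os
import re
import tempfile

# Hypothetical sample file, created only for demonstration.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with open(path, 'w') as f:
    f.write("7 days\n21 guns\nparty like it's 1999\n")

with open(path) as filehandler:
    file_contents = filehandler.readlines()  # list with one string per line

# The file is already closed here; the lines live on in file_contents.
numbers = []
for line in file_contents:
    numbers.extend(int(s) for s in re.findall('[0-9]+', line))
print(numbers)
```

Since readlines() loads every line into memory at once, this is simplest for small files; for very large ones, iterating over the file object directly avoids holding the whole file in a list.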

Upvotes: 0
