Marcel

Reputation: 269

Efficient way of reading integers from a file

I'd like to read all integers from a file into a single list. The numbers are separated by one or more spaces or by one or more newline characters. What is the most efficient and/or elegant way of doing this? I have two solutions, but I don't know whether they are any good.

  1. Checking for digits:

    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    
  2. Dealing with exceptions:

    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.split():  # iterate over tokens, not single characters
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    

Sample data:

1   2     3
 4 56
    789         
9          91 56   

 10 
11 

Upvotes: 11

Views: 5469

Answers (8)

Dunes

Reputation: 40763

This was the fastest way I found:

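import re
regex = re.compile(r"\D+")  # one or more non-digit characters

with open("foo.txt", "r") as f:
    # split() yields '' when the text starts or ends with a non-digit
    # (e.g. a trailing newline), so drop empty strings before int()
    my_list = [int(s) for s in regex.split(f.read()) if s]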

Though the results could depend on the size of the file.
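
One caveat with splitting on non-digits: when the text starts or ends with a non-digit character (a trailing newline, for instance), split() returns an empty string, and int('') raises ValueError. A hedged alternative sketch, assuming the goal is simply to extract every run of digits, uses re.findall instead. The helper name extract_ints and the sample string are made up for illustration.

```python
import re

DIGITS = re.compile(r"\d+")  # one or more consecutive digits


def extract_ints(text):
    """Return every run of digits found in text, converted to int."""
    return [int(tok) for tok in DIGITS.findall(text)]


sample = "1   2     3\n 4 56\n    789\n"

split_tokens = re.split(r"\D+", sample)  # ends with '' because of the final '\n'
numbers = extract_ints(sample)

print(split_tokens)  # ['1', '2', '3', '4', '56', '789', '']
print(numbers)       # [1, 2, 3, 4, 56, 789]
```

Note that this pulls digits out of mixed tokens too (for example "a1b" gives 1) and ignores minus signs, so it is not a drop-in replacement for the isdigit() filters in the other answers.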

Upvotes: 0

Anand S Kumar

Reputation: 90979

An efficient approach is your first method with one small change: open the file with a with statement. Example -

with open("foo.txt", "r") as f:
    for line in f:
        for i in line.split():
            if i.isdigit():
                my_list.append(int(i))

Timing tests done with comparisons to other methods -

The functions -

import csv  # needed by func3()

def func1():
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    return my_list

def func1_1():
    return [int(i) for line in open("foo.txt", "r") for i in line.strip().split(' ') if i.isdigit()]

def func1_3():
    my_list = []
    with open("foo.txt", "r") as f:
        for line in f:
            for i in line.split():
                if i.isdigit():
                    my_list.append(int(i))
    return my_list

def func2():            
    my_list = []            
    for line in open("foo.txt", "r"):
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    return my_list

def func3():
    my_list = []
    with open("foo.txt","r") as f:
        cf = csv.reader(f, delimiter=' ')
        for row in cf:
            my_list.extend([int(i) for i in row if i.isdigit()])
    return my_list

Results of timing tests -

In [25]: timeit func1()
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 204 µs per loop

In [26]: timeit func1_1()
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 207 µs per loop

In [27]: timeit func1_3()
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 191 µs per loop

In [28]: timeit func2()
The slowest run took 4.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 µs per loop

In [34]: timeit func3()
The slowest run took 4.38 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 202 µs per loop

Among the methods that store the data in a list, I believe func1_3() above is the fastest (as shown by the timeit results), although the differences are small.


That said, if you are handling very large files, you may be better off using a generator rather than storing the complete list in memory.
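
A minimal sketch of that generator idea, assuming the same whitespace-separated format as in the question. The function name iter_ints, its path parameter, and the throwaway demo file are illustrative assumptions, not part of the timing tests above.

```python
import os
import tempfile


def iter_ints(path):
    """Lazily yield each all-digit token in the file at path as an int."""
    with open(path, "r") as f:
        for line in f:
            for tok in line.split():
                if tok.isdigit():
                    yield int(tok)


# Demo on a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1   2     3\n 4 56\nabc 789\n")
    demo_path = tmp.name

try:
    # sum() consumes the numbers one at a time; no full list is built.
    total = sum(iter_ints(demo_path))
finally:
    os.remove(demo_path)

print(total)  # 855
```

If a list is needed after all, list(iter_ints(path)) builds one from the same generator.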


UPDATE: Since the comments said that func2() is faster than func1_3() (though on my system it was never faster than func1_3(), even with a file containing only integers), I updated foo.txt to contain things other than numbers and reran the timing tests -

foo.txt

1 2 10 11
asd dd
 dds asda
22 44 32 11   23
dd dsa dds
21 12
12
33
45
dds
asdas
dasdasd dasd das d asda sda

Test -

In [13]: %timeit func1_3()
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 210 µs per loop

In [14]: %timeit func2()
1000 loops, best of 3: 279 µs per loop

In [15]: %timeit func1_3()
1000 loops, best of 3: 213 µs per loop

In [16]: %timeit func2()
1000 loops, best of 3: 273 µs per loop
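
For readers without IPython, a rough way to reproduce this kind of comparison with the standard timeit module is sketched below. The function names, the generated file contents, and the repetition count are assumptions for illustration; absolute numbers will differ by machine and by file.

```python
import os
import tempfile
import timeit


def read_isdigit(path):
    """Filter tokens with str.isdigit() before converting."""
    with open(path) as f:
        return [int(t) for line in f for t in line.split() if t.isdigit()]


def read_try(path):
    """Convert every token and skip the ones int() rejects."""
    out = []
    with open(path) as f:
        for line in f:
            for t in line.split():
                try:
                    out.append(int(t))
                except ValueError:
                    pass
    return out


# Build a mixed file of numbers and words, similar in spirit to foo.txt above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2 10 11\nasd dd\n22 44 32 11   23\ndds asda\n" * 50)
    path = tmp.name

try:
    same = read_isdigit(path) == read_try(path)  # sanity check before timing
    runs = 200
    for fn in (read_isdigit, read_try):
        secs = timeit.timeit(lambda: fn(path), number=runs)
        print(f"{fn.__name__}: {secs / runs * 1e6:.1f} us per call")
finally:
    os.remove(path)
```

Checking that both functions return the same list first guards against comparing the speed of two functions that do different work.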

Upvotes: 7

SuperBiasedMan

Reputation: 9969

It's pretty easy if you can read the whole file as a string (i.e. it's not too large to do that).

fileStr = open('foo.txt').read().split() 
integers = [int(x) for x in fileStr if x.isdigit()]

read() returns the file contents as one long string, and split() breaks that into a list of strings on whitespace (i.e. spaces and newlines). You can then combine that with a list comprehension that converts the strings to integers if they consist of digits.

As Bakuriu noted, if the file is guaranteed to only have whitespace and numbers, then you don't have to check for isdigit(). Using list(map(int, open('foo.txt').read().split())) would be enough in that case. That method will raise errors if anything is an invalid integer whereas the other will skip anything that isn't a recognised digit.
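
A small illustration of how the two approaches differ on tokens that are not plain digit strings. The token list and the helper name parse_or_skip are made up for the example. The main practical difference is that isdigit() rejects signed numbers, which int() accepts.

```python
tokens = ["42", "-7", "+3", "3.5", "abc", "007"]

# Approach 1: keep only tokens made entirely of digits.
# Signed numbers such as "-7" fail isdigit() and are silently dropped.
by_isdigit = [int(t) for t in tokens if t.isdigit()]


def parse_or_skip(tok):
    """Return int(tok), or None when tok is not a valid integer literal."""
    try:
        return int(tok)
    except ValueError:
        return None


# Approach 2: let int() decide, skipping whatever it rejects.
by_int = [n for n in map(parse_or_skip, tokens) if n is not None]

print(by_isdigit)  # [42, 7]
print(by_int)      # [42, -7, 3, 7]
```

Calling list(map(int, tokens)) directly on this list would stop with a ValueError at "3.5", which is the raising behaviour described above.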

Upvotes: 5

Totem

Reputation: 7369

Try this:

with open('file.txt') as f:
    nums = []
    for l in f:
        l = l.strip()
        nums.extend([int(i) for i in l.split() if i.isdigit() and l])

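l.strip() removes the trailing newline ('\n'); this matters because '6\n'.isdigit() returns False. (With the default l.split(), which already discards all whitespace, the strip is not strictly required.)

list.extend comes in handy here.

The and l at the end skips empty lines, although split() already returns an empty list for those.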

str.split splits on whitespace by default. And the with block will automatically close the file after the code within is executed. I've also made use of list comprehensions

Upvotes: 3

Marcel

Reputation: 269

Thank you all. I've mixed some solutions you posted. This seems very good to me:

with open("foo.txt","r") as f:
    my_list = [int(i)  for line in f for i in line.split() if i.isdigit()]

Upvotes: 4

The6thSense

Reputation: 8335

You could do it like this, using a list comprehension:

my_list = [int(i)  for j in open("1.txt","r") for i in j.strip().split(" ") if i.isdigit()]

Or using a with statement:

with open("1.txt","r") as f:
    my_list = [int(i)  for j in f for i in j.strip().split(" ") if i.isdigit()]

Process:

1. First, iterate over the lines of the file.

2. Then iterate over the words in each line and check whether each one is a digit string; if so, add it to the list.

edit:

You need to call strip() on each line because every line (except possibly the last) ends with a newline character ("\n"), and calling isdigit() on a string such as "number\n" returns False.

For example:

>>> "1\n".isdigit()
False

edit2:

Input:

1
qw 2
23 we 32

File data when read:

a=open("1.txt","r")

repr(a.read())
"'1\\nqw 2\\n23 we 32'"

You can see the "\n" newline characters here; they affect the process.

When I run the comprehension without strip(), it does not treat 1 and 2 as digits, because those tokens still contain the newline character:

my_list = [int(i)  for j in open("1.txt","r") for i in j.split(" ") if i.isdigit()]
my_list
[23, 32]

From the output it is clear that 1 and 2 are missing. This can be avoided by using strip().
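
The behaviour comes from how str.split treats its argument, which a short sketch can show without any file. The variable names are illustrative only.

```python
line = "1   2\n"

# Splitting on a single space keeps the newline and produces empty strings
# for each extra space between the numbers.
on_space = line.split(" ")

# Stripping first removes the trailing newline, but the empty strings remain.
stripped_on_space = line.strip().split(" ")

# split() with no argument splits on any run of whitespace, newlines included.
on_whitespace = line.split()

print(on_space)           # ['1', '', '', '2\n']
print(stripped_on_space)  # ['1', '', '', '2']
print(on_whitespace)      # ['1', '2']
```

The empty strings are harmless in the comprehensions above because "".isdigit() is False, and with split() and no argument the strip() call is not needed at all.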

Upvotes: 3

simleo

Reputation: 2975

my_list = []
with open('foo.txt') as f:
    for line in f:
        for s in line.split():
            try:
                my_list.append(int(s))
            except ValueError:
                pass

Upvotes: 3

allencharp

Reputation: 1161

Why not use the yield keyword? The code would look like this:

def readInt():
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                yield int(i)

Then you can read the numbers like this:

    my_list = []
    for num in readInt():
        my_list.append(num)

Upvotes: 3
