HXSP1947

Reputation: 1351

Using numpy to load only lines fitting certain criteria

Thanks in advance for the help.

I'm relatively new to Python and am trying to write a script that loads only part of each of 1000 CSV files. Each file has this format:

x,y
1,2
2,4
2,2
3,9
...

I would like to load only the lines where, for example, x=2. I've seen a lot of posts here about picking lines by position (i.e. lines 1, 2, 3), but not about picking lines that fit certain criteria. One solution would be to open each file individually and iterate through it, loading lines as I go. However, I would imagine there is a much better way of doing this (efficiency is somewhat of a concern, as these files are not small).

One point that might speed things up is that the x column is sorted, i.e. once I see a value x = a, I will never see an x value less than a as I iterate through the remaining lines.

Is there a more efficient way of doing this rather than going through each file line by line?

Edit: One approach that I have taken is

numpy.fromregex(file, r'^' + re.compile(str(mynum)) + r'\,\-\d$', dtype='f');

where mynum is the number I want, but this is not working.

Upvotes: 1

Views: 47

Answers (2)

bddap

Reputation: 584

The csv module ships with Python and lets you read a file partially: because x is sorted, you can stop iterating as soon as you have passed the rows you need.

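import csv

def partial_load(filename, target=2):
    ds = []
    # newline='' is the documented way to open files for the csv module
    with open(filename, newline='') as f:
        c = csv.reader(f)
        legend = next(c)  # skip the header row
        for row in c:
            if len(row) == 0:
                continue  # skip blank lines
            row = [float(r) for r in row]
            if row[0] > target:
                break  # x is sorted, so no later row can match
            if row[0] == target:
                ds.append(row)
    return ds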
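
Since the question asks for numpy arrays, here is a minimal sketch of the same early-exit idea that returns an array directly. The names `load_matching`, `target` and `column` are illustrative and not from the question; the sketch assumes a one-line header, numeric cells, and a filter column sorted in ascending order.

```python
import csv

import numpy as np


def load_matching(filename, target, column=0):
    """Return the rows whose `column` value equals `target` as a float array.

    Assumes a one-line header and that `column` is sorted ascending.
    """
    matches = []
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if any
        for row in reader:
            if not row:
                continue  # ignore blank lines
            value = float(row[column])
            if value > target:
                break  # sorted input: no later row can match
            if value == target:
                matches.append([float(cell) for cell in row])
    return np.array(matches, dtype=float)
```

To cover many files, one option is to call this in a loop and combine the non-empty results with `numpy.vstack`.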

Upvotes: 0

vsminkov

Reputation: 11250

Try the pandas library. It interoperates with numpy and is far more flexible. With it you can do the following:

import numpy
import pandas

data = pandas.read_csv('file.csv')
# keep only rows where x equals 2
data = data[data['x'] == 2]
# convert to a numpy array
arr = numpy.asarray(data)

You can read more about selecting data in the pandas documentation.
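
Because the files are large and x is sorted, it may be worth reading in chunks and stopping early instead of loading each file whole. The following is only a sketch under those assumptions: `load_matching_chunked` and its parameters are hypothetical names, and the default `chunksize` is an arbitrary value that would need tuning.

```python
import pandas as pd


def load_matching_chunked(filename, target, column="x", chunksize=100_000):
    """Return rows where `column` == `target` as a numpy array.

    Assumes `column` is sorted ascending, so reading can stop once a
    chunk ends with a value greater than `target`.
    """
    pieces = []
    # The reader returned with chunksize is a context manager in pandas 1.2+.
    with pd.read_csv(filename, chunksize=chunksize) as reader:
        for chunk in reader:
            pieces.append(chunk[chunk[column] == target])
            if chunk[column].iloc[-1] > target:
                break  # later chunks hold only larger values
    if not pieces:
        return pd.DataFrame().to_numpy()  # file had no data rows
    return pd.concat(pieces, ignore_index=True).to_numpy()
```

This keeps at most one chunk plus the matching rows in memory, at the cost of a little extra code compared with a single `read_csv` call.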

Upvotes: 1
