Reputation: 1351
Thanks in advance for the help.
I'm relatively new to Python and am trying to write a script that loads only part of each of 1000 CSV files. For example, I have 1000 files with this format:
x,y
1,2
2,4
2,2
3,9
...
I would like to load only the lines where, for example, x=2. I've seen a lot of posts on here about picking certain lines (i.e. lines 1, 2, 3), but not about picking lines that fit certain criteria. One solution would be to simply open each file individually and iterate through it, loading matching lines as I go. However, I would imagine there is a much better way of doing this (efficiency is somewhat of a concern, as these files are not small).
One point that might speed things up is that the x column is sorted, i.e. once I see a value x = a, I will never see an x value less than a as I continue iterating through the lines from the beginning.
Is there a more efficient way of doing this rather than going through each file line by line?
Edit: One approach that I have taken is
numpy.fromregex(file, r'^' + re.compile(str(mynum)) + r'\,\-\d$', dtype='f');
where mynum is the number I want, but this is not working
Upvotes: 1
Views: 47
Reputation: 584
The csv module comes with Python, and because csv.reader reads the file lazily you can stop partway through.
import csv

def partial_load(filename, target=2):
    ds = []
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)            # skip the "x,y" header row
        for row in reader:
            if not row:                  # skip blank lines
                continue
            row = [float(r) for r in row]
            if row[0] > target:          # x is sorted, so nothing after this can match
                break
            if row[0] == target:         # keep only the rows where x == target
                ds.append(row)
    return ds
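To run this over all 1000 files, a minimal sketch could look like the following (the 'data*.csv' glob pattern is my assumption; substitute however your files are actually named):
import glob

all_rows = []
for path in glob.glob('data*.csv'):      # hypothetical pattern for the 1000 files
    all_rows.extend(partial_load(path))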
Upvotes: 0
Reputation: 11250
Try the pandas library. It interoperates well with numpy and is far more flexible. With it you can do the following:
import numpy
import pandas

data = pandas.read_csv('file.csv')
# keep only the rows where x equals 2
data = data[data['x'] == 2]
# convert to a numpy array
arr = numpy.asarray(data)
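Since you have 1000 files, a minimal sketch of extending this (again assuming a 'data*.csv' glob pattern for the file names) is to filter each file as it is read and concatenate the results:
import glob
import numpy
import pandas

frames = []
for path in glob.glob('data*.csv'):      # hypothetical pattern for the 1000 files
    df = pandas.read_csv(path)
    frames.append(df[df['x'] == 2])      # keep only the rows where x == 2

data = pandas.concat(frames, ignore_index=True)
arr = numpy.asarray(data)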
You can read more about selecting data in the pandas documentation on indexing.
Upvotes: 1