Reputation: 1351
Thanks in advance for the help.
I'm relatively new to Python and am trying to write a script that loads only part of each of 1000 CSV files. For example, I have 1000 files with this format:
x,y
1,2
2,4
2,2
3,9
...
I would like to load only the lines where, for example, x=2. I've seen a lot of posts on here about picking certain lines (i.e. lines 1, 2, 3), but not about picking lines that fit certain criteria. One solution would be to simply open each file individually and iterate through it, loading matching lines as I go. However, I would imagine there is a much better way of doing this (efficiency is somewhat of a concern, as these files are not small).
One point that might speed things up is that the x column is sorted, i.e. once I see a value x = a, I will never see an x value less than a as I continue iterating through the lines from the beginning.
Is there a more efficient way of doing this rather than going through each file line by line?
Edit: One approach that I have taken is
numpy.fromregex(file, r'^' + re.compile(str(mynum)) + r'\,\-\d$', dtype='f');
where mynum is the number I want, but this is not working
Upvotes: 1
Views: 47
Reputation: 584
The csv module comes with Python, and because csv.reader reads the file lazily you can stop partway through.
import csv

def partial_load(filename, target=2):
    ds = []
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)            # skip the "x,y" header row
        for row in reader:
            if not row:                  # skip blank lines
                continue
            row = [float(r) for r in row]
            if row[0] > target:          # x is sorted, so nothing after this can match
                break
            if row[0] == target:         # keep only the rows where x == target
                ds.append(row)
    return ds
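To run this over all 1000 files, a minimal sketch could look like the following (the 'data*.csv' glob pattern is my assumption; substitute however your files are actually named):
import glob

all_rows = []
for path in glob.glob('data*.csv'):      # hypothetical pattern for the 1000 files
    all_rows.extend(partial_load(path))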
Upvotes: 0
Reputation: 11250
Try the pandas library. It interoperates well with numpy and is far more flexible. With it you can do the following:
import numpy
import pandas

data = pandas.read_csv('file.csv')
# keep only the rows where x equals 2
data = data[data['x'] == 2]
# convert to a numpy array
arr = numpy.asarray(data)
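Since you have 1000 files, a minimal sketch of extending this (again assuming a 'data*.csv' glob pattern for the file names) is to filter each file as it is read and concatenate the results:
import glob
import numpy
import pandas

frames = []
for path in glob.glob('data*.csv'):      # hypothetical pattern for the 1000 files
    df = pandas.read_csv(path)
    frames.append(df[df['x'] == 2])      # keep only the rows where x == 2

data = pandas.concat(frames, ignore_index=True)
arr = numpy.asarray(data)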
You can read more about selecting data in the pandas documentation on indexing.
Upvotes: 1