Reputation: 613
I have large CSVs where I'm only interested in a subset of the rows. In particular, I'd like to read in all the rows which occur before a particular condition is met.
For example, if read_csv would yield the dataframe:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
5 35 1.12 'a'
...
1e9 42 2.15 'd'
Is there some way to read all the rows in the csv until col B exceeds 10? In the above example, I'd like to read in:
A B C
1 34 3.20 'b'
2 24 9.21 'b'
3 34 3.32 'c'
4 24 24.3 'c'
I know how to throw these rows out once I've read the dataframe in, but at this point I've already spent all that computation reading them in. I do not have access to the index of the final row before reading the csv (no skipfooter, please).
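For reference, what I currently do is read everything and then truncate, something like this sketch (my_large_file.csv is just a stand-in name):
import pandas as pd

# Read the whole file (the expensive part I want to avoid), then keep
# only the rows up to and including the first one where B exceeds 10.
df = pd.read_csv("my_large_file.csv")
stop = (df['B'] > 10).idxmax()  # index of the first row where B > 10 (assumes one exists)
df = df.loc[:stop]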
Upvotes: 21
Views: 48393
Reputation: 781
Warning:
pd.read_csv("filename.csv") loads the entire csv into an in-memory DataFrame before processing it (thanks @BlackJack for pointing this out). If the csv is big, @unutbu's answer is more appropriate (or perhaps another library like polars, which can read a file in chunks and apply multiple operations thanks to its query planner).
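For illustration, a lazy polars version could look something like the sketch below (not part of the original approach; it mirrors the B > 10 filter used next):
import polars as pl

# scan_csv only builds a query plan; rows are read and filtered
# when .collect() materializes the result.
df = (
    pl.scan_csv("filename.csv")
      .filter(pl.col("B") > 10)
      .collect()
)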
Building on @joanwa's answer:
import pandas as pd

df = (pd.read_csv("filename.csv")
      [lambda x: x['B'] > 10])
From Wes McKinney's "Python for Data Analysis" chapter on "Advanced pandas":
We cannot refer to the result of load_data until it has been assigned to the temporary variable df. To help with this, assign and many other pandas functions accept function-like arguments, also known as callables.
To show callables in action, consider ...
df = load_data()
df2 = df[df['col2'] < 0]
Can be rewritten as:
df = (load_data()
      [lambda x: x['col2'] < 0])
Upvotes: 7
Reputation: 4159
Instead of boolean indexing or passing a callable, one can use the query method.
import pandas as pd
df = pd.read_csv("my_data.csv").query("B < 10")
I don't know how fast this solution is; however, it should be faster than passing a plain callable, especially a lambda.
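If you want to measure rather than guess, a rough timing sketch (assuming the same my_data.csv) might be:
import timeit
import pandas as pd

df = pd.read_csv("my_data.csv")

# compare query() with boolean indexing and a callable on the same frame
print(timeit.timeit(lambda: df.query("B < 10"), number=100))
print(timeit.timeit(lambda: df[df["B"] < 10], number=100))
print(timeit.timeit(lambda: df[lambda x: x["B"] < 10], number=100))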
Upvotes: -1
Reputation: 164783
You can use the built-in csv module to calculate the appropriate row number. Then use pd.read_csv with the nrows argument:
from io import StringIO
import pandas as pd
import csv
import copy

mycsv = StringIO(""" A B C
34 3.20 'b'
24 9.21 'b'
34 3.32 'c'
24 24.3 'c'
35 1.12 'a'""")

mycsv2 = copy.copy(mycsv)  # copying StringIO object [for demonstration purposes]

with mycsv as fin:
    reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
    header = next(reader)  # consume the header row
    # index of the first data row where column B exceeds 10
    counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

# read up to and including the first row where B exceeds 10
df = pd.read_csv(mycsv2, delim_whitespace=True, nrows=counter+1)
print(df)
A B C
0 34 3.20 'b'
1 24 9.21 'b'
2 34 3.32 'c'
3 24 24.30 'c'
Upvotes: 1
Reputation: 47
I would go the easy route described here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
df[df['B'] > 10]
Upvotes: -2
Reputation: 880577
You could read the csv in chunks. Since pd.read_csv returns an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.
import itertools as IT
import pandas as pd

chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)
# stop requesting chunks once a chunk ends with a row where B >= 10
chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df = pd.concat(chunks)
# the chunks kept may still contain rows where B >= 10; mask them out
mask = df['B'] < 10
df = df.loc[mask]
Or, to avoid having to use df.loc[mask] to remove unwanted rows from the last chunk, perhaps a cleaner solution would be to define a custom generator:
import pandas as pd

def valid(chunks):
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            # every row in this chunk is valid; keep it whole
            yield chunk
        else:
            # keep only the valid rows of the final chunk, then stop
            yield chunk.loc[mask]
            break

chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)
df = pd.concat(valid(chunks))
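To see the early stop in action, here is a small hypothetical demo reusing the valid generator above, with an in-memory csv and a deliberately tiny chunksize:
from io import StringIO
import pandas as pd

data = StringIO("""A,B,C
34,3.20,b
24,9.21,b
34,3.32,c
24,24.3,c
35,1.12,a
""")

# chunksize=2 forces several chunks; valid() stops inside the chunk
# that contains the first row where B >= 10
chunks = pd.read_csv(data, chunksize=2)
df = pd.concat(valid(chunks))
print(df)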
Upvotes: 26