Stacey
Stacey

Reputation: 5097

Conditional Iteration over a dataframe

I have a dataframe df which looks like:

    id              location    grain
0   BBG.XETR.AD.S       XETR    16.545
1   BBG.XLON.VB.S       XLON    6.2154
2   BBG.XLON.HF.S       XLON    NaN
3   BBG.XLON.RE.S       XLON    NaN
4   BBG.XLON.LL.S       XLON    NaN
5   BBG.XLON.AN.S       XLON    3.215
6   BBG.XLON.TR.S       XLON    NaN
7   BBG.XLON.VO.S       XLON    NaN

In reality this dataframe will be much larger. I would like to iterate over this dataframe returning the 'grain' value but I am only interested in the rows that have a value (not NaN) in the 'grain' column. So only returning as I iterate over the dataframe the following values:

16.545
6.2154
3.215

I can iterate over the dataframe using:

for staticidx, row in df.iterrows():
            value= row['grain']

But this returns a value for all rows including those with a NaN value. Is there a way to either remove the NaN rows from the dataframe or skip the rows in the dataframe where grain equals NaN?

Many thanks

Upvotes: 1

Views: 646

Answers (3)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

Let's compare different methods (for 800K rows DF):

In [21]: df = pd.concat([df] * 10**5, ignore_index=True)

In [22]: df.shape
Out[22]: (800000, 3)

In [23]: %timeit df.grain[~pd.isnull(df.grain)]
The slowest run took 5.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 17.1 ms per loop

In [24]: %timeit df.ix[df.grain.notnull(), 'grain']
10 loops, best of 3: 23.9 ms per loop

In [25]: %timeit df[pd.notnull(df['grain'])]
10 loops, best of 3: 35.9 ms per loop

In [26]: %timeit df.grain.ix[df.grain.notnull()]
100 loops, best of 3: 17.4 ms per loop

In [27]: %timeit df.dropna(subset=['grain'])
10 loops, best of 3: 56.6 ms per loop

In [28]: %timeit df.grain[df.grain.notnull()]
100 loops, best of 3: 17 ms per loop

In [30]: %timeit df['grain'].dropna()
100 loops, best of 3: 16.3 ms per loop

Upvotes: 0

Kartik
Kartik

Reputation: 8683

This:

df[pd.notnull(df['grain'])]

Or this:

df['grain].dropna()

Upvotes: 0

Alexander
Alexander

Reputation: 109546

You can specify a list of columns in dropna on which to subset the data:

subset : array-like Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

>>> df.dropna(subset=['grain'])
              id location    grain
0  BBG.XETR.AD.S     XETR  16.5450
1  BBG.XLON.VB.S     XLON   6.2154
5  BBG.XLON.AN.S     XLON   3.2150

Upvotes: 1

Related Questions