user3841581

Reputation: 2747

Getting the content of a pandas row based on some conditions of other row

I have a pandas DataFrame df1 with the following content:

Serial N         year         current
   B              10            14
   B              10            16
   B              11            10
   B              11            
   B              11            15
   C              12            11
   C                            9
   C              12            13
   C              12           
   D               3             4

I would like to count the number of occurrences of each unique serial. If a serial occurs fewer than 2 times, I would like to replace year and current with nan for its rows. I would like to have something like this:

Serial N         year         current
   B              10            14
   B              10            16
   B              11            10
   B              11            
   B              11            15
   C              12            11
   C                             9
   C              12            13
   C              12 
   D              nan           nan      

Upvotes: 1

Views: 82

Answers (2)

root

Reputation: 33793

You can combine value_counts, lt and reindex to get a boolean array of where to change values to nan, and then use loc to make the changes.

import numpy as np

serial_filter = df1['Serial N'].value_counts().lt(2).reindex(df1['Serial N'])
df1.loc[serial_filter.values, ['year', 'current']] = np.nan

The resulting output:

  Serial N  year  current
0        B  10.0     14.0
1        B  10.0     16.0
2        B  11.0     10.0
3        B  11.0      NaN
4        B  11.0     15.0
5        C  12.0     11.0
6        C   NaN      9.0
7        C  12.0     13.0
8        C  12.0      NaN
9        D   NaN      NaN
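
As a hedged alternative (not part of the original answer), the same row-aligned mask can be built with groupby and transform('size'), which returns the group size for every row and so needs no reindex step. The DataFrame below is rebuilt from the question's table so the sketch runs on its own; the name group_size is only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for df1, with values copied from the table in the question.
df1 = pd.DataFrame({
    "Serial N": ["B", "B", "B", "B", "B", "C", "C", "C", "C", "D"],
    "year": [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    "current": [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

# For each row, the number of rows that share its serial.
group_size = df1.groupby("Serial N")["Serial N"].transform("size")

# Blank year/current wherever the serial occurs fewer than 2 times.
df1.loc[group_size < 2, ["year", "current"]] = np.nan

print(df1)
```

Because group_size shares df1's index, it can be passed to loc directly, without .values.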

Upvotes: 1

piRSquared

Reputation: 294348

Setup

import pandas as pd
from io import StringIO

text = """Serial_N         year         current
   B              10            14
   B              10            16
   B              11            10
   B              11            nan
   B              11            15
   C              12            11
   C              nan              9
   C              12            13
   C              12           nan
   D               3             4"""

df1 = pd.read_csv(StringIO(text), sep=r'\s+')
df1.columns = ['Serial N', 'year', 'current']

Now I have the same df1 you showed above.

Solution

serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))
serial_filter = serial_filter[serial_filter > 1]
mask = df1.apply(lambda x: x['Serial N'] in serial_filter.index, axis=1)
df1 = df1[mask]

Demonstration and Explanation

serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))

print(serial_filter)

Serial N
B    5
C    4
D    1
dtype: int64

Produce a count of rows for each unique Serial N
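
The same per-serial counts can likely be obtained without a lambda. The sketch below is an illustration rather than the answer's own code: it compares groupby(...).size() with value_counts() on a column rebuilt from the question's serials. The names by_size and by_value_counts are hypothetical.

```python
import pandas as pd

# Only the serial column is needed to count occurrences.
df = pd.DataFrame({"Serial N": list("BBBBBCCCCD")})

# Group sizes, indexed by serial.
by_size = df.groupby("Serial N").size()

# value_counts sorts by frequency, so sort by label to compare.
by_value_counts = df["Serial N"].value_counts().sort_index()

print(by_size)
print(by_value_counts)
```

Both return a Series indexed by serial, so either can feed the filtering step that follows.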

serial_filter = serial_filter[serial_filter > 1]

print(serial_filter)

Serial N
B    5
C    4
dtype: int64

Redefine it so that it only includes the Serial N values that occur more than once

mask = df1.apply(lambda x: x['Serial N'] in serial_filter.index, axis=1)

print(mask)

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
dtype: bool

Create a filter mask to use on df1

df1 = df1[mask]

print(df1)

  Serial N  year  current
0        B  10.0     14.0
1        B  10.0     16.0
2        B  11.0     10.0
3        B  11.0      NaN
4        B  11.0     15.0
5        C  12.0     11.0
6        C   NaN      9.0
7        C  12.0     13.0
8        C  12.0      NaN

Update df1, keeping only the rows whose Serial N occurs more than once (the D row is dropped)
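
Filtering with the mask removes the rows for serials that occur only once. The question instead asks to keep those rows and set year and current to nan. A minimal sketch of that variant follows, assuming the same data as the question and using isin in place of the row-wise apply. The names counts and keep are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for df1, with values copied from the table in the question.
df1 = pd.DataFrame({
    "Serial N": ["B", "B", "B", "B", "B", "C", "C", "C", "C", "D"],
    "year": [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    "current": [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

# Count rows per serial, then keep the labels that occur more than once.
counts = df1.groupby("Serial N").size()
keep = counts[counts > 1].index

# True for rows whose serial is in the keep list.
mask = df1["Serial N"].isin(keep)

# Blank the value columns on the other rows instead of dropping them.
df1.loc[~mask, ["year", "current"]] = np.nan

print(df1)
```

With this variant df1 still has all ten rows, and only the D row has its year and current replaced.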

Upvotes: 0
