Reputation: 2747
I have a pandas DataFrame df1 with the following content:
Serial N  year  current
B         10    14
B         10    16
B         11    10
B         11
B         11    15
C         12    11
C               9
C         12    13
C         12
D         3     4
I would like to count the number of occurrences of each unique serial. If a serial occurs fewer than 2 times, I would like to set year and current to nan for its rows. I would like to have something like this:
Serial N  year  current
B         10    14
B         10    16
B         11    10
B         11
B         11    15
C         12    11
C               9
C         12    13
C         12
D         nan   nan
Upvotes: 1
Views: 82
Reputation: 33793
You can combine value_counts, lt and reindex to get a boolean array of where to change values to nan, and then use loc to make the changes.
import numpy as np
# True for every row whose Serial N occurs fewer than 2 times
serial_filter = df1['Serial N'].value_counts().lt(2).reindex(df1['Serial N'])
df1.loc[serial_filter.values, ['year', 'current']] = np.nan
The resulting output:
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
9 D NaN NaN
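For reference, the same steps can be written out as a small self-contained sketch. The helper name blank_rare_serials and its parameters are made up for illustration, and the sample frame is typed in by hand from the question with the blank cells written as NaN. It works on a copy so the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Sample frame typed in from the question (blank cells written as NaN).
df1 = pd.DataFrame({
    'Serial N': ['B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D'],
    'year':     [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    'current':  [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

def blank_rare_serials(df, key='Serial N', cols=('year', 'current'), min_count=2):
    """Hypothetical helper: return a copy of df with `cols` set to NaN
    on rows whose `key` value occurs fewer than `min_count` times."""
    out = df.copy()
    counts = out[key].value_counts()      # occurrences per serial
    is_rare = counts.lt(min_count)        # boolean Series indexed by serial
    # Look up the flag of each row's serial, giving one boolean per row.
    row_mask = is_rare.reindex(out[key]).to_numpy()
    out.loc[row_mask, list(cols)] = np.nan
    return out

result = blank_rare_serials(df1)
print(result)
```

Calling .to_numpy() (or .values, as above) matters here: the reindexed Series is labelled by serial rather than by row number, so passing it to loc directly would not align with the rows of the frame.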
Upvotes: 1
Reputation: 294348
import pandas as pd
from io import StringIO
text = """Serial_N year current
B 10 14
B 10 16
B 11 10
B 11 nan
B 11 15
C 12 11
C nan 9
C 12 13
C 12 nan
D 3 4"""
df1 = pd.read_csv(StringIO(text), sep=r'\s+')
df1.columns = ['Serial N', 'year', 'current']
Now I have the same df1 you showed above.
serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))
serial_filter = serial_filter[serial_filter > 1]
mask = df1.apply(lambda x: x['Serial N'] in serial_filter.index, axis=1)
df1 = df1[mask]
Here is what each step does:
serial_filter = df1.groupby('Serial N').apply(lambda x: len(x))
print(serial_filter)
Serial N
B 5
C 4
D 1
dtype: int64
This produces a count for each unique Serial N.
serial_filter = serial_filter[serial_filter > 1]
print(serial_filter)
Serial N
B 5
C 4
dtype: int64
Redefine it so that it only includes the Serial N values that occur more than once.
mask = df1.apply(lambda x: x['Serial N'] in serial_filter.index, axis=1)
print(mask)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
dtype: bool
Create a filter mask to use on df1
df1 = df1[mask]
print(df1)
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
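Update df1 so that it keeps only the rows whose Serial N occurs more than once. Note that this drops the D row rather than setting its year and current to NaN.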
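As a rough sketch, the count, select and mask steps can also be collapsed into a single groupby call. The names kept, group_size and masked are illustrative, and the sample frame is typed in by hand from the question. The first variant drops the rare serials like the steps above do. The second uses the per-row group size to set year and current to NaN instead, which is what the question asks for.

```python
import numpy as np
import pandas as pd

# Sample frame typed in from the question (blank cells written as NaN).
df1 = pd.DataFrame({
    'Serial N': ['B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D'],
    'year':     [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    'current':  [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

# Variant 1: keep only the groups that have more than one row.
kept = df1.groupby('Serial N').filter(lambda g: len(g) > 1)

# Variant 2: compute each row's group size, then blank the two
# columns on rows whose serial occurs fewer than 2 times.
group_size = df1.groupby('Serial N')['Serial N'].transform('size')
masked = df1.copy()
masked.loc[group_size < 2, ['year', 'current']] = np.nan

print(kept)
print(masked)
```

Because transform returns a Series aligned with the original row index, it can be used as a loc mask directly, with no reindexing step.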
Upvotes: 0