Reputation: 804
I have a dataframe like this:
+----+--------------+-----------+---------------------------------------------------+-----------+
| | Filename | Result | IssueType | isBad |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3 | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
I want to group the rows by Filename
and Result
, for which I have made the query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
But I want to remove IssueType
value (to NaN
) if isBad
column is either ('No', 'Tolerable')
and there exists at least one other row with same filename where isBad
column has value 'Bad'
.
If there are no row where isBad
column is 'Bad'
then no changes in IssueType.
( For example, here #10 row IssueType
will be NaN
as #9 has same filename but has isBad = Yes
)
Output Dataframe after that:
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| | Filename | Result | IssueType | isBad | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3 | E0CCG5S238-0 | Tolerable | NaN | NaN | #4's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no change |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad | NaN | Tolerable | #9's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
Is there a way to achieve that?
Upvotes: 2
Views: 546
Reputation: 862741
I think you need mask for compare first isBad
by Series.eq
with GroupBy.transform
and DataFrameGroupBy.any
:
mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')
Or use Series.isin
with Filename
s if isBad
match condition:
mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
Last set missing values in Series.mask
with chained condition for match only Tolerable
:
df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print (df)
Filename Result IssueType isBad
0 E0CCG5S2370 Bad NaN Yes
1 E0CCG5S2370 Bad OCR_Text Misrecognition Yes
2 E0CCG5S2371 Good NaN Yes
3 E0CCG5S2380 Tolerable NaN Tolerable
4 E0CCG5S2380 Tolerable NaN Yes
5 E0CCG5S2390 Tolerable MA_Superscript,subscript. Tolerable
6 E0CCG5S2390 Tolerable Extra Spaces Tolerable
7 E0CCG5S2390 Tolerable MA_Link missing from the DV Tolerable
8 E0CCG5S2390 Tolerable CS_Font Incosistency Tolerable
9 E0CCG5S2420 Bad MLOrphanContent Yes
10 E0CCG5S2420 Bad NaN Tolerable
Upvotes: 1