naqushab
naqushab

Reputation: 804

Pandas delete a column value if there exists a row based on another column value

I have a dataframe like this:

+----+--------------+-----------+---------------------------------------------------+-----------+
|    | Filename     | Result    | IssueType                                         | isBad     |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3  | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad       | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+

I want to group the rows by Filename and Result, for which I have made the query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()

But I want to remove IssueType value (to NaN) if isBad column is either ('No', 'Tolerable') and there exists at least one other row with same filename where isBad column has value 'Bad'.

If there are no row where isBad column is 'Bad' then no changes in IssueType.

( For example, here #10 row IssueType will be NaN as #9 has same filename but has isBad = Yes )

Output Dataframe after that:

+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
|    | Filename     | Result    | IssueType                                         | isBad     |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3  | E0CCG5S238-0 | Tolerable | NaN                                               | NaN       | #4's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no   change |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad       | NaN                                               | Tolerable | #9's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+

Is there a way to achieve that?

Upvotes: 2

Views: 546

Answers (1)

jezrael
jezrael

Reputation: 862741

I think you need mask for compare first isBad by Series.eq with GroupBy.transform and DataFrameGroupBy.any:

mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')

Or use Series.isin with Filenames if isBad match condition:

mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])

Last set missing values in Series.mask with chained condition for match only Tolerable:

df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print (df)
       Filename     Result                    IssueType      isBad
0   E0CCG5S2370        Bad                          NaN        Yes
1   E0CCG5S2370        Bad      OCR_Text Misrecognition        Yes
2   E0CCG5S2371       Good                          NaN        Yes
3   E0CCG5S2380  Tolerable                          NaN  Tolerable
4   E0CCG5S2380  Tolerable                          NaN        Yes
5   E0CCG5S2390  Tolerable    MA_Superscript,subscript.  Tolerable
6   E0CCG5S2390  Tolerable                 Extra Spaces  Tolerable
7   E0CCG5S2390  Tolerable  MA_Link missing from the DV  Tolerable
8   E0CCG5S2390  Tolerable         CS_Font Incosistency  Tolerable
9   E0CCG5S2420        Bad              MLOrphanContent        Yes
10  E0CCG5S2420        Bad                          NaN  Tolerable

Upvotes: 1

Related Questions