Reputation: 61
I used to clean data using SAS but I would like to switch to Python.
I had a large dataset which was scrapped from some filings (html) but included some noisy information and I would like to get rid of these irrelevant data.
Basically, I need to remove certain rows of data after a row with condition being True (however, this might be a list, multiple True/or no True at all; and if there are Trues, I want to identify the last one).
Raw data:
<table>
<tr>
<td>Report_ID</td>
<td>Table_ID</td>
<td>Group_ID</td>
<td>Item_ID</td>
<td>Flag_old</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item1</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item2</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item3</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item4</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item5</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item1</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item2</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item3</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item4</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item1</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item2</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item3</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item4</td>
<td>0</td>
</tr>
</table>
Expected data:
<table>
<tr>
<td>Report_ID</td>
<td>Table_ID</td>
<td>Group_ID</td>
<td>Item_ID</td>
<td>Flag_old</td>
<td>Flag_new</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item3</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>1</td>
<td>item5</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item3</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>2</td>
<td>item4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item3</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>A</td>
<td>1</td>
<td>3</td>
<td>item4</td>
<td>0</td>
<td>0</td>
</tr>
</table>
As you can see from the above, I wanted to identify rows below the rows with condition Flag_old == 1.
Given the structure of the data, I have firstly use groupby to segment my whole dataframe and I was thinking to define a function to select rows and apply the function to the dataframe groupby object and then of course create a new column for the whole dataframe indicating these rows of noisy data.
def lastline(series):
return max(series[series.values == 1].index)
df['lastline'] = df.groupby('id').apply(lastline(df['flag']))
but I got 'int' object is not callable
error.
Could you please advice me how to do this properly? I have been struggling with this for few days now...Many thanks.
Upvotes: 3
Views: 104
Reputation: 862681
I think you need custom function with transform
for return new column:
def f(x):
#get cumulative sum, shift
a = x.cumsum().shift()
#check max value of cumsumed a and chain condition for remove 0 only groups
#convert Trues to 1 by astype
return ((a == a.max()) & (a != 0)).astype(int)
df['Flag_new'] = df.groupby('Group_ID')['Flag_old'].transform(f)
print (df)
Report_ID Table_ID Group_ID Item_ID Flag_old Flag_new
0 A 1 1 item1 0 0
1 A 1 1 item2 0 0
2 A 1 1 item3 1 0
3 A 1 1 item4 0 1
4 A 1 1 item5 0 1
5 A 1 2 item1 1 0
6 A 1 2 item2 0 0
7 A 1 2 item3 1 0
8 A 1 2 item4 0 1
9 A 1 3 item1 0 0
10 A 1 3 item2 0 0
11 A 1 3 item3 0 0
12 A 1 3 item4 0 0
Upvotes: 1
Reputation: 1838
This might be a little complicated for a one-liner with apply
but you can use:
df['flag_new'] = df.groupby("Group_ID").apply(lambda g_df: [0]* len(g_df['Flag_old']) if g_df['Flag_old'].sum() == 0 else [0]* (len(g_df['Flag_old'])-1) +[1]).apply(pd.Series).stack().reset_index(drop=True)
or you can use transform
:
df['flag_new'] = df.groupby("Group_ID")['flag'].transform(lambda x: [0]* len(x) if x.sum() == 0 else [0]* (len(x)-1) +[1])
In both cases the output will be:
Report_ID Table_ID Group_ID Item_ID Flag_old Flag_new
0 A 1 1 item1 0 0
1 A 1 1 item2 0 0
2 A 1 1 item3 1 0
3 A 1 1 item4 0 1
4 A 1 2 item1 1 0
5 A 1 2 item2 0 0
6 A 1 2 item3 1 0
7 A 1 2 item4 0 1
8 A 1 3 item1 0 0
9 A 1 3 item2 0 0
10 A 1 3 item3 0 0
11 A 1 3 item4 0 0
Upvotes: 0