Reputation: 113
I got a dataframe like below. I want to make a new column with the total steps count. I got a table like below. You can see that ID 1 has 5 steps.
+----+--------------------------------------------------------+
| ID | Steps |
+----+--------------------------------------------------------+
| 1 | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| 2 | <DIV><P>Another step</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
+----+--------------------------------------------------------+
I want to use the ‘DIV’ to count the total amount of steps by the right ID and make a new column with the total count of the steps.
+----+--------------------------------------------------------+-------------+
| ID | Steps | Total_Steps |
+----+--------------------------------------------------------+-------------+
| 1 | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | 10 |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| 2 | <DIV><P>Another step</P></DIV> | 5 |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| 3 | <DIV><P>Just a step</P></DIV> | 4 |
| | <DIV><P>Just a step</P></DIV> | |
| | <DIV><P>Just a step</P></DIV> | |
| | <DIV><P>Just a step</P></DIV> | |
+----+--------------------------------------------------------+-------------+
Upvotes: 0
Views: 44
Reputation: 863236
Use Series.str.count
with GroupBy.transform
and sum
:
df['Total_Steps'] = df['Steps'].str.count('<DIV>').groupby(df['ID'].ffill()).transform('sum')
print (df)
ID Steps Total_Steps
0 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
1 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
2 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
3 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
4 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
5 2 <DIV><P>Another step</P></DIV> 5
6 2 <DIV><P>Something</P></DIV> 5
7 2 <DIV><P>Something</P></DIV> 5
8 2 <DIV><P>Something</P></DIV> 5
9 2 <DIV><P>Something</P></DIV> 5
If need only first values add numpy.where
with Series.duplicated
:
s = df['Steps'].str.count('<DIV>').groupby(df['ID'].ffill()).transform('sum')
df['Total_Steps'] = np.where(df['ID'].duplicated(), np.nan, s)
#possible mixed values - numeric with empty strings, but then some function should failed
#df['Total_Steps'] = np.where(df['ID'].duplicated(), '', s)
print (df)
ID Steps Total_Steps
0 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10.0
1 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
2 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
3 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
4 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
5 2 <DIV><P>Another step</P></DIV> 5.0
6 2 <DIV><P>Something</P></DIV> NaN
7 2 <DIV><P>Something</P></DIV> NaN
8 2 <DIV><P>Something</P></DIV> NaN
9 2 <DIV><P>Something</P></DIV> NaN
Upvotes: 1
Reputation: 46409
Why not like this:
df['Total_Steps']=df['steps'].str.contains('\<Div\>\<P\>').count()
Upvotes: 0