Reputation: 1570
I'm banging my head against a wall: I'd like to add all values by row for all rows that have an identical value in the column 'fusion'. Here a dummy example:
import pandas as pd
rows_l = [['AFF1_KMT2A', 3.0, 1.0, 1.0, 1.0, 4, 6.0],
['AFF1_KMT2A', 3.0, 2.0, 1.0, 0.0, 3, 6.0],
['TCF3_PBX1', 3.0, 1.0, 1.0, 0.0, 3, 5.0],
['TCF3_PBX1', 0.0, 0.0, 0.0, 1.0, 1, 1.0],
['ABL1_BCR', 1.0, 1.0, 1.0, 2.0, 4, 5.0]]
col_l = ['fusion', 'FusionCatcher', 'Manta', 'STARfusion', 'TopHat','tool_count', 'tot']
my_df = pd.DataFrame(rows_l, columns=col_l)
Gives me this my_df:
fusion | FusionCatcher | Manta | STARfusion | TopHat | tool_count | tot | |
---|---|---|---|---|---|---|---|
0 | ABL1_BCR | 1.0 | 1.0 | 1.0 | 2.0 | 4 | 5.0 |
1 | AFF1_KMT2A | 3.0 | 2.0 | 1.0 | 0.0 | 3 | 6.0 |
2 | TCF3_PBX1 | 3.0 | 1.0 | 1.0 | 0.0 | 3 | 5.0 |
3 | AFF1_KMT2A | 3.0 | 1.0 | 1.0 | 1.0 | 4 | 6.0 |
4 | TCF3_PBX1 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 1.0 |
Of course in my real file the identical lines are not always one line apart, but all over the place...
My output should look like this:
fusion | FusionCatcher | Manta | STARfusion | TopHat | tool_count | tot | |
---|---|---|---|---|---|---|---|
0 | AFF1_KMT2A | 6.0 | 3.0 | 2.0 | 1.0 | 7 | 12.0 |
1 | TCF3_PBX1 | 3.0 | 1.0 | 1.0 | 1.0 | 4 | 6.0 |
2 | ABL1_BCR | 1.0 | 1.0 | 1.0 | 2.0 | 4 | 5.0 |
So far I've only found df['whatever_you_want_to_sum'].groupby(df['col_names_if_same_wil_be_bundled'])
but as this seems to work only for 2 columns that does not help me much.
Is there any particular python pandas
way to do this or should I just continue to write my for-loops
until the prints look right?
Upvotes: 1
Views: 158
Reputation: 1646
Can you try using pivot tables?
pd.pivot_table(my_df, index="fusion", values=col_l, aggfunc=np.sum)
You also need to import numpy as np
beforehand.
You get a pivot table as shown:
Upvotes: 1