Reputation: 121
I have a pandas dataframe in python, with the following structure:
Date | A | B | C | D | E | F | G | H |
---|---|---|---|---|---|---|---|---|
2022-01-01 10:00:00 | 1 | 3 | 0 | 4 | 1 | 0 | 0 | 1 |
2022-01-01 11:00:00 | 0 | 1 | 1 | 2 | 0 | 2 | 3 | 5 |
2022-01-01 12:00:00 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
2022-01-01 13:00:00 | 0 | 0 | 2 | 0 | 0 | 3 | 1 | 2 |
2022-01-02 10:00:00 | 0 | 3 | 0 | 1 | 2 | 4 | 4 | 1 |
2022-01-02 11:00:00 | 0 | 0 | 6 | 0 | 0 | 0 | 5 | 0 |
2022-01-02 12:00:00 | 0 | 0 | 6 | 0 | 2 | 4 | 5 | 3 |
2022-01-02 13:00:00 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 |
This is a snippet of my real dataframe, which has lots of "letter" columns (more than 100). My problem is that I need to obtain the sum of the numbers for each datetime and letter for different combinations.
This was answered in All possible combinations as new columns of a dataframe based on primary column, but it computes all possible combinations from N to M elements. For a df of 100 columns, all possible combinations from 1 sigle letter column to the full combined 100 is madness and impossible to compute.
Fortunately, I am only interested in certain combinations (again, ca. 50 in the real df), which are given in a list:
list_possibilities = ['A+B', 'A+B+D', 'B+D+E+G', 'F+H', 'D+E+F+G+H', 'D', 'F', 'G+H']
As you can see, single columns such as "D" or "F" may also be a possibility. So the final df would be fairly compact, and wouldn't need a brute-force computation of thousands of combinations (here shown only for the first two rows for simplicity):
Date | A+B | A+B+D | B+D+E+G | F+H | D+E+F+G+H | D | F | G+H |
---|---|---|---|---|---|---|---|---|
2022-01-01 10:00:00 | 4 | 8 | 8 | 1 | 6 | 4 | 0 | 1 |
2022-01-01 11:00:00 | 1 | 3 | 6 | 7 | 12 | 2 | 2 | 8 |
Knowing the prior structure of the combinations allows to exponentially decrease the combinations and sums pandas must do, but I am unable to generalize the previous code of the solution to this case. Any ideas?
Upvotes: 1
Views: 69
Reputation: 260640
Use concat
in a dictionary comprehension:
out = (pd
.concat({cols: df.set_index('Date')[cols.split('+')].sum(axis=1)
for cols in list_possibilities}, axis=1)
.reset_index()
)
Output:
Date A+B A+B+D B+D+E+G F+H D+E+F+G+H D F G+H
0 2022-01-01 10:00:00 4 8 8 1 6 4 0 1
1 2022-01-01 11:00:00 1 3 6 7 12 2 2 8
2 2022-01-01 12:00:00 2 3 3 1 2 1 1 0
3 2022-01-01 13:00:00 0 0 1 5 6 0 3 3
4 2022-01-02 10:00:00 3 4 10 5 12 1 4 5
5 2022-01-02 11:00:00 0 0 5 0 5 0 0 5
6 2022-01-02 12:00:00 0 0 7 7 14 0 4 8
7 2022-01-02 13:00:00 0 1 4 0 4 1 0 0
Upvotes: 4