Reputation: 1

Problem with Dataframe Aggregation on Pandas

I have a dataframe structured like this:

<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</tr>
</table>

Where 'SECOND' is a list and 'COUNT' is a list of occurrencies (for example I have 2 occurrencies of 'A', 1 occurr of 'B' as shown in the first row)

What I want to do is to aggregate this dataframe grouping by 'FIRST' and concatenating 'SECOND' and 'COUNT' (summing 'COUNT' where 'SECOND' is the same), obtaining something like this:

<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B','E']</td><td>['3','1','1']</tr>
<tr><td>2</td><td>['C','D','F']</td><td>['3','1','1']</tr>
</table>

with : grp = df.groupby('FIRST').agg(SECOND_LIST=('SECOND',list), CNT_LIST=('CNT',list)).reset_index() I manage to group by FIRST concatenating 'SECOND' and 'COUNT', what do I have to do in order to group by 'SECOND' and sum 'COUNT' values where equal?

Upvotes: 0

Answers (1)

shmakovpn

Reputation: 831

Let's solve the problem step by step, so that it is most clear what is happening. All explanations are given in the comments in the code below. In short, list strings are converted to lists. Lists are combined into dictionaries. When grouping, dictionaries are combined using a specially written function. The result is laid out in the fields, all the excess is cut off.

import pandas as pd

s1 = """\
<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</tr>
</table>
"""

tables = pd.read_html(s1)
df = tables[0]
# now we have the Pandas Dataframe with columns: 'FIRST', 'SECOND', 'COUNT'
# 'FIRST' contains integers
# 'SECOND' contains strings, not lists of strings
# 'COUNT' contains strings, not lists of integers

import ast
# convert 'SECOND' from string to list of strings
df['SECOND_LIST'] = df['SECOND'].apply(lambda x: ast.literal_eval(x))
# convert 'COUNT' from string to list of integers
df['COUNT_LIST'] = df['COUNT'].apply(lambda x: list(map(int, ast.literal_eval(x))))

from typing import List, Dict, Set  # import support for some type hints


def merge_dicts_sum(dict_list: List[Dict[str, int]]) -> Dict[str, int]:
    """
    merge a list of dicts to one dict, summing values for the same keys
    """
    keys: Set[str] = set()  # set of all unique keys of all dicts in list of dicts
    for dict_item in dict_list:
        keys.update(dict_item.keys())
    result_dict: Dict[str, int] = {}  # we will collect sums of values here
    for key in keys:
        result_dict[key] = 0  #
        for dict_item in dict_list:
            if key in dict_item:
                result_dict[key] += dict_item[key]
    return dict(sorted(result_dict.items()))  # sort result by key then return it

# create a new dataframe by grouping by field 'FIRST' and aggregating dicts from SECOND_COUNT_DICT to one list
df_gr = df.groupby('FIRST').agg(SECOND_COUNT_DICT_LIST=('SECOND_COUNT_DICT',list))

# megre dicts from 'SECOND_COUNT_DICT_LIST' to one dict using function 'merge_dicts_sum'
df_gr['MERGED_DICT'] = df_gr['SECOND_COUNT_DICT_LIST'].apply(merge_dicts_sum)

df_gr['SECOND'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.keys()))  # place keys from 'MERGED_DICT' to field 'SECOND'
df_gr['COUNT'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.values()))  # place values from 'MERGED_DICT' to field 'COUNT'

df_result = df_gr[['SECOND', 'COUNT']]  # create a new dataframe by remaining only 'SECOND' and 'COUNT' fields

# store df_result as html
result_str = df_result.to_html(index=False)

print(f"{result_str}")

Upvotes: 1

Problem with Dataframe Aggregation on Pandas

Answers (1)

Related Questions