Trying to filter out a column in a dataframe by applying conditions on some other column

Question

I have 3 columns in a csv file : account_id , game_variant , no_of_games.... table looks something like this


account_id    game_variant   no_of_games
130               a             2
145               c             1
130               b             4
130               c             1
142               a             3
140               c             2
145               b             5

So, i want to extract no of games played in variant a,b,c,a∩b,b∩c,a∩c,a∩b∩c

I was able to extract games played in a,b,c individually by grouping with game_variant and doing sum over no_of_games but unable to logically put in the intersections part. Pls help me with that

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

Thanks in advance

Jakob · Accepted Answer

Here the solution which will return the intersection based on a per-player level. This uses additionally the defaultdict, as this is very convenient for that case. I'll explain the code inline

from itertools import combinations
import pandas
from collections import defaultdict
from pprint import pprint  # only needed for pretty printing of dictionary

df = pandas.read_csv('df.csv', sep='\s+')  # assuming the data frame is in a file df.csv

# group by account_id to get subframes which only refer to one account.
data_agg2 = df.groupby(['account_id'])

# a defaultdict is a dictionary, where when no key is present, the function defined
# is used to create the element. This eliminates the check, if a key is
# already present or to set all combinations in advance.
games_played_2 = defaultdict(int)

# iterate over all accounts
for el in data_agg2.groups:
    # extract the sub-dataframe from the gouped function
    tmp = data_agg2.get_group(el)
    # print(tmp)  # you can uncomment this to see each account
    
    # This is in principle the same loop as suggested before. However, as not every
    # player has played all variants, one only has to create the number of combinations
    # necessary for that player
    for i in range(len(tmp.loc[:, 'no_of_games'])):
        # As now the game_variant is a column and not the index, the first part of zip
        # is slightly adapted. This loops over all combinations of variants for the
        # current account.
        for comb, combsum in zip(combinations(tmp.loc[:, 'game_variant'], i+1), combinations(tmp.loc[:, 'no_of_games'].values, i+1)):
            # Here, each variant combination gets a unique key. Comb is sorted, as the
            # variants might be not in alphabetic order. The number of games played for
            # each variant for that player are added to the value of all players before.
            games_played_2['_'.join(sorted(comb))] += sum(combsum)

pprint (games_played_2)

# returns
>> defaultdict(,
            {'a': 5,
             'a_b': 6,
             'a_b_c': 7,
             'a_c': 3,
             'b': 9,
             'b_c': 11,
             'c': 4})

As you already extracted the number of games played by their variants, you could simply add them up. If you want to do that automatically, you could use itertools.combinations on that in a loop, which iterates over all possible combination lengths:

from itertools import combinations
import pandas
import numpy as np
from pprint import pprint  # only needed for pretty printing of dictionary

df = pandas.read_csv('df.csv', sep='\s+')  # assuming the data frame is in a file df.csv

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

games_played = {}

for i in range(len(data_agg.loc[:, 'no_of_games'])):
    for comb, combsum in zip(combinations(data_agg.index, i+1), combinations(data_agg.loc[:, 'no_of_games'].values, i+1)):
        games_played['_'.join(comb)] = sum(combsum)

pprint(games_played)

returns:

>> {'a': array([5], dtype=int64),
>>  'a_b': array([14], dtype=int64),
>>  'a_b_c': array([18], dtype=int64),
>>  'a_c': array([9], dtype=int64),
>>  'b': array([9], dtype=int64),
>>  'b_c': array([13], dtype=int64),
>>  'c': array([4], dtype=int64)}

'combinations(sequence, number)' returns an iterator of all combinations of number elements in sequence. Therefore, to get all possible combinations, you have to iterate over all numbers from 1 to len(sequence. This is what the first for loop does.

The next for loop is composed of two iterators: One over the index of the aggregated data (combinations(data_agg.index, i+1)), one over the actual number of games played in each variant (combinations(data_agg.loc[:, 'no_of_games'].values, i+1)). So comb should always be a list of the variants, and combsum a list of the number of games played per variant. Note here, that to get all values you have to use .loc[:, 'no_games'], not .loc['no_games'], as the latter searches for an index named 'no_games', while it is a column name.

Then, I set the key of the dictionary to the combined string of the variant-list, and the value to the sum of the elements of the number of games played.

Trying to filter out a column in a dataframe by applying conditions on some other column

Answers (1)

Related Questions