Steve
Steve

Reputation: 331

Python pandas dataframe get all combinations of column values?

I have a pandas dataframe which looks like this:

        colour   points
0         red         1
1      yellow        10
2       black        -3

Then I'm trying to do the following algorithm:

combos = []
points = []

for i1 in range(len(df)):
  for i2 in range(len(df)):

    colour_main      = df['colour'].values[i1]
    colour_secondary = df['colour'].values[i2]
    combo = colour_main + "_" + colour_secondary

    point1 = df['points'].values[i1]
    point2 = df['points'].values[i2]
    new_points = point1 + point2

    combos.append(combo)
    points.append(new_points)

df_new = pd.DataFrame({'colours': combos,
                       'points': points})

print(df_new)

I want to get all combinations and sum points:

Example:

red_yellow =  1 + (-10) = -9
red_black  =  1 + ( +3) =  4
black_red  = -3 + ( -1) = -4

The output I currently get:

         colours  points
0        red_red       2
1     red_yellow      11
2      red_black      -2
3     yellow_red      11
4  yellow_yellow      20
5   yellow_black       7
6      black_red      -2
7   black_yellow       7
8    blac_kblack      -6

The output I'm looking for:

red_yellow       -9
red_black         4
yellow_red        9
yellow_black     13
black_red        -4
black_yellow    -13

I don't know how to apply my logic to this code, also I bet there is a more simplest way to get all combinations without doing two loops, but currently, that's the only thing that comes to my mind.

I would like to:

Upvotes: 3

Views: 1803

Answers (4)

unutbu
unutbu

Reputation: 879561

Here is a timeit comparison of a few alternatives.

| method             | ms per loop |
|--------------------+-------------|
| alt2               |        2.36 |
| using_concat       |        3.26 |
| using_double_merge |        22.4 |
| orig               |        22.6 |
| alt                |        45.8 |

The timeit results were generated using IPython:

In [138]: df = make_df(20)

In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop

import numpy as np
import pandas as pd


def alt(df):
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]

    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

def alt2(df):
     points = np.add.outer(df['points'], -df['points'])
     color = pd.MultiIndex.from_product([df['colour'], df['colour']])
     mask = color.labels[0] != color.labels[1]
     color = color.map('_'.join)
     result = pd.DataFrame({'points':points.ravel(), 'color':color})
     result = result.loc[mask]
     return result

def orig(df):
    combos = []
    points = []

    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary

                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2

                combos.append(combo)
                points.append(new_points)

    return pd.DataFrame({'color':combos, 'points':points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""    
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
           .reset_index()
           .drop(['colour', 'points'], 1)
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))

    new['points_y'] *= -1
    new['sum'] = new.sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x),1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N), 
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df

The main idea in alt2 is to use np.add_outer to construct an addition table out of df['points']:

In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]: 
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])

ravel is used to make the array 1-dimensional:

In [152]: points.ravel()
Out[152]: array([  0,  -9,   4,   9,   0,  13,  -4, -13,   0])

and the color combinations are generated with pd.MultiIndex.from_product:

In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]: 
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
       'yellow_black', 'black_red', 'black_yellow', 'black_black'],
      dtype='object')

A mask is generated to remove duplicates:

mask = color.labels[0] != color.labels[1]

and then the result is generated from these parts:

 result = pd.DataFrame({'points':points.ravel(), 'color':color})
 result = result.loc[mask]

The idea behind alt is explained in my original answer, here.

Upvotes: 4

rafaelc
rafaelc

Reputation: 59274

d = df.set_index('colour').to_dict()['points']
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2= pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})

    comb             values
0   red_yellow       -9
1   red_black        4
2   yellow_black     13
0   yellow_red       9
1   black_red        -4
2   black_yellow    -13

Upvotes: 2

sacuL
sacuL

Reputation: 51335

This is a bit long-winded, but gets you the output you want:

new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
       .reset_index()
       .drop(['colour', 'points'], 1)
       .merge(df.set_index('colour'), left_on='level_0', right_index=True)
       .merge(df.set_index('colour'), left_on='level_1', right_index=True))


new['points_x'] *= -1

new['sum'] = new.sum(axis=1)

new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)

new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x),1)


>>> new
  level_0 level_1  sum       colours
3  yellow     red   -9    yellow_red
6   black     red    4     black_red
1     red  yellow    9    red_yellow
7   black  yellow   13  black_yellow
2     red   black   -4     red_black
5  yellow   black  -13  yellow_black

Upvotes: 2

Interested_Programmer
Interested_Programmer

Reputation: 322

You have to change this line in your code

new_points = point1 + point2

to this

new_points = point1 - point2

Upvotes: 1

Related Questions