Iterating DataFrame Rows to Create New Column Using Numpy

Question

I'm trying to reshape a dataframe to get it into a more usable structure for graphing and right now what I've been doing is reshaping the df using iterrows or itertuples, following What is the most efficient way to loop through dataframes with pandas?

Below is a overly simplified dataset, but the real dataset will have tens of thousands of more rows.

group    subtopic    code
fruit    grapes      110A
fruit    apple       110B
meat     pork        220A
meat     chicken     220B
meat     duck        220C
vegetable lettuce    300A
vegetable tomato     310A
vegetable asparagus  320A

Basically, I want to create a new column ("code2") based on whether column ("code") shares the same value in column "group".

I tried running the following code:

df = pd.read_excel(file1, sheetname = 'Sheet3')

def reshape_iterrows(df):
    reshape = []

    for i, j, in df.iterrows():
        for _, k in df.iterrows():
            if (j['code'] == k['code']):
                pass
            elif j['group'] == 'nan':
                reshape.append({'code1':j['code'],
                       'code2': j['code'],
                       'group': 'None'})
            elif (j['group'] == k['group']):
                reshape.append({'code1': j['code'],
                       'code2': k['code'],
                       'group': j['group']})
            else:
                pass
        return reshape

reshape_iterrows(df)

or using itertuples:

def reshape_iterrows(df):
    reshape = []

    for row1 df.itertuples():
        for row2 in df.itertuples():
            if (row1[3] == row2[3]):
                pass
            elif row1[1] == 'nan':
                reshape.append({'code1':row1[3],
                       'code2': row1[3],
                       'group': 'None'})
            elif (row1[1] == row2[1]):
                reshape.append({'code1': row1[3],
                       'code2': row2[3],
                       'group': row1[1]})
            else:
                pass
        return reshape

I pass reshape to pd.DataFrame() and the expected output is below, which I then use the code1 and code2 columns as the source and target parameters within nx.from_pandas_edgelist to generate a graph.

    code1   code2   group
0   110A    110B    fruit
1   110B    110A    fruit
2   220A    220B    meat
3   220A    220C    meat
4   220B    220A    meat
5   220B    220C    meat
6   220C    220A    meat
7   220C    220B    meat
8   300A    300B    vegetable
9   300A    300C    vegetable
10  300B    300A    vegetable
11  300B    300C    vegetable
12  300C    300A    vegetable
13  300C    300B    vegetable

Like others I'm interested in finding a more efficient way to iterate perhaps using Numpy's boolean operations? Looking for guidance on how I would approach getting the same result using vectorized/array operations.

Thanks!

Scott Boston · Accepted Answer

You can try:

from itertools import permutations
df.groupby('group')['code']\
  .apply(lambda x: pd.DataFrame(list(permutations(x.tolist(),2))))\
  .add_prefix('code').reset_index().drop('level_1',axis=1)

Output:

        group code0 code1
0       fruit  110A  110B
1       fruit  110B  110A
2        meat  220A  220B
3        meat  220A  220C
4        meat  220B  220A
5        meat  220B  220C
6        meat  220C  220A
7        meat  220C  220B
8   vegetable  300A  310A
9   vegetable  300A  320A
10  vegetable  310A  300A
11  vegetable  310A  320A
12  vegetable  320A  300A
13  vegetable  320A  310A

Iterating DataFrame Rows to Create New Column Using Numpy

Answers (2)

Related Questions