Reputation: 191
I'm trying to reshape a dataframe to get it into a more usable structure for graphing and right now what I've been doing is reshaping the df using iterrows or itertuples, following What is the most efficient way to loop through dataframes with pandas?
Below is a overly simplified dataset, but the real dataset will have tens of thousands of more rows.
group subtopic code
fruit grapes 110A
fruit apple 110B
meat pork 220A
meat chicken 220B
meat duck 220C
vegetable lettuce 300A
vegetable tomato 310A
vegetable asparagus 320A
Basically, I want to create a new column ("code2") based on whether column ("code") shares the same value in column "group".
I tried running the following code:
df = pd.read_excel(file1, sheetname = 'Sheet3')
def reshape_iterrows(df):
reshape = []
for i, j, in df.iterrows():
for _, k in df.iterrows():
if (j['code'] == k['code']):
pass
elif j['group'] == 'nan':
reshape.append({'code1':j['code'],
'code2': j['code'],
'group': 'None'})
elif (j['group'] == k['group']):
reshape.append({'code1': j['code'],
'code2': k['code'],
'group': j['group']})
else:
pass
return reshape
reshape_iterrows(df)
or using itertuples:
def reshape_iterrows(df):
reshape = []
for row1 df.itertuples():
for row2 in df.itertuples():
if (row1[3] == row2[3]):
pass
elif row1[1] == 'nan':
reshape.append({'code1':row1[3],
'code2': row1[3],
'group': 'None'})
elif (row1[1] == row2[1]):
reshape.append({'code1': row1[3],
'code2': row2[3],
'group': row1[1]})
else:
pass
return reshape
I pass reshape to pd.DataFrame() and the expected output is below, which I then use the code1 and code2 columns as the source and target parameters within nx.from_pandas_edgelist to generate a graph.
code1 code2 group
0 110A 110B fruit
1 110B 110A fruit
2 220A 220B meat
3 220A 220C meat
4 220B 220A meat
5 220B 220C meat
6 220C 220A meat
7 220C 220B meat
8 300A 300B vegetable
9 300A 300C vegetable
10 300B 300A vegetable
11 300B 300C vegetable
12 300C 300A vegetable
13 300C 300B vegetable
Like others I'm interested in finding a more efficient way to iterate perhaps using Numpy's boolean operations? Looking for guidance on how I would approach getting the same result using vectorized/array operations.
Thanks!
Upvotes: 1
Views: 186
Reputation: 153460
You can try:
from itertools import permutations
df.groupby('group')['code']\
.apply(lambda x: pd.DataFrame(list(permutations(x.tolist(),2))))\
.add_prefix('code').reset_index().drop('level_1',axis=1)
Output:
group code0 code1
0 fruit 110A 110B
1 fruit 110B 110A
2 meat 220A 220B
3 meat 220A 220C
4 meat 220B 220A
5 meat 220B 220C
6 meat 220C 220A
7 meat 220C 220B
8 vegetable 300A 310A
9 vegetable 300A 320A
10 vegetable 310A 300A
11 vegetable 310A 320A
12 vegetable 320A 300A
13 vegetable 320A 310A
Upvotes: 2
Reputation: 3306
It might not be the most efficient, but here is what I have tried. I put too much efforts into it to just let my answer go to waste :)
An upside of my answer would be that all the steps are clear. And if you need to do something in between (or realize you just want the names, and not the codes, you can just comment a line for instance).
import pandas as pd
from itertools import permutations
def get_data():
return {
'group' : [
'fruit', 'fruit',
'meat', 'meat', 'meat',
'vegetable', 'vegetable', 'vegetable'
],
'subtopic' : [
'grapes', 'apple',
'pork', 'chicken', 'duck',
'lettuce', 'tomato', 'asparagus'
],
'code' : [
'110A', '110B',
'220A', '220B', '220C',
'300A', '310A', '320A'
]
}
# Used to retrieve code for specific item
def make_code_map(df):
return dict(df[['subtopic', 'code']].to_dict('split')['data'])
# Used to retrieve group for specific item.
def make_group_map(df):
return dict(df[['subtopic', 'group']].to_dict('split')['data'])
if __name__ == '__main__':
df = pd.DataFrame(get_data())
mapping = make_code_map(df)
group_map = make_group_map(df)
graph_edges = []
for name, group in df.groupby('group'):
graph_edges.extend( permutations(group['subtopic'].tolist(), 2) )
ndf = pd.DataFrame(graph_edges, columns=['code1', 'code2'])
# Applying the group map to get all the correct groups for each
# item.
ndf['group'] = ndf['code1'].apply(lambda x:group_map[x])
# Replace each item with its corresponding code.
ndf = ndf.replace(mapping)
print(ndf)
# code1 code2 group
# 0 110A 110B fruit
# 1 110B 110A fruit
# 2 220A 220B meat
# 3 220A 220C meat
# 4 220B 220A meat
# 5 220B 220C meat
# 6 220C 220A meat
# 7 220C 220B meat
# 8 300A 310A vegetable
# 9 300A 320A vegetable
# 10 310A 300A vegetable
# 11 310A 320A vegetable
# 12 320A 300A vegetable
# 13 320A 310A vegetable
Upvotes: 2