Reputation: 2333
I have a dataframe df with two columns as follows:
ID Country_pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
I want to output all possible pairs of countries in two columns as follows:
ID Country1 Country2
X France USA
X USA France
Y USA UK
Y UK France
Y USA France
Doing this gives me the output I want:
result = pd.DataFrame()
for index, row in df.iterrows():
x = row['Country_pairs']
temp = pd.DataFrame(data=x, columns=['Country1','Country2'])
temp['PID'] = row['PID']
result = result.append(temp)
print result
The dataframe is over 200 million rows, so this is a very slow since I'm looping. I was wondering if there is a faster solution?
Upvotes: 3
Views: 1166
Reputation: 28644
200 million rows is massive, no point running this computation on Pandas. As suggested in the comments, use Apache spark. or if it's in a database, u could possibly work something there.
The solution I proffer works on small datasets suited to Pandas ... take the data out of Pandas, use the itertools functions - product and chain, and build back the dataframe. It should be reasonably fast, but certainly not for 200 million rows
#using the data provided by @Trenton
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
from itertools import product, chain
step1 = chain.from_iterable(product(first, last)
for first, last in
df.to_numpy())
res = pd.DataFrame(((first, *last) for first, last in step1),
columns=['ID', 'Country1', 'Country2'])
res
ID Country1 Country2
0 X France USA
1 X USA France
2 Y USA UK
3 Y UK France
4 Y USA France
Upvotes: 1
Reputation: 62393
import pandas as pd
# setup the dataframe
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
ID Country_Pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
# separate each tuple to its own row with explode
df2 = df.explode('Country_Pairs')
# separate each value in the tuple to its own column
df2[['Country1', 'Counrtry2']] = pd.DataFrame(df2.Country_Pairs.tolist(), index=df2.index)
# delete Country_Pairs
df2.drop(columns=['Country_Pairs'], inplace=True)
ID Country1 Counrtry2
0 X France USA
0 X USA France
1 Y USA UK
1 Y UK France
1 Y USA France
Upvotes: 1
Reputation: 1690
You are looking for .explode()
result = df.explode('Country_pairs')
result["Country1"] = result.Country_pairs.apply(lambda t:t[0])
result["Country2"] = result.Country_pairs.apply(lambda t:t[1])
del result["Country_pairs"]
Upvotes: 1
Reputation: 22031
give a trial at this:
explode_df = df.apply(pd.Series.explode)
split_country = explode_df.apply(lambda x: ' '.join(x['Country_pairs']), axis=1).str.split(expand=True)
# whether you would like to combine the results
res = pd.concat([explode_df, split_country], axis=1)
Upvotes: 1