BKS
BKS

Reputation: 2333

converting a list of pairs in a column into two columns using pandas dataframe

I have a dataframe df with two columns as follows:

    ID  Country_pairs
0   X   [(France, USA), (USA, France)]
1   Y   [(USA, UK), (UK, France), (USA, France)]

I want to output all possible pairs of countries in two columns as follows:

ID  Country1    Country2
X   France      USA
X   USA         France
Y   USA         UK
Y   UK          France
Y   USA         France

Doing this gives me the output I want:

   result = pd.DataFrame() 
   for index, row in df.iterrows():
        x = row['Country_pairs']
        temp = pd.DataFrame(data=x, columns=['Country1','Country2'])
        temp['PID'] = row['PID']
        result = result.append(temp)
    print result

The dataframe is over 200 million rows, so this is a very slow since I'm looping. I was wondering if there is a faster solution?

Upvotes: 3

Views: 1166

Answers (4)

sammywemmy
sammywemmy

Reputation: 28644

200 million rows is massive, no point running this computation on Pandas. As suggested in the comments, use Apache spark. or if it's in a database, u could possibly work something there.

The solution I proffer works on small datasets suited to Pandas ... take the data out of Pandas, use the itertools functions - product and chain, and build back the dataframe. It should be reasonably fast, but certainly not for 200 million rows

#using the data provided by @Trenton
df = pd.DataFrame({'ID': ['X', 'Y'],
                   'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
                                     [('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})

from itertools import product, chain
step1 = chain.from_iterable(product(first, last) 
                            for first, last in
                            df.to_numpy())
res = pd.DataFrame(((first, *last) for first, last in step1),
                   columns=['ID', 'Country1', 'Country2'])

res


   ID   Country1    Country2
0   X   France         USA
1   X   USA          France
2   Y   USA             UK
3   Y   UK           France
4   Y   USA          France

Upvotes: 1

Trenton McKinney
Trenton McKinney

Reputation: 62393

import pandas as pd

# setup the dataframe
df = pd.DataFrame({'ID': ['X', 'Y'],
                   'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
                                     [('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})

  ID                             Country_Pairs
0  X            [(France, USA), (USA, France)]
1  Y  [(USA, UK), (UK, France), (USA, France)]

# separate each tuple to its own row with explode
df2 = df.explode('Country_Pairs')

# separate each value in the tuple to its own column
df2[['Country1', 'Counrtry2']] = pd.DataFrame(df2.Country_Pairs.tolist(), index=df2.index)

# delete Country_Pairs
df2.drop(columns=['Country_Pairs'], inplace=True)

  ID Country1 Counrtry2
0  X   France       USA
0  X      USA    France
1  Y      USA        UK
1  Y       UK    France
1  Y      USA    France

Upvotes: 1

Adrien Matissart
Adrien Matissart

Reputation: 1690

You are looking for .explode()

result = df.explode('Country_pairs')
result["Country1"] = result.Country_pairs.apply(lambda t:t[0])
result["Country2"] = result.Country_pairs.apply(lambda t:t[1])
del result["Country_pairs"]

Upvotes: 1

Marco Cerliani
Marco Cerliani

Reputation: 22031

give a trial at this:

explode_df = df.apply(pd.Series.explode)
split_country = explode_df.apply(lambda x: ' '.join(x['Country_pairs']), axis=1).str.split(expand=True)

# whether you would like to combine the results
res = pd.concat([explode_df, split_country], axis=1)

Upvotes: 1

Related Questions