Reputation: 45
I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
d_list = []
for index, row in data.iterrows():
for value in str(row['Country']).split(','):
d_list.append({'Name':row['Name'],
'value':value})
data = data.append(d_list, ignore_index=True)
data = data.groupby('Name')['value'].value_counts()
data = data.unstack(level=-1).fillna(0)
return (data)
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing import Pool
result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df])
But the processing does not stop even with a toy dataset like the above. I'm completely new to this, so would appreciate any help
Upvotes: 0
Views: 270
Reputation: 62533
multiprocessing
is probably not required here. Using pandas
vectorized methods will be sufficient to quickly produce the desired result.
pandas.DataFrame.explode
on the column of lists
ast.literal_eval
to convert them to list
type
df.countries = df.countries.apply(ast.literal_eval)
df = pd.read_csv('test.csv', converters={'countries': literal_eval})
pandas.get_dummies
to get a count of each country per name, then pandas.DataFrame.groupby
on 'name'
, and aggregate with .sum
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
name Canada China UK USA
0 Jack 0 1 1 0
1 James 1 0 0 1
2 John 0 0 1 1
Upvotes: 1