Reputation: 3
I have two Dataframes.
The first df consists of records with a list of terms classifying each record, and an 'Industry' column I'm trying to assign values for.
df1 = pd.DataFrame([['a', ['Elementary', 'Pre Schools', 'High Schools'], None], ['b', ['Museums'], None], ['c', ['Junior High Schools', 'Military - Misc.', 'Roads', 'Sewers and Water Mains'], None]], columns=['Record Name', 'Terms', 'Industry'])
df1 looks like
Record Name Terms Industry
0 a [Elementary, Pre Schools, High Schools] None
1 b [Museums] None
2 c [Junior High Schools, Military - Misc., Roads, Sewers and Water Mains] None
The second df provides lists of terms and their corresponding 'Industry'.
df2 = pd.DataFrame([['Arts', ['Libraries', 'Museums', 'Auditoriums', 'Entertainment']], ['Education', ['Elementary', 'Pre Schools', 'College', 'University', 'Junior High Schools', 'High Schools']]])
d2 looks like
Industry Terms
0 Arts [Libraries, Museums, Auditoriums, Entertainment]
1 Education [Elementary, Pre Schools, College, University, Junior High Schools, High Schools]
df2 contains 10 different Industries with about 100 terms spread across them. Not every term present in df1 is used to categorize an Industry in df2.
How would I assign 'Industry' values in df1 based on any common 'Term' list elements between df1 and df2? Here's the desired output:
Record Name Terms Industry
0 a [Elementary, Pre Schools, High Schools] Education
1 b [Museums] Arts
2 c [Junior High Schools, Military - Misc., Roads, Sewers and Water Mains] Education
I've tried just using a series of lists for each Industry and its respective terms, so that I can assign using .isin()
df1.loc[df1['Terms'].isin(Education), 'Industry'] = 'Education'
But this only assigns the 'Industry' for records that have all their 'Terms' present in the Education list. Also, since there's 10 different industries, defining lists and creating individual assignments for each one would be cumbersome. I believe assessing common 'Terms' between two dataframes would be a cleaner approach.
Upvotes: 0
Views: 75
Reputation: 1075
I find it difficult to work with lists in dataframes so I first explode the lists, then merge the dataframes and put the lists back together.
#This just helps me keep the columns straight
df2.columns = ['Industry', 'Terms']
# Drop Industry in df1 and explode the lists in df1 & df2
df1 = df1.drop('Industry', axis=1).explode('Terms')
df2 = df2.explode('Terms')
# Merge the terms back into df1
df1 = df1.merge(df2, how='left', on='Terms')
# Bring the lists back together
df1 = df1.groupby('Record Name').agg(lambda x: x.dropna().unique().tolist())
The result is:
Terms Industry
Record Name
a [Elementary, Pre Schools, High Schools] [Education]
b [Museums] [Arts]
c [Junior High Schools, Military - Misc., Roads,... [Education]
Upvotes: 1