Reputation: 8054
I have a table of phrases and a table of the individual words that make up these phrases. I want to split each phrase into its words, look up and aggregate information about those words, and add the result as a new column in my phrase data. Is there a smart way to do this with pandas DataFrames?
df_multigram = pd.DataFrame([
["happy birthday", 23],
["used below", 10],
["frame for", 2]
], columns=["multigram", "frequency"])
df_onegram = pd.DataFrame([
["happy", 35],
["birthday", 25],
["used", 14],
["below", 11],
["frame", 2],
["for", 13]
], columns=["onegram", "frequency"])
###### What do I do here????? #######
sum_freq_onegrams = list(df_multigram["sum_freq_onegrams"])
self.assertEqual(sum_freq_onegrams, [60, 25, 15])
Just to clarify, my desire is that sum_freq_onegrams is equal to [60, 25, 15], where 60 is the frequency of "happy" plus the frequency of "birthday".
Upvotes: 0
Views: 300
Reputation: 879471
You could use
freq = df_onegram.set_index(['onegram'])['frequency']
sum_freq_onegrams = df_multigram['multigram'].str.split().apply(
    lambda x: pd.Series(x).map(freq).sum())
which yields
In [43]: sum_freq_onegrams
Out[43]:
0    60
1    25
2    15
Name: multigram, dtype: int64
But note that calling a (lambda) function once for every row and building a new (tiny) Series each time may be rather slow. A different data structure, even plain Python lists and dicts, may be faster. For example, if you define the list phrases and the dict freq_dict,
phrases = df_multigram['multigram'].tolist()
freq_dict = freq.to_dict()
then the list comprehension (below) is about 280x faster than the Pandas-based method on this data:
In [65]: [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
Out[65]: [60, 25, 15]
In [38]: %timeit [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
100000 loops, best of 3: 3.6 µs per loop
In [41]: %timeit df_multigram['multigram'].str.split().apply(lambda x: pd.Series(x).map(freq).sum())
1000 loops, best of 3: 1.01 ms per loop
Thus, a Pandas DataFrame might not be the right data structure for holding the phrases in this problem.
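For reference, here is the dict-based approach as a small self-contained sketch. The helper name sum_onegram_freqs is hypothetical, the data is copied from the question, and words missing from the dict are assumed to count as 0.

```python
# Frequencies copied from the question's df_onegram table.
freq_dict = {
    "happy": 35, "birthday": 25,
    "used": 14, "below": 11,
    "frame": 2, "for": 13,
}
phrases = ["happy birthday", "used below", "frame for"]


def sum_onegram_freqs(phrases, freq_dict):
    """Hypothetical helper: sum the word frequencies of each phrase.

    Words that are not in freq_dict contribute 0 (an assumption).
    """
    return [
        sum(freq_dict.get(word, 0) for word in phrase.split())
        for phrase in phrases
    ]


sum_freq_onegrams = sum_onegram_freqs(phrases, freq_dict)
print(sum_freq_onegrams)  # [60, 25, 15]
```

If the result is still wanted in the DataFrame, the list is in the same order as the rows, so it can be assigned back with df_multigram['sum_freq_onegrams'] = sum_freq_onegrams.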
Upvotes: 3
Reputation: 394031
There is probably a better way to do this, but this works:
In [131]:
def func(x):
    total = 0
    for w in x.split():
        # Only add the frequency if the word exists in df_onegram
        if len(df_onegram[df_onegram['onegram'] == w]) > 0:
            total += df_onegram[df_onegram['onegram'] == w]['frequency'].values[0]
    return total

df_multigram['total_freq'] = df_multigram['multigram'].apply(func)
df_multigram
Out[131]:
        multigram  frequency  total_freq
0  happy birthday         23          60
1      used below         10          25
2       frame for          2          15
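One possible alternative that avoids scanning df_onegram for every word is to put each word on its own row, map the words to their frequencies, and sum per original row. This is only a sketch: it assumes pandas 0.25 or later (for Series.explode) and treats words missing from df_onegram as frequency 0. The column name sum_freq_onegrams is taken from the question.

```python
import pandas as pd

df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

# Lookup Series: word -> frequency.
freq = df_onegram.set_index("onegram")["frequency"]

# One row per word; the original row label is repeated for each word.
words = df_multigram["multigram"].str.split().explode()

# Map words to frequencies (unknown words become 0, an assumption),
# then sum the values that share the same original row label.
df_multigram["sum_freq_onegrams"] = (
    words.map(freq).fillna(0).groupby(level=0).sum().astype(int)
)

print(df_multigram["sum_freq_onegrams"].tolist())  # [60, 25, 15]
```

Whether this is faster than the row-wise apply depends on the data size; for a three-row table the difference is unlikely to matter.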
Upvotes: 1