Break pandas DataFrame column into multiple pieces and combine with other DataFrame

Question

I have a table of phrases and I have a table of individual words that make up these phrases. I want to break my phrases up into individual words, gather and reduce information about these individual words and add as a new column in my phrase data. Is there a smart way to do this using pandas DataFrames?

    df_multigram = pd.DataFrame([
        ["happy birthday", 23],
        ["used below", 10],
        ["frame for", 2]
    ], columns=["multigram", "frequency"])
    df_onegram = pd.DataFrame([
        ["happy", 35],
        ["birthday", 25],
        ["used", 14],
        ["below", 11],
        ["frame", 2],
        ["for", 13]
    ], columns=["onegram", "frequency"])

    ###### What do I do here????? #######

    sum_freq_onegrams = list(df_multigram["sum_freq_onegrams"])
    self.assertEqual(sum_freq_onegrams, [60, 25, 15])

Just to clarify, my desire is that sum_freq_onegrams is equal to [60, 25, 15], where 60 is the frequency of "happy" plus the frequency of "birthday".

unutbu · Accepted Answer

You could use

freq = df_onegram.set_index(['onegram'])['frequency']
sum_freq_onegrams = df_multigram['multigram'].str.split().apply(
    lambda x: pd.Series(x).map(freq).sum())

which yields

In [43]: sum_freq_onegrams
Out[43]:
0    60
1    25
2    15
Name: multigram, dtype: int64
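To get the column the question asks for, the result can be assigned straight back onto df_multigram. The following is a minimal, self-contained sketch of that step. It reuses the sample data from the question, and the lambda's argument name words is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

# Series mapping each word to its frequency.
freq = df_onegram.set_index("onegram")["frequency"]

# Split each phrase into words, look each word up, and sum per phrase.
# The result shares df_multigram's index, so it aligns on assignment.
df_multigram["sum_freq_onegrams"] = df_multigram["multigram"].str.split().apply(
    lambda words: pd.Series(words).map(freq).sum()
)

print(df_multigram["sum_freq_onegrams"].tolist())
```

With the sample data this should satisfy the assertion in the question, since each phrase's words all appear in df_onegram.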

But note that calling a (lambda) function once for every row and building a new (tiny) Series each time may be rather slow. Using a different data structure -- even plain Python lists and dicts -- may be faster. For example, if you defined the list phrases and the dict freq_dict,

phrases = df_multigram['multigram'].tolist()
freq_dict = freq.to_dict()

then the list comprehension below is roughly 280x faster than the Pandas-based method on this three-row example (3.6 µs versus 1.01 ms):

In [65]: [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
Out[65]: [60, 25, 15]

In [38]: %timeit [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
100000 loops, best of 3: 3.6 µs per loop

In [41]: %timeit df_multigram['multigram'].str.split().apply(lambda x: pd.Series(x).map(freq).sum())
1000 loops, best of 3: 1.01 ms per loop

Thus, a Pandas DataFrame might not be the right data structure for holding the phrases in this problem.
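If the data does need to stay in pandas, one possible alternative avoids building a Series per row. This sketch is not part of the timings above and has not been benchmarked here. It assumes pandas 0.25 or later, where Series.explode is available, and the intermediate name words is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

freq = df_onegram.set_index("onegram")["frequency"]

# One row per word; each word keeps the index label of its phrase.
words = df_multigram["multigram"].str.split().explode()

# Look up every word at once, count unknown words as 0,
# then sum the frequencies that share a phrase's index label.
df_multigram["sum_freq_onegrams"] = (
    words.map(freq).fillna(0).groupby(level=0).sum().astype(int)
)

print(df_multigram["sum_freq_onegrams"].tolist())
```

Like the freq_dict.get(item, 0) lookup in the list comprehension, the fillna(0) step treats words missing from df_onegram as having zero frequency.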
