Calculating the sum of a column if it contains a string stored in another dataframe

Question

I have a large dataframe (prices) that contains a long description and a price associated to that description. I generated another dataframe (words) that keeps all the unique words that those long descriptions has. What I'm trying to do is calculate the sum of the price of a particular word from the prices dataframe and then store it in the word dataframe, in the same row that the word is.

I got the following solution:

matches = (
    prices['TEXT'].str.extractall(f'({"|".join(words["WORD"])})')
    .rename(columns={0:'WORDS'})
    .rename_axis(['index', 'match'])
)

final = (
    prices.rename_axis('index')
    .join(matches)
    .groupby('WORDS', sort=False)['PRICE'].sum()
    .reset_index(name='SUM_PRICE')
)

but the output is not right for the sum (for example, the total sum of column is 19.6, so no conditional sum should be higher, still it calculates 25.5 for the word 'VINO'):

    WORDS   MEAN_PRICE
0   VINO    25.5
1   ESPUMOSO    20.4
2   ROSE    13.3
3   GRADO   15.7
4   ALCOHOLICO  15.7

example dataframes:


prices = pd.DataFrame({'TEXT': ['VINO ESPUMOSO ROSE GRADO ALCOHOLICO 11.8 ACIDEZ VOLATIL 0.37 COSECHA 2013 EN CAJAS DE 06X750 ML SIN EMBALAR', 'VINO CON DENOMINACION DE ORIGEN ESPUMOSO SPARKLING ANGEL BRUT GRADO ALCOHOLICO 12.0 06BOTELLAS EN ENVASE DE 750 ML SIN EMBALAR', 'VINO ESPUMOSO CHARDONNAY PINOT NOIR EXTRA BR DE UVA, GR.ALC.12.80, ACIDEZ  VOL. 0.46 G/L.,CAJAS DE 6 BOLTELLAS DE 750 ML. SIN EMBALAR', 'VINO PINOT NOIR ROSE BRUT GA 12.0 AV 0.45 COSECHA 2013 CON DENOMINACION DE ORIGEN EN CAJAS CON BOTELLAS DE 6X750CC SIN EMBALAR', 'VINO ESPUMOSO ROSE GRADO ALCOHOLICO 11.8 ACIDEZ VOLATIL 0.37 COSECHA 2013 EN CAJAS DE 06X750 ML SIN EMBALAR VINO CON DENOMINACION DE ORIGEN ESPUMOSO SPARKLING ANGEL BRUT GRADO ALCOHOLICO 12.0 06BOTELLAS EN ENVASE DE 750 ML SIN EMBALAR VINO ESPUMOSO CHARDONNAY PINOT NOIR EXTRA BR DE UVA, GR.ALC.12.80, ACIDEZ  VOL. 0.46 G/L.,CAJAS DE 6 BOLTELLAS DE 750 ML. SIN EMBALAR VINO PINOT NOIR ROSE BRUT GA 12.0 AV 0.45 COSECHA 2013 CON DENOMINACION DE ORIGEN EN CAJAS CON BOTELLAS DE 6X750CC SIN EMBALAR'],
                       'PRICE': [6.33, 5.43, 2.79, 3.07, 1.96]})

words = pd.DataFrame({'WORD':['VINO', 'ESPUMOSO', 'ROSE', 'GRADO', 'ALCOHOLICO', '11.8', 'ACIDEZ', 'VOLATIL', '0.37', 'COSECHA', '2013', 'EN', 'CAJAS', 'DE', '06X750', 'ML', 'SIN', 'EMBALAR', 'CON', 'DENOMINACION', 'ORIGEN', 'SPARKLING', 'ANGEL', 'BRUT', '12.0', '06BOTELLAS', 'ENVASE', '750', 'CHARDONNAY', 'PINOT', 'NOIR', 'EXTRA', 'BR', 'UVA,', 'GR.ALC.12.80,', 'VOL.', '0.46', 'G/L.,CAJAS', '6', 'BOLTELLAS', 'ML.', 'GA', 'AV', '0.45', 'BOTELLAS', '6X750CC', ]})

Thank you very much!

Erfan · Accepted Answer

Since I answered your last question, it's quite easy for me to see the problem. The reason you get a higher sum, is because a word can occur multiple times in a sentence. So use DataFrame.drop_duplicates before GroupBy:

matches = (
    prices['TEXT'].str.extractall(f'({"|".join(words["WORD"])})')
    .rename(columns={0:'WORDS'})
    .rename_axis(['index', 'match'])
)

final = (
    prices.rename_axis('index')
    .join(matches)
    .drop_duplicates()
    .groupby('WORDS', sort=False)['PRICE'].sum()
    .reset_index(name='SUM_PRICE')
)

            WORDS  SUM_PRICE
0            VINO      19.58
1        ESPUMOSO      16.51
2            ROSE      11.36
3           GRADO      13.72
4      ALCOHOLICO      13.72
5            11.8       8.29
6          ACIDEZ      11.08
7         VOLATIL       8.29
8            0.37       8.29
9         COSECHA      11.36
10           2013      11.36
11             EN      16.79
12          CAJAS      11.36
13             DE      19.58
14         06X750       8.29
15             ML      16.51
16            SIN      19.58
17        EMBALAR      19.58
18            CON      10.46
19         ORIGEN      10.46
20      SPARKLING       7.39
21          ANGEL       7.39
22           BRUT      10.46
23           12.0      10.46
24     06BOTELLAS       7.39
25            750      13.25
26     CHARDONNAY       4.75
27          PINOT       7.82
28           NOIR       7.82
29          EXTRA       4.75
30             BR       4.75
31           UVA,       4.75
32  GR.ALC.12.80,       4.75
33           VOL.       4.75
34           0.46       4.75
35     G/L.,CAJAS       4.75
36              6       7.82
37      BOLTELLAS       4.75
38             GA       5.03
39             AV       5.03
40           0.45       5.03
41       BOTELLAS       5.03

Calculating the sum of a column if it contains a string stored in another dataframe

Answers (1)

Related Questions