Reputation: 159
I have a large dataframe (prices) that contains a long description and a price associated to that description. I generated another dataframe (words) that keeps all the unique words that those long descriptions has. What I'm trying to do is calculate the sum of the price of a particular word from the prices dataframe and then store it in the word dataframe, in the same row that the word is.
I got the following solution:
matches = (
prices['TEXT'].str.extractall(f'({"|".join(words["WORD"])})')
.rename(columns={0:'WORDS'})
.rename_axis(['index', 'match'])
)
final = (
prices.rename_axis('index')
.join(matches)
.groupby('WORDS', sort=False)['PRICE'].sum()
.reset_index(name='SUM_PRICE')
)
but the output is not right for the sum (for example, the total sum of column is 19.6, so no conditional sum should be higher, still it calculates 25.5 for the word 'VINO'):
WORDS MEAN_PRICE
0 VINO 25.5
1 ESPUMOSO 20.4
2 ROSE 13.3
3 GRADO 15.7
4 ALCOHOLICO 15.7
example dataframes:
prices = pd.DataFrame({'TEXT': ['VINO ESPUMOSO ROSE GRADO ALCOHOLICO 11.8 ACIDEZ VOLATIL 0.37 COSECHA 2013 EN CAJAS DE 06X750 ML SIN EMBALAR', 'VINO CON DENOMINACION DE ORIGEN ESPUMOSO SPARKLING ANGEL BRUT GRADO ALCOHOLICO 12.0 06BOTELLAS EN ENVASE DE 750 ML SIN EMBALAR', 'VINO ESPUMOSO CHARDONNAY PINOT NOIR EXTRA BR DE UVA, GR.ALC.12.80, ACIDEZ VOL. 0.46 G/L.,CAJAS DE 6 BOLTELLAS DE 750 ML. SIN EMBALAR', 'VINO PINOT NOIR ROSE BRUT GA 12.0 AV 0.45 COSECHA 2013 CON DENOMINACION DE ORIGEN EN CAJAS CON BOTELLAS DE 6X750CC SIN EMBALAR', 'VINO ESPUMOSO ROSE GRADO ALCOHOLICO 11.8 ACIDEZ VOLATIL 0.37 COSECHA 2013 EN CAJAS DE 06X750 ML SIN EMBALAR VINO CON DENOMINACION DE ORIGEN ESPUMOSO SPARKLING ANGEL BRUT GRADO ALCOHOLICO 12.0 06BOTELLAS EN ENVASE DE 750 ML SIN EMBALAR VINO ESPUMOSO CHARDONNAY PINOT NOIR EXTRA BR DE UVA, GR.ALC.12.80, ACIDEZ VOL. 0.46 G/L.,CAJAS DE 6 BOLTELLAS DE 750 ML. SIN EMBALAR VINO PINOT NOIR ROSE BRUT GA 12.0 AV 0.45 COSECHA 2013 CON DENOMINACION DE ORIGEN EN CAJAS CON BOTELLAS DE 6X750CC SIN EMBALAR'],
'PRICE': [6.33, 5.43, 2.79, 3.07, 1.96]})
words = pd.DataFrame({'WORD':['VINO', 'ESPUMOSO', 'ROSE', 'GRADO', 'ALCOHOLICO', '11.8', 'ACIDEZ', 'VOLATIL', '0.37', 'COSECHA', '2013', 'EN', 'CAJAS', 'DE', '06X750', 'ML', 'SIN', 'EMBALAR', 'CON', 'DENOMINACION', 'ORIGEN', 'SPARKLING', 'ANGEL', 'BRUT', '12.0', '06BOTELLAS', 'ENVASE', '750', 'CHARDONNAY', 'PINOT', 'NOIR', 'EXTRA', 'BR', 'UVA,', 'GR.ALC.12.80,', 'VOL.', '0.46', 'G/L.,CAJAS', '6', 'BOLTELLAS', 'ML.', 'GA', 'AV', '0.45', 'BOTELLAS', '6X750CC', ]})
Thank you very much!
Upvotes: 3
Views: 445
Reputation: 42916
Since I answered your last question, it's quite easy for me to see the problem. The reason you get a higher sum, is because a word can occur multiple times in a sentence. So use DataFrame.drop_duplicates
before GroupBy
:
matches = (
prices['TEXT'].str.extractall(f'({"|".join(words["WORD"])})')
.rename(columns={0:'WORDS'})
.rename_axis(['index', 'match'])
)
final = (
prices.rename_axis('index')
.join(matches)
.drop_duplicates()
.groupby('WORDS', sort=False)['PRICE'].sum()
.reset_index(name='SUM_PRICE')
)
WORDS SUM_PRICE
0 VINO 19.58
1 ESPUMOSO 16.51
2 ROSE 11.36
3 GRADO 13.72
4 ALCOHOLICO 13.72
5 11.8 8.29
6 ACIDEZ 11.08
7 VOLATIL 8.29
8 0.37 8.29
9 COSECHA 11.36
10 2013 11.36
11 EN 16.79
12 CAJAS 11.36
13 DE 19.58
14 06X750 8.29
15 ML 16.51
16 SIN 19.58
17 EMBALAR 19.58
18 CON 10.46
19 ORIGEN 10.46
20 SPARKLING 7.39
21 ANGEL 7.39
22 BRUT 10.46
23 12.0 10.46
24 06BOTELLAS 7.39
25 750 13.25
26 CHARDONNAY 4.75
27 PINOT 7.82
28 NOIR 7.82
29 EXTRA 4.75
30 BR 4.75
31 UVA, 4.75
32 GR.ALC.12.80, 4.75
33 VOL. 4.75
34 0.46 4.75
35 G/L.,CAJAS 4.75
36 6 7.82
37 BOLTELLAS 4.75
38 GA 5.03
39 AV 5.03
40 0.45 5.03
41 BOTELLAS 5.03
Upvotes: 1