Reputation: 993
I have a DataFrame which has a column named AlternateName. This contains names of different ingredients, but most of them have amounts and units before the actual name.
Alternate_Name
2 ★ Sukkerfri chokolade
3 100 g. sukkerfri 70% mørk chokolade
4 sukkerfri chokolade
5 50 g. sukkerfri 70% mørk chokolade
6 Chokoladesovs uden tilsat sukker
7 1 spsk Chokolade proteinpulver
8 1 spsk proteinpulver (chokolade)
9 1,5 spsk chokolade proteinpulver
10 spsk chokolade proteinpulver
11 stor spsk chokolade proteinpulver
12 30 g chokoladeproteinpulver
13 30 g Linus Pro proteinpulver med Kakao
14 30 g proteinpulver med Kakao fra Linus Pro*
15 45 g proteinpulver (jeg brugte chokolade/hasse...
16 50 g chokolade og banan proteinpulver (HER)
17 ,5 spsk vanilleproteinpulver
18 1 spsk proteinpulver – Vanille smag
19 1 spsk vanille proteinpulver
20 1 spsk vanille proteinpulver
21 1 stor spsk vanille proteinpulver
22 10 g vanille proteinpulver
23 spsk vanilje protein pulver
24 spsk Vanille Protein pulver
25 spsk Vanille proteinpulver
26 spsk vanilleproteinpulver (eller lidt vanilles...
27 30 g Linus Pro Proteinpulver med vanille
28 30 g vanille proteinpulver fra Linus Pro (Re...
29 30 g vanille proteinpulver
30 40 g vanilleproteinpulver
31 60 g vanille proteinpulver
I already tried this: df = df["AlternateName"].map(lambda x: x.lstrip('200 g.'))
However, I need conditions that decide which parts of these strings get trimmed, since I cannot handle each and every case manually.
So how can I teach my program to remove the numbers, units and special characters that come before each ingredient name?
Example: 200 g. sukkerfri chokolade -> sukkerfri chokolade
★ Sukkerfri chokolade -> Sukkerfri chokolade
I am not very familiar with Python, so any help (methods, tips, hints) is welcome!
Upvotes: 2
Views: 102
Reputation: 4521
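Have you tried using a regex to remove the quantities? For example (regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default):
df['Alternate_Name'].str.replace(r'^\s*(★|[0-9]*,?[0-9]{1,}\s*(g|kg|spsk|stor spsk)|spsk)\s*,*', '', regex=True)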
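It outputs the following (the dot after "g." and a "stor spsk" without a leading number are not covered by this pattern):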
Out[71]:
0 ★ Sukkerfri chokolade
1 . sukkerfri 70% mørk chokolade
2 sukkerfri chokolade
3 . sukkerfri 70% mørk chokolade
4 Chokoladesovs uden tilsat sukker
5 Chokolade proteinpulver
6 proteinpulver (chokolade)
7 chokolade proteinpulver
8 chokolade proteinpulver
9 stor spsk chokolade proteinpulver
10 chokoladeproteinpulver
11 Linus Pro proteinpulver med Kakao
12 proteinpulver med Kakao fra Linus Pro*
13 proteinpulver (jeg brugte chokolade/hasse...
14 chokolade og banan proteinpulver (HER)
15 vanilleproteinpulver
16 proteinpulver – Vanille smag
17 vanille proteinpulver
18 vanille proteinpulver
19 vanille proteinpulver
20 vanille proteinpulver
21 vanilje protein pulver
22 Vanille Protein pulver
23 Vanille proteinpulver
24 vanilleproteinpulver (eller lidt vanilles...
25 Linus Pro Proteinpulver med vanille
26 vanille proteinpulver fra Linus Pro (Re...
27 vanille proteinpulver
28 vanilleproteinpulver
29 vanille proteinpulver
Name: Alternate_Name, dtype: object
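The output above still has a few leftovers (a leading ★, a stray "." after "g", and "stor spsk" without a number). A slightly extended pattern could cover those cases. The following is only a sketch built from the sample data in the question: the names UNITS, LEADING_QUANTITY and strip_quantity are hypothetical, and the unit list is an assumption that would need extending for real data.

```python
import re

# Units seen in the question's sample data (assumption: extend as needed).
# "stor spsk" comes first so it is tried before plain "spsk".
UNITS = r"(?:stor\s+spsk|spsk|kg|g)"

# Anchored at the start of the string:
#   optional star, optional amount (100, 1,5 or ,5),
#   optional unit followed by a word boundary and an optional dot.
LEADING_QUANTITY = re.compile(
    r"^\s*(?:★\s*)?(?:\d*[.,]?\d+\s*)?(?:" + UNITS + r"\b\.?)?\s*"
)

def strip_quantity(text: str) -> str:
    """Remove a leading star, amount and unit from an ingredient string."""
    return LEADING_QUANTITY.sub("", text)

samples = [
    "★ Sukkerfri chokolade",
    "100 g. sukkerfri 70% mørk chokolade",
    "sukkerfri chokolade",
    "1,5 spsk chokolade proteinpulver",
    ",5 spsk vanilleproteinpulver",
    "stor spsk chokolade proteinpulver",
    "30 g Linus Pro proteinpulver med Kakao",
]

for before in samples:
    print(f"{before!r} -> {strip_quantity(before)!r}")
```

On a DataFrame this could be applied with df['Alternate_Name'].map(strip_quantity). The word boundary after the unit keeps words that merely start with "g" intact. One known limitation of this sketch: because the unit is optional, a name that itself begins with a number (for example "70% mørk chokolade") would lose that number.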
Upvotes: 1