Reputation: 880
I have this pandas DataFrame:
0 Tokens
1: 'rice', 'XXX', '250g'
2: 'beer', 'XXX', '750cc'
All the tokens in a row ('rice', 'XXX' and '250g') are in the same list of strings, in a single column.
I want to remove the digits, but because they are attached to other characters inside a token (as in '250g'), I have not managed to remove them.
I have tried this code:
def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]

df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()
but it only joined the strings, and I clearly do not want to do that.
My desired output:
0 Tokens
1: 'rice', 'XXX', 'g'
2: 'beer', 'XXX', 'cc'
Upvotes: 2
Views: 645
Reputation: 7045
This is possible using pandas methods, which are vectorised and so more efficient than looping.
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
col = "Tokens"

df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)
    .groupby(level=0)
    .agg(list)
)
#             Tokens
# 0   [rice, XXX, g]
# 1  [beer, XXX, cc]
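To make the chain above easier to follow, here is a small sketch that runs the same three calls one at a time on the sample data and prints each intermediate result. The variable names exploded, cleaned and regrouped are illustrative only and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Step 1: one row per token; the original row label is repeated for each token.
exploded = df["Tokens"].explode()
print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 1]

# Step 2: strip every run of digits from each token.
cleaned = exploded.str.replace(r"\d+", "", regex=True)
print(cleaned.tolist())  # ['rice', 'XXX', 'g', 'beer', 'XXX', 'cc']

# Step 3: regroup by the repeated row label and collect the tokens into lists.
regrouped = cleaned.groupby(level=0).agg(list)
print(regrouped.tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```

The repeated index produced by explode is what lets groupby(level=0) put each token back into the row it came from.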
Here we use:
- pandas.Series.explode to convert the Series of lists into rows
- pandas.Series.str.replace to replace occurrences of \d (number 0-9) with "" (nothing)
- pandas.Series.groupby to group the Series by index (level=0) and put them back into lists (.agg(list))
Upvotes: 2
Reputation: 1319
You can use to_list + re.sub in order to update your original dataframe.
import re

for index, lst in enumerate(df['Tokens'].to_list()):
    # Strip every run of digits from each token in this row's list
    lst = [re.sub(r'\d+', '', i) for i in lst]
    df.loc[index, 'Tokens'] = lst
print(df)
Output:
Tokens
0 [rice, XXX, g]
1 [beer, XXX, cc]
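If the DataFrame's index is not the default 0..n-1, the positions produced by enumerate may not line up with the row labels that .loc expects. One possible variant of the same re.sub idea, sketched here with Series.apply and a precompiled pattern, never looks rows up by label. The name DIGITS and the shifted index are assumptions made for illustration; the sample frame is copied from the question.

```python
import re

import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Illustrative non-default index, to show the approach does not depend on labels.
df.index = [10, 20]

# Compile once; \d+ matches any run of one or more digits.
DIGITS = re.compile(r"\d+")

# Build a new list for each row instead of assigning cell by cell.
df["Tokens"] = df["Tokens"].apply(lambda lst: [DIGITS.sub("", tok) for tok in lst])

print(df)
```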
Upvotes: 0
Reputation: 2895
Here's a simple solution -
import pandas as pd

df = pd.DataFrame({'Tokens': [['rice', 'XXX', '250g'],
                              ['beer', 'XXX', '750cc']]})

def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])

def remove_digits(l):
    # Clean each token separately so the list structure is preserved
    return [remove_digits_from_string(s) for s in l]

df["Tokens"] = df.Tokens.apply(remove_digits)
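The same per-character filtering could also be sketched with str.translate, which deletes characters through a lookup table rather than a Python-level loop over each string. This is a swapped-in technique, not part of the answer above, and it assumes only the ASCII digits 0-9 need removing. DIGIT_TABLE and strip_digits are hypothetical names.

```python
import string

import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Translation table that deletes the ASCII digits 0-9
# (assumption: no other digit characters need handling).
DIGIT_TABLE = str.maketrans("", "", string.digits)


def strip_digits(tokens):
    """Return a new list with ASCII digits removed from every token."""
    return [token.translate(DIGIT_TABLE) for token in tokens]


df["Tokens"] = df["Tokens"].apply(strip_digits)
print(df["Tokens"].tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```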
Upvotes: 0