Reputation: 880

Remove digits from a list of strings in pandas column

I have this pandas dataframe

0  Tokens 
1: 'rice', 'XXX', '250g'
2: 'beer', 'XXX', '750cc'

All the tokens in a row ('rice', 'XXX' and '250g') are in the same list of strings, and all the lists are in the same column.

I want to remove the digits, but because they are attached to other characters (as in '250g'), my attempt does not remove them.

I have tried this code:

def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]

df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()

but it only joined the strings, and I clearly do not want to do that.

My desired output:

0  Tokens
1: 'rice', 'XXX', 'g'
2: 'beer', 'XXX', 'cc'

Upvotes: 2

Answers (3)

Alex

Reputation: 7075

This is possible using pandas methods, which are vectorised and so usually more efficient than looping.

import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)  # raw string avoids an invalid-escape warning
    .groupby(level=0)
    .agg(list)
)
#             Tokens
# 0   [rice, XXX, g]
# 1  [beer, XXX, cc]

Here we use:

pandas.Series.explode to convert the Series of lists into rows
pandas.Series.str.replace to replace each run of digits (\d+) with "" (nothing)
pandas.Series.groupby to group the Series by index (level=0) and put them back into lists (.agg(list))
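
To make the three steps above easier to follow, here is a minimal sketch that runs them one at a time on the same sample data. The intermediate names (exploded, cleaned, regrouped) are only for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Step 1: one row per token; the original row label is repeated in the index
exploded = df["Tokens"].explode()

# Step 2: strip every run of digits from each token
cleaned = exploded.str.replace(r"\d+", "", regex=True)

# Step 3: group by the original row label and rebuild the lists
regrouped = cleaned.groupby(level=0).agg(list)

print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 1]
print(regrouped.tolist())       # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```

Note that a token made up only of digits would come back as an empty string rather than being dropped, so you may want an extra filtering step if that matters for your data.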

Upvotes: 2

Carmoreno

Reputation: 1319

You can use to_list + re.sub in order to update your original dataframe.

import re

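# Pair each list with its actual index label, so this also works
# when the index is not the default 0..n-1
for index, lst in zip(df.index, df['Tokens'].to_list()):
    # .at sets a single cell, so the list is stored as one value
    df.at[index, 'Tokens'] = [re.sub(r'\d+', '', i) for i in lst]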

print(df)

Output:

    Tokens
0   [rice, XXX, g]
1   [beer, XXX, cc]
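
If you would rather avoid the explicit loop and the cell-by-cell assignment, the same re.sub idea can probably be written with apply. This is only a sketch on the sample data from the question; the name DIGITS is made up for the example.

```python
import re

import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Hypothetical name: compile the pattern once and reuse it for every token
DIGITS = re.compile(r"\d+")

# Rebuild each list with the digits stripped from every token
df["Tokens"] = df["Tokens"].apply(lambda lst: [DIGITS.sub("", tok) for tok in lst])

print(df["Tokens"].tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```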

Upvotes: 0

ShlomiF

Reputation: 2905

Here's a simple solution -

import pandas as pd

df = pd.DataFrame({'Tokens':[['rice', 'XXX', '250g'], 
                             ['beer', 'XXX', '750cc']]})

def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])

def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]

df["Tokens"] = df.Tokens.apply(remove_digits)
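
As far as I can tell, the reason the code in the question joined everything is that it called isdigit() on each whole token instead of on each character. A small sketch on a single list shows the difference (the variable names are just for illustration):

```python
tokens = ["rice", "XXX", "250g"]

# Attempt from the question: isdigit() is tested on whole tokens, and
# '250g'.isdigit() is False, so every token is kept and then joined
joined = ["".join([t for t in tokens if not t.isdigit()])]
print(joined)  # ['riceXXX250g']

# Per-character filter inside each token, as in the functions above
fixed = ["".join(ch for ch in t if not ch.isdigit()) for t in tokens]
print(fixed)  # ['rice', 'XXX', 'g']
```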

Upvotes: 0
