Tomasz Przemski
Tomasz Przemski

Reputation: 1127

Removing stand-alone numbers from string in Python

There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:

   col_1
0  BULKA TARTA 500G KAJO 1
1  CUKIER KRYSZTAL 1KG KSC 4
2  KASZA JĘCZMIENNA 4*100G 2 0.92
3  LEWIATAN MAKARON WSTĄŻKA 1 0.89

However, I want to achieve the effect:

   col_1
0  BULKA TARTA 500G KAJO
1  CUKIER KRYSZTAL 1KG KSC
2  KASZA JĘCZMIENNA 4*100G
3  LEWIATAN MAKARON WSTĄŻKA

So I want to remove the independent natural and decimal numbers, but leave the numbers in the string with the letters.

I tried to use df.col_1.str.isdigit().replace([True, False],[np.nan, df.col_1]) , but it only works on comparing the entire cell whether it is a number or not.

You have some ideas how to do it? Or maybe it would be good to break the column with spaces and then compare?

Upvotes: 0

Views: 927

Answers (3)

Anton vBR
Anton vBR

Reputation: 18906

We could create a function that tries to convert to float. If it fails we return True (not_float)

import pandas as pd

df = pd.DataFrame({"col_1" : ["BULKA TARTA 500G KAJO 1",
                              "CUKIER KRYSZTAL 1KG KSC 4",
                              "KASZA JĘCZMIENNA 4*100G 2 0.92",
                              "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError:  # String is not a number
        return True

df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])

df

Or following the example of my fellow SO:ers. However this would treat 130. as a number.

df["col_1"] = (df["col_1"].apply(
    lambda x: [i for i in x.split(" ") if not i.replace(".","").isnumeric()]))

Returns

                          col_1
0    [BULKA, TARTA, 500G, KAJO]
1  [CUKIER, KRYSZTAL, 1KG, KSC]
2   [KASZA, JĘCZMIENNA, 4*100G]
3  [LEWIATAN, MAKARON, WSTĄŻKA]

Upvotes: 1

Johny Vaknin
Johny Vaknin

Reputation: 287

Yes you can

def no_nums(col):
    return ' '.join(filter(lambda word:word.replace('.','').isdigit()==False, col.split()))
df.col_1.apply(no_nums)

This filters out words from each value which are completely made of digits,
And maybe contains a decimal point.
If you want to filter out numbers like 1,000, simply add another replace for ','

Upvotes: 0

alexisdevarennes
alexisdevarennes

Reputation: 5632

Sure,

You could use a regex.

import re
df.col_1 = re.sub("\d+\.?\d+?", "",  df.col_1)

Upvotes: 1

Related Questions