ds_1234

Reputation: 81

How to remove first word in string if it matches one of two words

I am looking at a column of strings, a large number of which start with either In or For. I would like to remove the first word, but only if it is one of these two words.

For example:

data['description'] = 'For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology'

I would like to have:

data['description'] = 'people aged 3 and above', 'the cleaning aisle', 'introducing pioneering technology'

where the word introducing isn't impacted by the change.

I have tried variations of this:

words = ('in ','for ')
if data['qual_name_test'].str.startswith(words):
   data['qual_name_test'] = data['qual_name_test'][len(words):].lstrip()

However I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Would anyone have experience with this?

Upvotes: 0

Views: 323

Answers (4)

Adon Bilivit

Reputation: 27196

Yet another approach. Note that the value associated with data['description'] is a tuple, so:

data = {}

data['description'] = 'For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology'

def fix(t):
    r = tuple()
    for e in t:
        if (w := e.split()) and w[0] in {'For', 'for', 'In', 'in'}:
            r = *r, e[len(w[0])+1:]
        else:
            r = *r, e
    return r

data['description'] = fix(data['description'])

print(*data['description'], sep=', ')

Output:

people aged 3 and above, the cleaning aisle, introducing pioneering technology
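A more compact sketch of the same first-word check, assuming the entries are plain strings held in a tuple as above. The helper name drop_leading_word and the PREFIXES set are made up for illustration; str.partition splits off the first word in one step:

```python
PREFIXES = {'for', 'in'}  # hypothetical name; compared in lower case

def drop_leading_word(text, prefixes=PREFIXES):
    # Split at the first space: head is the first word, rest is what follows it.
    head, sep, rest = text.partition(' ')
    # Strip only when the whole first word matches and a space follows it.
    if sep and head.lower() in prefixes:
        return rest
    return text

description = ('For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology')
cleaned = tuple(drop_leading_word(s) for s in description)
print(*cleaned, sep=', ')
```

Because the comparison is against the whole first word, 'introducing' is left alone even though it begins with 'in'.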

Upvotes: 1

Cpt.Hook

Reputation: 618

The most straightforward way is a simple comparison for each case:

data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
result = []

FOR = 'for '
IN = 'in '

for entry in data:
    # entry is now a string from the list
    if entry.lower().startswith(FOR):
        result.append(entry[4:])
    elif entry.lower().startswith(IN):
        result.append(entry[3:])
    else:
        result.append(entry)

A bit more sophisticated is a regular expression replacement, which does the manual work for you:

import re

data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']

pattern = re.compile('(?i)^(for |in )')
result = [pattern.sub('', entry) for entry in data]

Both scripts can be tested with

print(data)
print(result)

and yield

['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
['people aged 3 and above', 'the cleaning aisle', 'introducing pioneering technology']
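If the list of words is likely to grow, the pattern can be built from a list instead of being typed by hand. This is only a sketch: build_prefix_pattern is a hypothetical helper name, and it assumes each word should be followed by at least one whitespace character.

```python
import re

def build_prefix_pattern(words):
    # Escape each word so characters such as '.' are matched literally.
    alternatives = '|'.join(re.escape(w) for w in words)
    # ^ anchors at the start of the string; \s+ requires whitespace after
    # the word, so 'introducing' is not mistaken for 'in'.
    return re.compile(rf'^(?:{alternatives})\s+', flags=re.IGNORECASE)

pattern = build_prefix_pattern(['for', 'in'])

data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
result = [pattern.sub('', entry) for entry in data]
print(result)
```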

Upvotes: 4

tdelaney

Reputation: 77377

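Series.replace accepts a regular expression when regex=True. Match the part you don't want and replace it with an empty string. (?i) ignores case and has to come first in the pattern, ^ matches the start of the string, and the words are listed separated by |. The trailing \s+ consumes the whitespace after the word. Assign the result back, since replace returns a new Series.

data['description'] = data['description'].replace(r"(?i)^(for|in)\s+", "", regex=True)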
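To see what the pattern actually picks up, it can be tried with the standard re module on the sample strings from the question. This is a quick check outside pandas rather than part of the answer itself; note that the (?i) flag is written at the very start of the pattern, because Python 3.11 and later raise an error when a global flag appears anywhere else.

```python
import re

# (?i) ignore case, ^ start of string, (for|in) either word, \s+ whitespace after it
pattern = re.compile(r"(?i)^(for|in)\s+")

samples = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']

# The text consumed by the pattern for each sample, or None when nothing matches.
matched = []
for text in samples:
    m = pattern.match(text)
    matched.append(m.group(0) if m else None)

print(matched)
```

The third entry does not match because 'in' is followed by 't' rather than whitespace, so 'introducing' is left untouched.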

Upvotes: 5

Try applying a function that uses the string startswith method to the column:

import pandas as pd

lst=['For people aged 3 and for above', 'in the cleaning in aisle', 'introducing pioneering technology']

df=pd.DataFrame(lst,columns=["Phrase"])


def StripForAndIn(sentence):
    # Trailing spaces keep words such as "introducing" from matching
    words=["For ","for ","In ","in "]
    for word in words:
        if sentence.startswith(word):
            # Strip only the first matching prefix, then stop
            return sentence[len(word):]
    return sentence

df["Phrase2"]=df["Phrase"].apply(StripForAndIn)

print(df)

Upvotes: 1
