Reputation: 81
I am looking at a column of strings, many of which start with either "In" or "For". I would like to remove these words, but only when they are the first word of the string.
For example:
data['description'] = 'For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology'
I would like to have:
data['description'] = 'people aged 3 and above', 'the cleaning aisle', 'introducing pioneering technology'
where the word introducing isn't impacted by the change.
I have tried variations of this:
words = ('in ','for ')
if data['qual_name_test'].str.startswith(words):
    data['qual_name_test'] = data['qual_name_test'][len(words):].lstrip()
However I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Would anyone have experience with this?
Upvotes: 0
Views: 323
Reputation: 27196
Yet another approach. Note that the value assigned to data['description'] in the example is a tuple, so:
data = {}
data['description'] = 'For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology'
def fix(t):
    r = tuple()
    for e in t:
        if (w := e.split()) and w[0] in {'For', 'for', 'In', 'in'}:
            r = *r, e[len(w[0])+1:]
        else:
            r = *r, e
    return r
data['description'] = fix(data['description'])
print(*data['description'], sep=', ')
Output:
people aged 3 and above, the cleaning aisle, introducing pioneering technology
Upvotes: 1
Reputation: 618
The most straightforward way is a simple comparison for each case:
data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
result = []
FOR = 'for '
IN = 'in '
for entry in data:
    # entry is now a string from the list
    if entry.lower().startswith(FOR):
        result.append(entry[4:])
    elif entry.lower().startswith(IN):
        result.append(entry[3:])
    else:
        result.append(entry)
A bit more sophisticated is a regular expression replacement, which does the manual work for you:
import re
data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
pattern = re.compile('(?i)^(for |in )')
result = [pattern.sub('', entry) for entry in data]
Both scripts can be tested with
print(data)
print(result)
and yield
['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
['people aged 3 and above', 'the cleaning aisle', 'introducing pioneering technology']
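If the list of words to strip is likely to grow, the pattern can be built from a sequence rather than typed by hand. The following is a minimal sketch, not part of the answer above: strip_leading_word and its words parameter are hypothetical names, and it assumes a word only counts as a match when it is followed by whitespace.

```python
import re

def strip_leading_word(text, words=("for", "in")):
    """Remove one leading word from text if it is in words (case-insensitive)."""
    # re.escape guards against words that contain regex metacharacters
    alternatives = "|".join(re.escape(w) for w in words)
    # ^ anchors at the start; \s+ requires whitespace after the word,
    # so "introducing" is left untouched
    pattern = re.compile(rf"^(?:{alternatives})\s+", flags=re.IGNORECASE)
    return pattern.sub("", text, count=1)

data = ['For people aged 3 and above', 'in the cleaning aisle', 'introducing pioneering technology']
result = [strip_leading_word(entry) for entry in data]
print(result)
```

Because the pattern is anchored with ^, only the first word is ever removed, even if "for" or "in" appears again later in the string.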
Upvotes: 4
Reputation: 77377
.replace accepts a regular expression. Match the stuff you don't want and replace it with an empty string. The ^ matches the start of the string, (?i) ignores case. Then list the words separated with |.
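# (?i) goes first: Python 3.11+ rejects inline global flags placed later in the pattern
data['description'] = data['description'].replace(r"(?i)^(for|in)\s+", "", regex=True)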
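For a self-contained check of this idea on a pandas column, a sketch along these lines should work. It uses Series.str.replace with case=False rather than an inline flag; the sample data is taken from the question and the variable names are only illustrative. It also sidesteps the ValueError from the question, because the replacement is applied to every element at once and no if statement is ever evaluated on the whole Series.

```python
import pandas as pd

data = pd.DataFrame({
    "description": [
        "For people aged 3 and above",
        "in the cleaning aisle",
        "introducing pioneering technology",
    ]
})

# ^ anchors at the start, \s+ needs whitespace after the word,
# case=False makes the match case-insensitive
data["description"] = data["description"].str.replace(
    r"^(for|in)\s+", "", case=False, regex=True
)
print(data["description"].tolist())
```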
Upvotes: 5
Reputation: 4253
Try applying a function to the column that uses the string startswith method:
import pandas as pd

lst=['For people aged 3 and for above', 'in the cleaning in aisle', 'introducing pioneering technology']
df=pd.DataFrame(lst,columns=["Phrase"])
def StripForAndIn(sentence):
    # each accepted spelling; the trailing space keeps "introducing" intact
    words=["For ","for ","In ","in "]
    for word in words:
        if sentence.startswith(word):
            # strip only the first matching prefix, then stop
            return sentence[len(word):]
    return sentence
df["Phrase2"]=df["Phrase"].apply(StripForAndIn)
print(df)
Upvotes: 1