StuckInPhDNoMore

Reputation: 2689

How to remove all occurrences of a substring from the end of a column in a dataframe?

I have a dataframe with a column containing hundreds of rows of strings such as:

nh sh sl hhlh lsl s h h lhlll hh l sh hl sl l shhllh sl h shhl hhl ll s s lhhlh lhl sl s sh l shhlll h hl hhl sllh ll s hh sl hhlh sl s sl l hl hhl lhhllh sl nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll
n s s s s s s s s s h s sl sl s s sh sl s nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll
nhlhh n sh sll hh shl lhh s s hh sl hl hhlh lhhl sl lh s slhllhs lh s sh sl h shhl sl sl hhl h sh slsll hh lhh hlll hhl ll hhs s s sll hs lh hsl hll h s sl hh s s lhhlll lhl hl hhs hhhlll hhl hl hhs hlllh hs sh sl hll hh shhlh ll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll nsshll

I want to remove the nsshll that is appended to every row. For example, the above three rows would become:

nh sh sl hhlh lsl s h h lhlll hh l sh hl sl l shhllh sl h shhl hhl ll s s lhhlh lhl sl s sh l shhlll h hl hhl sllh ll s hh sl hhlh sl s sl l hl hhl lhhllh sl
n s s s s s s s s s h s sl sl s s sh sl s
nhlhh n sh sll hh shl lhh s s hh sl hl hhlh lhhl sl lh s slhllhs lh s sh sl h shhl sl sl hhl h sh slsll hh lhh hlll hhl ll hhs s s sll hs lh hsl hll h s sl hh s s lhhlll lhl hl hhs hhhlll hhl hl hhs hlllh hs sh sl hll hh shhlh ll

I've tried to remove them using rstrip

nhl_pred['nhl-predicted'] = nhl_pred['nhl-predicted'].str.rstrip(' nsshll')

but that clears out the entire string and returns an empty column.

I then tried with regex

nhl_pred['nhl-predicted'] = nhl_pred['nhl-predicted'].str.replace(r' nsshll$', '')

But this either does nothing or removes only the very last substring while leaving the rest.

How would I achieve my desired result?

Thanks

Upvotes: 1

Views: 47

Answers (1)

Patrick Artner

Reputation: 51683

When using str.rstrip(' nsshll') you provide a set of characters to remove, not a substring. Every character in your rows (space, n, s, h, l) is in that set, which is why all your content gets deleted.
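To illustrate the character-set behaviour, here is a minimal sketch using plain Python strings (the pandas .str.rstrip accessor follows the same rule). The sample strings are shortened, made-up stand-ins for the rows in the question.

```python
# str.rstrip treats its argument as a SET of characters, not as a suffix.
chars = " nsshll"

# Hypothetical shortened row: it only contains ' ', 'n', 's', 'h', 'l',
# so every character is strippable and nothing survives.
row = "n s sh sl s nsshll nsshll"
stripped = row.rstrip(chars)
print(repr(stripped))

# Stripping stops at the first character (from the right) not in the set.
other = "abc nsshll".rstrip(chars)
print(repr(other))
```

The first call returns an empty string because stripping never meets a character outside the set; the second stops at 'c'.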

You can use a regex and apply the + quantifier (1 or more occurrences) to your pattern. Put the pattern into a non-capturing group (?: ...) so that + applies to the whole pattern and not just to the last 'l':

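nhl_pred['nhl-predicted'] = nhl_pred['nhl-predicted'].str.replace(r'(?: nsshll)+$', '', regex=True)

Pass regex=True explicitly: since pandas 2.0 the default is regex=False, which treats the pattern as a literal string and would leave the column unchanged.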
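As a quick check of the pattern itself, here is a small sketch with the standard re module, assuming shortened, made-up rows in place of the real column. It compares the grouped pattern with an ungrouped one, where + only binds to the final 'l'.

```python
import re

# Grouped: '+' repeats the whole ' nsshll' token, anchored at the end.
grouped = re.compile(r"(?: nsshll)+$")
# Ungrouped: '+' only repeats the final 'l', so one token at most is removed.
ungrouped = re.compile(r" nsshll+$")

# Hypothetical shortened rows; the last one has no padding at all.
rows = [
    "n s s h s sl nsshll nsshll nsshll",
    "nh sh sl hhlh sl nsshll",
    "hh sl",
]

cleaned = [grouped.sub("", r) for r in rows]
partly_cleaned = [ungrouped.sub("", r) for r in rows]

for before, after in zip(rows, cleaned):
    print(repr(before), "->", repr(after))
```

With the grouped pattern every trailing repetition is removed in one pass, and rows without the padding are left untouched.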

Upvotes: 2
