ojp
ojp

Reputation: 1033

Strip characters to the left of a specific character in a pandas column

I have the following data:

key German
0   0:- Profile 1
1   1:- Archetype   Realist*in
2   2:- RIASEC Code:    R- Realistic
3   3:- Subline Deine Stärke? Du bleibst dir selber treu.
4   4:- Copy    Dein Erfolg basiert auf deiner praktischen Ver...

In the "Key" column I would like to remove the numbers and colon dash which follows. This order is always the same (from the left). So for the first row I would like to remove "0:- ", and just leave "Profile 1". I am struggling to find the correct regex expression to do what I want. Originally I tried the following:

df_json['key'] = df_json['key'].map(lambda x: x.strip(':- ')[1])

However, this approach is too restrictive since there can be multiple words in the field.

I would like to use pd.Series.str.replace(), but I cant figure out the correct regex expression to achieve the desired results. Any help would be greatly appreciated.

Upvotes: 3

Views: 1944

Answers (4)

JvdV
JvdV

Reputation: 75840

What about just using pandas.Series.str.partition instead of regular expressions:

df['German'] = df['German'].str.partition()[2]

This would split the series on the 1st space only and grab the trailing part. Alternatively to partition you could also just split:

df['German'] = df['German'].str.split(' ', 1).str[1]

If regex is a must for you, maybe use a lazy quantifier to match upto the 1st space character:

df['German'] = df['German'].replace('^.*? +','', regex=True)

Where:

  • ^ - Start line anchor.
  • .*? - Any 0+ (lazy) characters other than newline upto;
  • + - 1+ literal space characters.

Here is an online demo

Upvotes: 3

wwnde
wwnde

Reputation: 26676

Extract any non white Space \S and Non Digits \D which are immediately to the left of unwanted characters

df['GermanFiltered']=df['German'].str.extract("((?<=^\d\:\-\s)\S+\D+)")

Upvotes: 2

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626802

You need

df_json['key'] = df_json['key'].str.replace(r'^\d+:-\s*', '', regex=True)

See the regex demo and the regex graph:

enter image description here

Details:

  • ^ - start of string
  • \d+ - one or more digits
  • : - a colon
  • - - a hyphen
  • \s* - zero or more whitespaces

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133508

With your shown samples, please try following. Using replace function of Pandas here. Simple explanation would be, apply replace function of Pandas to German column of dataframe and then use regex ^[0-9]+:-\s+ to replace values with NULL.

df['German'].replace('(^[0-9]+:-\s+)','', regex=True)

Explanation:

  • ^[0-9]+: match starting digits followed by colon here.
  • :-\s+: Match colon, followed by - followed by 1 or more space occurrences.

Upvotes: 5

Related Questions