Reputation: 1033
I have the following data:
key German
0 0:- Profile 1
1 1:- Archetype Realist*in
2 2:- RIASEC Code: R- Realistic
3 3:- Subline Deine Stärke? Du bleibst dir selber treu.
4 4:- Copy Dein Erfolg basiert auf deiner praktischen Ver...
In the "Key" column I would like to remove the numbers and colon dash which follows. This order is always the same (from the left). So for the first row I would like to remove "0:- ", and just leave "Profile 1". I am struggling to find the correct regex expression to do what I want. Originally I tried the following:
df_json['key'] = df_json['key'].map(lambda x: x.strip(':- ')[1])
However, this approach is too restrictive since there can be multiple words in the field.
I would like to use pd.Series.str.replace()
, but I cant figure out the correct regex expression to achieve the desired results. Any help would be greatly appreciated.
Upvotes: 3
Views: 1944
Reputation: 75840
What about just using pandas.Series.str.partition instead of regular expressions:
df['German'] = df['German'].str.partition()[2]
This would split the series on the 1st space only and grab the trailing part. Alternatively to partition you could also just split:
df['German'] = df['German'].str.split(' ', 1).str[1]
If regex is a must for you, maybe use a lazy quantifier to match upto the 1st space character:
df['German'] = df['German'].replace('^.*? +','', regex=True)
Where:
^
- Start line anchor..*?
- Any 0+ (lazy) characters other than newline upto; +
- 1+ literal space characters.Here is an online demo
Upvotes: 3
Reputation: 26676
Extract any non white Space \S
and Non Digits \D
which are immediately to the left of unwanted characters
df['GermanFiltered']=df['German'].str.extract("((?<=^\d\:\-\s)\S+\D+)")
Upvotes: 2
Reputation: 626802
You need
df_json['key'] = df_json['key'].str.replace(r'^\d+:-\s*', '', regex=True)
See the regex demo and the regex graph:
Details:
^
- start of string\d+
- one or more digits:
- a colon-
- a hyphen\s*
- zero or more whitespacesUpvotes: 2
Reputation: 133508
With your shown samples, please try following. Using replace
function of Pandas here. Simple explanation would be, apply replace
function of Pandas to German
column of dataframe and then use regex ^[0-9]+:-\s+
to replace values with NULL.
df['German'].replace('(^[0-9]+:-\s+)','', regex=True)
Explanation:
^[0-9]+
: match starting digits followed by colon here.:-\s+
: Match colon, followed by -
followed by 1 or more space occurrences.Upvotes: 5