Reputation: 676
I have a column that's a mix of names and ID numbers like
Name 1
Name 2
Name 3
CAP 1 TRE001
CAP 2 TRE002
PB 1 EHS001
...
The ID numbers always start with TRE or EHS. I want to remove those from the column so the column is
Name 1
Name 2
Name 3
CAP 1
CAP 2
PB 2
...
I know I can use str.replace with some regex expression and replace with " ". (Can probably use str.strip as well?). But I just can't seem to understand regex very well
Edit: I should have been more detailed- sometimes the names can be like
Name 1 Low
Name 2 Low
Name 3
Name 4
CAP 1 TRE001
etc
which is why if I use str.split I will sometimes cut off a portion of the Names which I don't want
Upvotes: 1
Views: 75
Reputation: 4487
Let's look at an example on a single word:
text = 'CAP 1 TRE001'
' '.join(text.split(' ')[0:2])
and you will get:
# output: 'CAP 1'
Now we apply the same code to the list containing the column names:
col_names = ['CAP 1 TRE001', 'CAP 2 TRE002', 'PB 1 EHS001']
[' '.join(col.split(' ')[0:2])for col in col_names]
and gives:
# output: ['CAP 1', 'CAP 2', 'PB 1']
Or if you have a pandas dataframe like this:
df = pd.DataFrame(['CAP 1 TRE001', 'CAP 2 TRE002', 'PB 1 EHS001'], columns=['Names'])
you have to do this:
df['Names'] = df['Names'].apply(lambda s: ' '.join(s.split(' ')[0:2]))
and gives:
Names
0 CAP 1
1 CAP 2
2 PB 1
Upvotes: 0
Reputation: 58
Try using a simple 'or' regex pattern on split. Next, pick only the first element returned, it should look something like:
series.str.split('TRE|EHS').str[0]
Upvotes: 0
Reputation: 92854
With specific regex pattern:
In [17]: df.col.str.replace(r'\s*\b(TRE|EHS).*$', '')
Out[17]:
0 Name 1
1 Name 2
2 Name 3
3 CAP 1
4 CAP 2
5 PB 1
Name: col, dtype: object
Upvotes: 1
Reputation: 323226
We using str.split
df['col']=df.col.str.split(' ',n=2).str[:2].str.join(sep=' ')
Or we using str.rsplit
df.col.str.rsplit(' ',n=1).str[0]
Upvotes: 0