Reputation: 676

Use regex to remove sub string from column that starts with certain characters

I have a column that's a mix of names and ID numbers like

Name 1
Name 2
Name 3 
CAP 1 TRE001 
CAP 2 TRE002 
PB 1  EHS001 
...

The ID numbers always start with TRE or EHS. I want to remove those from the column so the column is

Name 1
Name 2
Name 3 
CAP 1
CAP 2
PB 2
...

I know I can use str.replace with some regex expression and replace with " ". (Can probably use str.strip as well?). But I just can't seem to understand regex very well

Edit: I should have been more detailed- sometimes the names can be like

Name 1 Low
Name 2 Low 
Name 3 
Name 4
CAP 1 TRE001
etc

which is why if I use str.split I will sometimes cut off a portion of the Names which I don't want

Upvotes: 1

Answers (4)

Massifox

Reputation: 4487

Let's look at an example on a single word:

text = 'CAP 1 TRE001'
' '.join(text.split(' ')[0:2])

and you will get:

# output: 'CAP 1'

Now we apply the same code to the list containing the column names:

col_names = ['CAP 1 TRE001', 'CAP 2 TRE002', 'PB 1  EHS001']
[' '.join(col.split(' ')[0:2])for col in col_names]

and gives:

# output: ['CAP 1', 'CAP 2', 'PB 1']

Or if you have a pandas dataframe like this:

df = pd.DataFrame(['CAP 1 TRE001', 'CAP 2 TRE002', 'PB 1  EHS001'], columns=['Names'])

you have to do this:

df['Names'] = df['Names'].apply(lambda s: ' '.join(s.split(' ')[0:2]))

and gives:

   Names
0  CAP 1
1  CAP 2
2   PB 1

Upvotes: 0

Hadas Arik

Reputation: 58

Try using a simple 'or' regex pattern on split. Next, pick only the first element returned, it should look something like:

series.str.split('TRE|EHS').str[0]

Upvotes: 0

RomanPerekhrest

Reputation: 92854

With specific regex pattern:

In [17]: df.col.str.replace(r'\s*\b(TRE|EHS).*$', '')
Out[17]: 
0    Name 1
1    Name 2
2    Name 3
3     CAP 1
4     CAP 2
5      PB 1
Name: col, dtype: object

Upvotes: 1

BENY

Reputation: 323226

We using str.split

df['col']=df.col.str.split(' ',n=2).str[:2].str.join(sep=' ')

Or we using str.rsplit

df.col.str.rsplit(' ',n=1).str[0]

Upvotes: 0

Use regex to remove sub string from column that starts with certain characters

Answers (4)

Related Questions