edyvedy13

Reputation: 2296

Regular expression to remove elements between a comma and a semicolon while preserving order

I have the following text: my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R." I would like to get only the last names, i.e. Volberda, Van Den Bosch, Mihalache. I tried something like this:

import re
lastnames = re.sub(', [^>]+;', '', my_text)

but I got

Volberda Mihalache, Oli R.

Would appreciate any help

Upvotes: 0

Views: 52

Answers (2)

The fourth bird

Reputation: 163477

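In your pattern, [^>]+ matches any character except >. Since the text contains no >, the greedy match runs from the first comma all the way to the last semicolon.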
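To see what the original pattern actually consumes, a quick check with re.search can help. This is only an illustrative sketch; the variable names m and matched are not from the question.

```python
import re

my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."

# [^>]+ is greedy and the text has no '>', so it runs to the end of the
# string and then backtracks to the last ';' it can find.
m = re.search(', [^>]+;', my_text)
matched = m.group(0)
print(matched)
```

Everything printed here is what re.sub removes in one go, which is why only the first and last entries survive.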

You might instead match any character except ; or , and use a positive lookbehind (?<=,) to keep the comma in the output.

(?<=,) [^;,]+(?:;|$)
  • (?<=,) Positive lookbehind, assert a , on the left and match a space
  • [^;,]+ Match 1 or more times any char except ; or ,
  • (?:;|$) Match either ; or assert end of string


import re
my_text= "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."
lastnames = re.sub(r'(?<=,) [^;,]+(?:;|$)', '', my_text)
print(lastnames)

Output

Volberda, Van Den Bosch, Mihalache,
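Note that the result keeps a trailing comma. If that is not wanted, one possible alternative is to capture the last names with re.findall instead of deleting the first names. This is a sketch under the assumption that every entry has the form "Last, First" and entries are separated by semicolons; the pattern and the name joined are illustrative, not part of the original answer.

```python
import re

my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."

# Start at the beginning of the string or right after a ';' (plus optional
# whitespace), then capture everything up to the next comma.
lastnames = re.findall(r'(?:^|;\s*)([^;,]+),', my_text)

# Join the captured names back into a single string without a trailing comma.
joined = ", ".join(lastnames)
print(joined)
```

This gives a list of names that can be joined or processed further, rather than a string that still needs cleaning up.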

Upvotes: 1

yatu

Reputation: 88275

Looks like string methods should suffice here:

[i.split(',')[0].strip() for i in my_text.split(';')]
# ['Volberda', 'Van Den Bosch', 'Mihalache']
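If the input might contain empty segments or entries without a comma, the same idea could be wrapped in a small helper. This is a minimal sketch; the function name last_names and the choice to skip empty entries are assumptions, not something the question specifies.

```python
def last_names(authors):
    """Return the text before the first comma of each ';'-separated entry."""
    names = []
    for entry in authors.split(';'):
        # partition never raises: with no comma, the whole entry is the "last name"
        last, _, _ = entry.partition(',')
        last = last.strip()
        if last:  # skip empty segments such as "a; ; b"
            names.append(last)
    return names


my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."
print(last_names(my_text))
```

Using str.partition instead of split(',')[0] behaves the same on well-formed input and keeps the handling of comma-less entries explicit.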

Upvotes: 2
