Reputation: 2296
I have the following text: my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."
I would like to get only the last names, i.e. Volberda, Van Den Bosch, Mihalache.
I tried something like this:
import re
lastnames = re.sub(', [^>]+;', '', my_text)
but I got
Volberda Mihalache, Oli R.
Would appreciate any help
Upvotes: 0
Views: 52
Reputation: 163477
In your pattern, [^>]+ matches any character except >
You might instead match any character except ; or , and use a positive lookbehind (?<=,) to keep the comma in the output.
(?<=,) [^;,]+(?:;|$)
(?<=,) Positive lookbehind, assert a , on the left, then match a space
[^;,]+ Match 1 or more times any character except ; or ,
(?:;|$) Match either ; or assert the end of the string
import re
my_text= "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."
lastnames = re.sub(r'(?<=,) [^;,]+(?:;|$)', '', my_text)
print(lastnames)
Output
Volberda, Van Den Bosch, Mihalache,
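To see why the original attempt removed too much, it may help to print what its pattern actually matches. The sketch below also shows one possible variant that captures the last names with re.findall instead of deleting everything else; the findall pattern is an assumption of this example, not part of the pattern above.

```python
import re

my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."

# [^>]+ is greedy and is allowed to cross ';' and ',', so a single match
# runs from the first ', ' up to the last ';' in the string.
greedy = re.search(r', [^>]+;', my_text)
print(greedy.group())
# , Henk W.; Van Den Bosch, Frans A.J.;

# Illustrative alternative: capture each last name directly.
# (?:^|;\s*) anchors at the start of the string or right after a ';',
# ([^;,]+) captures the name, and the trailing ',' must follow it.
last_names = re.findall(r'(?:^|;\s*)([^;,]+),', my_text)
print(last_names)
# ['Volberda', 'Van Den Bosch', 'Mihalache']
```

The findall variant returns a list without the trailing comma, which can be joined with ', '.join(last_names) if a single string is needed.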
Upvotes: 1
Reputation: 88275
Looks like string methods should suffice here:
[i.split(',')[0].strip() for i in my_text.split(';')]
# ['Volberda', 'Van Den Bosch', 'Mihalache']
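As a minimal sketch, the same idea can be wrapped in a small helper. The function name extract_last_names and the skipping of empty segments (for example after a trailing ';') are assumptions added for illustration.

```python
def extract_last_names(text):
    """Return the part before the first comma of each ';'-separated entry."""
    return [
        entry.split(',')[0].strip()
        for entry in text.split(';')
        if entry.strip()  # ignore empty segments, e.g. after a trailing ';'
    ]


my_text = "Volberda, Henk W.; Van Den Bosch, Frans A.J.; Mihalache, Oli R."
names = extract_last_names(my_text)
print(names)
# ['Volberda', 'Van Den Bosch', 'Mihalache']
print(', '.join(names))
# Volberda, Van Den Bosch, Mihalache
```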
Upvotes: 2