Reputation: 27
I'm struggling to formulate a regular expression to extract all the species names (group 1) and the author names (group 2) from a list. I'm fairly new to Python and would appreciate any help.
This is a part of the list:
Dalbergia acutifoliolata Mendonca & Sousa
Dalbergia adami Berhaut
Dalbergia afzeliana G.Don
Dalbergia agudeloi J.Linares & M. Sousa
Dalbergia albiflora Hutch. & Dalziel
Dalbergia altissima Baker f.
Dalbergia amazonica (Radlk.) Ducke
Dalbergia amerimmon L. ex B.D.Jacks
Dalbergia andapensis Bosser & R.Rabev.
Dalbergia arbutifolia Baker
Dalbergia arbutifolia aberrans Polhill
Dalbergia armata E.Mey.
Dalbergia assamica Benth.
Dalbergia aurea Bosser & R.Rabev.
Dalbergia baronii Baker
Dalbergia bathiei R.Vig.
Dalbergia benthamii
Dalbergia berteroi
Dalbergia pseudo-sissoo Miq.
Dalbergia ovata var. glomeriflora (Kurz) Thoth.
Dalbergia albiflora subsp. albiflora
Usually species names have a genus and a species name, and some have a subspecies name. I can catch those with:
([A-Z][a-z]*[\s]{1}[a-z]*|[A-Z][a-z]*[\s]{1}[a-z]*[\s]{1}[a-z]*)
This misses some exceptions like "Dalbergia pseudo-sissoo" because of the "-". I also don't know how to handle varieties ("var.") or cases where subspecies are indicated with "subsp.".
I cannot come up with a regex to handle the complex structure of the authors. They always start with a capital letter or "(".
Some entries do not have authors; I still want the species entries to be returned.
This is my attempt so far, but it doesn't get everything I want:
([A-Z][a-z]*[\s]{1}[a-z]*|[A-Z][a-z]*[\s]{1}[a-z]*[\s]{1}[a-z]*)\s([A-Z][a-z]*|[(][A-Z][a-z]*|[\0])
It also matches the genus name of a line as the author when the line before it has no author name.
Thank you in advance for your help!
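To make the two problems reproducible, here is a minimal sketch using the attempt above unchanged. The sample strings and the variable names are only illustrative. Because \s also matches a newline, the genus on the following line is picked up as the "author" of an entry that has none, and the hyphenated name is not matched at all.

```python
import re

# The attempt from the question, unchanged (split over two lines for readability).
attempt = re.compile(
    r"([A-Z][a-z]*[\s]{1}[a-z]*|[A-Z][a-z]*[\s]{1}[a-z]*[\s]{1}[a-z]*)"
    r"\s([A-Z][a-z]*|[(][A-Z][a-z]*|[\0])"
)

# Two entries without authors: \s consumes the newline between them,
# so the second line's genus lands in the author group.
leaked = attempt.findall("Dalbergia benthamii\nDalbergia berteroi")
print(leaked)

# The "-" is not covered by [a-z]*, so this entry yields no match.
hyphenated = attempt.findall("Dalbergia pseudo-sissoo Miq.")
print(hyphenated)
```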
Upvotes: 1
Views: 610
Reputation: 127
You can try this:
(^\w+.*?)(?:([A-Z(].*)|$)
(^\w+.*?)
Captures everything before the first occurrence of an author name (which can start with a capital letter or an opening parenthesis), or the whole line if there is no author.
(?:([A-Z(].*)|$)
Captures the author name, starting with an uppercase letter or an opening parenthesis, if it exists; otherwise it matches the end of the string.
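As a rough sketch of how this pattern could be applied in Python, one line at a time: the helper name split_entry is hypothetical, and stripping the trailing space from the species group is my own addition, since the lazy first group keeps the space that precedes the author.

```python
import re

# Pattern from this answer, applied to a single line at a time.
pattern = re.compile(r"(^\w+.*?)(?:([A-Z(].*)|$)")

def split_entry(line):
    """Hypothetical helper: return (species, author) or None if nothing matches."""
    m = pattern.match(line)
    if m is None:
        return None
    name, author = m.groups()
    # Group 1 still ends with the space before the author, so strip it.
    # Group 2 is None when the line has no author.
    return name.strip(), author

for line in ["Dalbergia amazonica (Radlk.) Ducke",
             "Dalbergia ovata var. glomeriflora (Kurz) Thoth.",
             "Dalbergia benthamii"]:
    print(split_entry(line))
```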
Upvotes: 1
Reputation: 20737
This should do it:
^(?P<species>[A-Z][^A-Z(]+)(?P<author>(?<!^).*)$
^
- assert start of line
(?P<species>[A-Z][^A-Z(]+)
- named capture group "species" must start with one capital letter and then fetch everything that is neither a capital letter nor an opening parenthesis
(?P<author>(?<!^).*)
- named capture group "author" cannot be at the start of the line and captures everything until the end of the line
$
- assert end of line
https://regex101.com/r/wzDSXE/1/
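A minimal Python sketch of the same pattern run over a whole block of text, assuming the re.MULTILINE flag so that ^ and $ work per line. The three-line sample and the variable names are only illustrative. Note that the "author" group comes back as an empty string (not None) for entries without an author, and the "species" group keeps its trailing space, so both are normalised here.

```python
import re

pattern = re.compile(
    r"^(?P<species>[A-Z][^A-Z(]+)(?P<author>(?<!^).*)$",
    re.MULTILINE,  # assumption: the whole list is searched as one string
)

text = """Dalbergia adami Berhaut
Dalbergia benthamii
Dalbergia amazonica (Radlk.) Ducke"""

rows = [
    # Strip the trailing space from the species; turn an empty author into None.
    (m.group("species").strip(), m.group("author") or None)
    for m in pattern.finditer(text)
]

for species, author in rows:
    print(species, "|", author)
```

The negated class [^A-Z(] can also consume a newline, which is why the (?<!^) lookbehind matters: it stops the author group from starting at the beginning of the next line and forces the species group to give the newline back.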
Upvotes: 4
Reputation: 12027
What about a regex like ^([A-Z][^A-Z(]+)([(A-Z].+)?$
import re
data = """Dalbergia acutifoliolata Mendonca & Sousa
Dalbergia adami Berhaut
Dalbergia afzeliana G.Don
Dalbergia agudeloi J.Linares & M. Sousa
Dalbergia albiflora Hutch. & Dalziel
Dalbergia altissima Baker f.
Dalbergia amazonica (Radlk.) Ducke
Dalbergia amerimmon L. ex B.D.Jacks
Dalbergia andapensis Bosser & R.Rabev.
Dalbergia arbutifolia Baker
Dalbergia arbutifolia aberrans Polhill
Dalbergia armata E.Mey.
Dalbergia assamica Benth.
Dalbergia aurea Bosser & R.Rabev.
Dalbergia baronii Baker
Dalbergia bathiei R.Vig.
Dalbergia benthamii
Dalbergia berteroi
Dalbergia pseudo-sissoo Miq.
Dalbergia ovata var. glomeriflora (Kurz) Thoth.
Dalbergia albiflora subsp. albiflora"""
pattern = re.compile(r"^([A-Z][^A-Z(]+)([(A-Z].+)?$")
for entry in data.splitlines():
    matcher = pattern.match(entry)
    print("Name: {0:50} Author: {1}".format(*matcher.groups()))
OUTPUT
Name: Dalbergia acutifoliolata Author: Mendonca & Sousa
Name: Dalbergia adami Author: Berhaut
Name: Dalbergia afzeliana Author: G.Don
Name: Dalbergia agudeloi Author: J.Linares & M. Sousa
Name: Dalbergia albiflora Author: Hutch. & Dalziel
Name: Dalbergia altissima Author: Baker f.
Name: Dalbergia amazonica Author: (Radlk.) Ducke
Name: Dalbergia amerimmon Author: L. ex B.D.Jacks
Name: Dalbergia andapensis Author: Bosser & R.Rabev.
Name: Dalbergia arbutifolia Author: Baker
Name: Dalbergia arbutifolia aberrans Author: Polhill
Name: Dalbergia armata Author: E.Mey.
Name: Dalbergia assamica Author: Benth.
Name: Dalbergia aurea Author: Bosser & R.Rabev.
Name: Dalbergia baronii Author: Baker
Name: Dalbergia bathiei Author: R.Vig.
Name: Dalbergia benthamii Author: None
Name: Dalbergia berteroi Author: None
Name: Dalbergia pseudo-sissoo Author: Miq.
Name: Dalbergia ovata var. glomeriflora Author: (Kurz) Thoth.
Name: Dalbergia albiflora subsp. albiflora Author: None
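If the results are needed as data rather than printed, something along these lines might work. This is only a sketch: parse_entries is a hypothetical helper, and collecting non-matching lines separately (instead of failing on a None match) is my own assumption about what is wanted. The name group keeps the space before the author, so it is stripped.

```python
import re

pattern = re.compile(r"^([A-Z][^A-Z(]+)([(A-Z].+)?$")

def parse_entries(text):
    """Hypothetical helper: return (parsed, skipped) for a block of entries."""
    parsed, skipped = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        m = pattern.match(line)
        if m is None:
            skipped.append(line)  # keep unmatched lines for inspection
            continue
        name, author = m.groups()
        parsed.append((name.rstrip(), author))  # author is None if absent
    return parsed, skipped

parsed, skipped = parse_entries(
    "Dalbergia altissima Baker f.\n\nDalbergia berteroi\nnot a species line"
)
print(parsed)
print(skipped)
```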
Upvotes: 3