Solanum tuberosum

Reputation: 27

Use regular expression to find species names and author names

I'm struggling to formulate a regular expression that extracts all the species names (group 1) and the author names (group 2) from a list. I'm fairly new to Python and would appreciate any help.

This is a part of the list:

Dalbergia acutifoliolata Mendonca & Sousa
Dalbergia adami Berhaut
Dalbergia afzeliana G.Don
Dalbergia agudeloi J.Linares & M. Sousa
Dalbergia albiflora Hutch. & Dalziel
Dalbergia altissima Baker f.
Dalbergia amazonica (Radlk.) Ducke
Dalbergia amerimmon L. ex B.D.Jacks
Dalbergia andapensis Bosser & R.Rabev.
Dalbergia arbutifolia Baker
Dalbergia arbutifolia aberrans Polhill
Dalbergia armata E.Mey.
Dalbergia assamica Benth.
Dalbergia aurea Bosser & R.Rabev.
Dalbergia baronii Baker
Dalbergia bathiei R.Vig.
Dalbergia benthamii
Dalbergia berteroi
Dalbergia pseudo-sissoo Miq.
Dalbergia ovata var. glomeriflora (Kurz) Thoth.
Dalbergia albiflora subsp. albiflora

Usually species names have a genus and a species name, and some have a subspecies name. I can catch those with:

([A-Z][a-z]*[\s]{1}[a-z]*|[A-Z][a-z]*[\s]{1}[a-z]*[\s]{1}[a-z]*)

I don't catch exceptions like "Dalbergia pseudo-sissoo" because of the "-". I also don't know how to handle varieties ("var.") or cases where subspecies are indicated with "subsp."

I cannot come up with a regex to handle the complex structure of the authors. They always start with a capital letter or "(".

Some entries do not have authors; I still want the species entries to be returned.

This is my attempt so far, but it doesn't get everything I want:

([A-Z][a-z]*[\s]{1}[a-z]*|[A-Z][a-z]*[\s]{1}[a-z]*[\s]{1}[a-z]*)\s([A-Z][a-z]*|[(][A-Z][a-z]*|[\0])

When a line has no author name, the second group matches the genus name from the following line instead.

Thank you in advance for your help!

Upvotes: 1

Views: 610

Answers (3)

Sandeep Gusain

Reputation: 127

You can try this:

(^\w+.*?)(?:([A-Z(].*)|$)
  1. (^\w+.*?)

    Captures the leading word and everything after it, up to the start of the author name if there is one. The author name can start with a capital letter or a left parenthesis.

  2. (?:([A-Z(].*)|$)

    Captures the author name, starting with an uppercase letter or a left parenthesis, if it exists; otherwise matches the end of the string.
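A minimal Python sketch of how this pattern could be applied, assuming each entry is matched on its own (one line at a time). The helper name `split_entry` is made up for illustration, and stripping the trailing space from the first group is an addition that is not part of the pattern itself.

```python
import re

# Pattern from this answer. Each entry is matched separately,
# so "^" and "$" refer to the start and end of that single entry.
ENTRY_PATTERN = re.compile(r"(^\w+.*?)(?:([A-Z(].*)|$)")


def split_entry(entry):
    """Return (species, author), or None if the entry does not match.

    The author is None when the entry has no author part.
    """
    match = ENTRY_PATTERN.match(entry)
    if match is None:
        return None
    species, author = match.groups()
    # Group 1 keeps the space that separates name and author; drop it.
    return species.strip(), author


samples = [
    "Dalbergia amazonica (Radlk.) Ducke",
    "Dalbergia benthamii",
    "Dalbergia ovata var. glomeriflora (Kurz) Thoth.",
]
for sample in samples:
    print(split_entry(sample))
```

Because the first group is lazy, it stops at the first capital letter or "(" after the genus, which is why "var." and "subsp." stay inside the species part.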


Upvotes: 1

MonkeyZeus

Reputation: 20737

This should do it:

^(?P<species>[A-Z][^A-Z(]+)(?P<author>(?<!^).*)$
  • ^ - assert start of line
  • (?P<species>[A-Z][^A-Z(]+) - named capture group "species" starts with one capital letter, then matches one or more characters that are neither a capital letter nor an opening parenthesis
  • (?P<author>(?<!^).*) - named capture group "author" cannot begin at the start of the line and captures everything up to the end of the line
  • $ - assert end of line

https://regex101.com/r/wzDSXE/1/
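In Python the named groups can be read back by name. The following is only a sketch, assuming entries are processed one line at a time; `parse_entry` is a hypothetical helper name. With this pattern the "author" group is an empty string (not None) when an entry has no author, so the sketch converts that case to None and also strips the trailing space from "species".

```python
import re

# Pattern from this answer, matched against one entry at a time.
NAMED_PATTERN = re.compile(r"^(?P<species>[A-Z][^A-Z(]+)(?P<author>(?<!^).*)$")


def parse_entry(entry):
    """Return a dict with 'species' and 'author', or None if nothing matches."""
    match = NAMED_PATTERN.match(entry)
    if match is None:
        return None
    return {
        # The species group includes the space before the author; remove it.
        "species": match.group("species").strip(),
        # An entry without an author yields "", which is mapped to None here.
        "author": match.group("author") or None,
    }


for line in ["Dalbergia afzeliana G.Don", "Dalbergia berteroi"]:
    print(parse_entry(line))
```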

Upvotes: 4

Chris Doyle

Reputation: 12027

You could use a regex like ^([A-Z][^A-Z(]+)([(A-Z].+)?$ and match each line separately:

import re


data = """Dalbergia acutifoliolata Mendonca & Sousa
Dalbergia adami Berhaut
Dalbergia afzeliana G.Don
Dalbergia agudeloi J.Linares & M. Sousa
Dalbergia albiflora Hutch. & Dalziel
Dalbergia altissima Baker f.
Dalbergia amazonica (Radlk.) Ducke
Dalbergia amerimmon L. ex B.D.Jacks
Dalbergia andapensis Bosser & R.Rabev.
Dalbergia arbutifolia Baker
Dalbergia arbutifolia aberrans Polhill
Dalbergia armata E.Mey.
Dalbergia assamica Benth.
Dalbergia aurea Bosser & R.Rabev.
Dalbergia baronii Baker
Dalbergia bathiei R.Vig.
Dalbergia benthamii
Dalbergia berteroi
Dalbergia pseudo-sissoo Miq.
Dalbergia ovata var. glomeriflora (Kurz) Thoth.
Dalbergia albiflora subsp. albiflora"""

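# Group 1: the name, from the leading capital up to the next capital or "(".
# Group 2 (optional): the author, starting with a capital letter or "(".
pattern = re.compile(r"^([A-Z][^A-Z(]+)([(A-Z].+)?$")
for entry in data.splitlines():
    matcher = pattern.match(entry)
    # Group 2 is None when the entry has no author.
    print("Name: {0:50} Author: {1}".format(*matcher.groups()))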

OUTPUT

Name: Dalbergia acutifoliolata                           Author: Mendonca & Sousa
Name: Dalbergia adami                                    Author: Berhaut
Name: Dalbergia afzeliana                                Author: G.Don
Name: Dalbergia agudeloi                                 Author: J.Linares & M. Sousa
Name: Dalbergia albiflora                                Author: Hutch. & Dalziel
Name: Dalbergia altissima                                Author: Baker f.
Name: Dalbergia amazonica                                Author: (Radlk.) Ducke
Name: Dalbergia amerimmon                                Author: L. ex B.D.Jacks
Name: Dalbergia andapensis                               Author: Bosser & R.Rabev.
Name: Dalbergia arbutifolia                              Author: Baker
Name: Dalbergia arbutifolia aberrans                     Author: Polhill
Name: Dalbergia armata                                   Author: E.Mey.
Name: Dalbergia assamica                                 Author: Benth.
Name: Dalbergia aurea                                    Author: Bosser & R.Rabev.
Name: Dalbergia baronii                                  Author: Baker
Name: Dalbergia bathiei                                  Author: R.Vig.
Name: Dalbergia benthamii                                Author: None
Name: Dalbergia berteroi                                 Author: None
Name: Dalbergia pseudo-sissoo                            Author: Miq.
Name: Dalbergia ovata var. glomeriflora                  Author: (Kurz) Thoth.
Name: Dalbergia albiflora subsp. albiflora               Author: None
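If you would rather scan the whole text in one pass instead of splitting it into lines, one possible variation is sketched below. It is an assumption on top of the pattern above, not the same regex: the character class also excludes "\n" so that a name without an author cannot run into the next line, and re.MULTILINE makes "^" and "$" match at every line. The function name `parse_names` is made up for this example.

```python
import re

# Variation of the pattern above: "\n" is excluded from the name part so a
# name with no author cannot swallow the line break and the next genus.
MULTILINE_PATTERN = re.compile(r"^([A-Z][^A-Z(\n]+)([(A-Z].+)?$", re.MULTILINE)


def parse_names(text):
    """Return a list of (name, author) tuples; author is None if missing."""
    results = []
    for match in MULTILINE_PATTERN.finditer(text):
        name, author = match.groups()
        # The name group keeps the space before the author; strip it.
        results.append((name.strip(), author))
    return results


sample = """Dalbergia benthamii
Dalbergia berteroi
Dalbergia pseudo-sissoo Miq.
Dalbergia albiflora subsp. albiflora"""

for name, author in parse_names(sample):
    print(name, "|", author)
```

Stripping the name also removes the trailing space that the first group otherwise carries whenever an author follows.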

Upvotes: 3
