Robert Alexander

Reputation: 1161

How to identify company names in laws via regex

Italian laws are officially published in the Gazzetta Ufficiale, and I am trying to identify company names in them with the following regex:

azienda|societa'\s+([\w\s-]+) ha

which matches fragments such as the following reasonably well:

Vista la domanda presentata in data 26 febbraio 2021 con  la  quale
la societa' Orpha-Devel Handels Und  Vertriebs  GMBH  ha  chiesto  la
riclassificazione dalla classe C(nn) alla  classe  H  del  medicinale
«Tresuvi» (treprostinil) relativamente alle confezioni aventi  A.I.C.
n. 049207032, 049207044, 049207018 e 049207020;

returning the string "Orpha-Devel Handels Und Vertriebs GMBH " in the matching group. To make this "perfect", I just want the trailing blanks (usually one or two) to be excluded from the returned matching group.

Upvotes: 1

Views: 51

Answers (1)

Wiktor Stribiżew

Reputation: 626806

You can use

(?:azienda|societa)'\s+(\w+(?:[\s-]+\w+)*)\s+ha
(?:azienda|societa)'\s+(.*?)\s+ha

See the regex demo #1 and regex demo #2.

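Note that you should group azienda and societa. Otherwise, the alternation splits the whole pattern in two: the capturing group belongs only to the societa branch, so a match on azienda returns just the word azienda and leaves Group 1 empty.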
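The effect of the missing group can be checked with a short Python sketch. The sample sentence and the company name "Acme Spa" are made up for illustration, and the grouped pattern keeps the apostrophe inside the societa' branch (as in the question's original regex), which is an assumption about the intended spelling rather than the exact pattern given above.

```python
import re

# Hypothetical sentence, not taken from the Gazzetta Ufficiale.
sample = "l'azienda Acme Spa ha chiesto"

# The question's pattern: "|" splits it into "azienda" OR "societa'\s+(...) ha".
ungrouped = re.compile(r"azienda|societa'\s+([\w\s-]+) ha")

# Alternation wrapped in a non-capturing group; the apostrophe stays with
# societa' only (assumption: azienda is written without one).
grouped = re.compile(r"(?:azienda|societa')\s+(\w+(?:[\s-]+\w+)*)\s+ha")

m_ungrouped = ungrouped.search(sample)
m_grouped = grouped.search(sample)

ungrouped_match = m_ungrouped.group(0)  # only the bare keyword is matched
ungrouped_name = m_ungrouped.group(1)   # Group 1 did not take part in the match
grouped_name = m_grouped.group(1)

print(repr(ungrouped_match), repr(ungrouped_name), repr(grouped_name))
```

With the ungrouped pattern the first alternative matches on its own, so Group 1 is `None`; with the non-capturing group, the name is captured for either keyword.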

Details:

  • (?:azienda|societa) - either azienda or societa
  • ' - a ' char
  • \s+ - one or more whitespaces
  • (\w+(?:[\s-]+\w+)*) - Group 1 (first pattern): one or more word chars and then zero or more repetitions of one or more whitespace/hyphen chars followed by one or more word chars, so the group always ends on a word char
  • (.*?) - Group 1 (second pattern): any zero or more chars other than line break chars, as few as possible
  • \s+ - one or more whitespaces, matched outside Group 1 so the trailing blanks are not captured
  • ha - the literal string ha.
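As a rough check, both patterns can be run against the fragment from the question using Python's `re` module. This is only a sketch: the helper name `company_name` is hypothetical, and collapsing the runs of padding blanks inside the name is an extra step that the answer does not ask for.

```python
import re

# Fragment quoted in the question; the source pads words with extra blanks.
text = (
    "Vista la domanda presentata in data 26 febbraio 2021 con  la  quale\n"
    "la societa' Orpha-Devel Handels Und  Vertriebs  GMBH  ha  chiesto  la\n"
    "riclassificazione dalla classe C(nn) alla  classe  H  del  medicinale\n"
)

# The two patterns from the answer, unchanged.
strict = re.compile(r"(?:azienda|societa)'\s+(\w+(?:[\s-]+\w+)*)\s+ha")
lazy = re.compile(r"(?:azienda|societa)'\s+(.*?)\s+ha")


def company_name(pattern, s):
    """Hypothetical helper: return Group 1 of the first match, or None."""
    m = pattern.search(s)
    if m is None:
        return None
    # Optional extra step (not part of the answer): collapse internal
    # runs of whitespace to single spaces.
    return " ".join(m.group(1).split())


name_strict = company_name(strict, text)
name_lazy = company_name(lazy, text)
raw_group = strict.search(text).group(1)

print(repr(name_strict))
print(repr(name_lazy))
```

In both cases the whitespace before ha is consumed by the `\s+` outside the group, so the raw Group 1 value already ends at "GMBH" with no trailing blank.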

Upvotes: 1
