Reputation: 1161
Italian laws are officially published in the Gazzetta Ufficiale, and I am trying to identify company names in them with the following regex:
azienda|societa'\s+([\w\s-]+) ha
which works reasonably well on fragments such as:
Vista la domanda presentata in data 26 febbraio 2021 con la quale
la societa' Orpha-Devel Handels Und Vertriebs GMBH ha chiesto la
riclassificazione dalla classe C(nn) alla classe H del medicinale
«Tresuvi» (treprostinil) relativamente alle confezioni aventi A.I.C.
n. 049207032, 049207044, 049207018 e 049207020;
returning the string "Orpha-Devel Handels Und Vertriebs GMBH " in the capturing group. To make this "perfect", I just want the trailing blanks (usually one or two) to be excluded from the captured group.
Upvotes: 1
Views: 51
Reputation: 626806
You can use either of these patterns:
(?:azienda|societa)'\s+(\w+(?:[\s-]+\w+)*)\s+ha
(?:azienda|societa)'\s+(.*?)\s+ha
See the regex demo #1 and regex demo #2.
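As a quick check, here is a minimal Python sketch that runs the original regex and both patterns above against the fragment from the question. The double space before "ha" is an assumption added to reproduce the trailing-blank problem, and the constant names are just illustrative.

```python
import re

# Fragment quoted in the question. The double space before "ha" is an
# assumption made here so that the trailing-blank problem shows up.
TEXT = (
    "Vista la domanda presentata in data 26 febbraio 2021 con la quale\n"
    "la societa' Orpha-Devel Handels Und Vertriebs GMBH  ha chiesto la\n"
    "riclassificazione dalla classe C(nn) alla classe H del medicinale\n"
)

# Regex from the question: the character class also accepts whitespace,
# so extra blanks before " ha" end up inside the group.
ORIGINAL = re.compile(r"azienda|societa'\s+([\w\s-]+) ha")

# Pattern 1: the group must start and end with a word character.
STRICT = re.compile(r"(?:azienda|societa)'\s+(\w+(?:[\s-]+\w+)*)\s+ha")

# Pattern 2: lazy match, with the blanks consumed outside the group.
LAZY = re.compile(r"(?:azienda|societa)'\s+(.*?)\s+ha")

original_name = ORIGINAL.search(TEXT).group(1)
strict_name = STRICT.search(TEXT).group(1)
lazy_name = LAZY.search(TEXT).group(1)

print(repr(original_name))  # 'Orpha-Devel Handels Und Vertriebs GMBH '
print(repr(strict_name))    # 'Orpha-Devel Handels Und Vertriebs GMBH'
print(repr(lazy_name))      # 'Orpha-Devel Handels Und Vertriebs GMBH'
```

In both replacement patterns the whitespace before ha is matched by \s+ outside the group, which is why the captured name no longer carries trailing blanks.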
Note that you need to group azienda and societa; otherwise, the capturing group only belongs to the societa alternative, and a match on azienda captures nothing.
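The effect of the missing group can be seen in a short Python sketch. The sample sentence and the company name "Esempio Farma" are made up for illustration. The last pattern, with the apostrophe moved inside the societa alternative, is my assumption about what is wanted if azienda appears in the text without a trailing apostrophe; check it against your data.

```python
import re

# Hypothetical sentence that uses "azienda" instead of "societa'".
SAMPLE = "l'azienda Esempio Farma ha chiesto la riclassificazione"

# Ungrouped alternation from the question: it reads as
#   azienda   OR   societa'\s+([\w\s-]+) ha
UNGROUPED = re.compile(r"azienda|societa'\s+([\w\s-]+) ha")

# Grouped alternation with the apostrophe after the group: both words
# must then be followed by an apostrophe.
GROUPED = re.compile(r"(?:azienda|societa)'\s+(\w+(?:[\s-]+\w+)*)\s+ha")

# Assumed variant: the apostrophe belongs to "societa'" only.
VARIANT = re.compile(r"(?:azienda|societa')\s+(\w+(?:[\s-]+\w+)*)\s+ha")

ungrouped_match = UNGROUPED.search(SAMPLE)
matched_text = ungrouped_match.group(0)  # only the bare word matches
captured = ungrouped_match.group(1)      # the group did not take part

grouped_match = GROUPED.search(SAMPLE)   # no apostrophe after "azienda"
variant_name = VARIANT.search(SAMPLE).group(1)

print(matched_text, captured)  # azienda None
print(grouped_match)           # None
print(variant_name)            # Esempio Farma
```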
Details:
(?:azienda|societa) - either azienda or societa
' - a ' char
\s+ - one or more whitespaces
(\w+(?:[\s-]+\w+)*) - Group 1: one or more word chars and then zero or more repetitions of one or more whitespaces/hyphen chars and then one or more word chars
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
ha - a ha string.
Upvotes: 1