Reputation: 33
I'm trying to extract the company names from press releases. As an example, below is a snippet (in French) of a press release containing a list of seven companies whose names end in inc.
En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.
I'm trying to extract all the names using the following code:
aa = re.findall('inc\.,? (.*?inc\.)', text)
I do manage to capture quite a few, but for some reason I can't figure out, I can't extract them all. It seems trivial, but it has stumped me for a few hours.
Any help is appreciated!
Upvotes: 1
Views: 1868
Reputation: 22817
Using the regex module (instead of re) you can use this solution.
This is the original regex and only matches inc. It also doesn't allow company names that contain et. See Option 2 for a more comprehensive regular expression.
[\p{Lu}\p{N}](?:(?!et)[^,])*inc\.
For a more comprehensive regular expression that also checks for other company entities such as ltd. or sons, you can use the following regex.
(?:et|,)[^,]*?([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))
Note: In some flavours of regex you can use the \K token. This token resets the starting point of the reported match (any previously consumed characters are no longer included in the final match). If your regex engine supports the \K token (and doesn't convert it to a literal K), you can use the following (effectively eliminating the need for capture groups).
(?:et|,)[^,]*?\K[\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)
              ^^
En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.
Asphalte Vrac Transport inc.
9163-6704 Québec inc.
Entreprise Denis Dupré inc.
Gestion Jean M. Machado inc.
Impact Technologie Environnementale inc.
Les entreprises Luc Clément inc.
Transport Vrac Globe International inc.
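If installing the third-party regex module is not an option, both patterns can be approximated with the standard library's re. This is only a sketch: re has no \p{Lu} or \p{N}, so the hypothetical UPPER_OR_DIGIT class below (ASCII capitals, Latin-1 accented capitals and ASCII digits) stands in for them and would miss uppercase letters or digits outside those ranges.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact "
    "Technologie Environnementale inc., Les entreprises Luc Clément inc. et "
    "Transport Vrac Globe International inc."
)

# Assumption: a stand-in for [\p{Lu}\p{N}], which the built-in re module lacks.
UPPER_OR_DIGIT = "[A-ZÀ-ÖØ-Þ0-9]"

# Option 1: tempered greedy token; the whole match is the company name.
option1 = re.compile(UPPER_OR_DIGIT + r"(?:(?!et)[^,])*inc\.")

# Option 2: "et" or a comma must come first; group 1 holds the company name.
option2 = re.compile(
    r"(?:et|,)[^,]*?(" + UPPER_OR_DIGIT + r"[^,]*?\s(?:inc\.|sons|ltd\.))"
)

names1 = option1.findall(text)
# With exactly one capture group, findall returns that group's text.
names2 = option2.findall(text)

for name in names2:
    print(name)
```

On this sample both patterns should give the same seven names; they differ on names that contain et, which only the second pattern allows.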
For the first pattern:
[\p{Lu}\p{N}]
Match any character in the set: \p{Lu} is any uppercase letter in any language (this covers accented French capitals) and \p{N} is any numeric character (for numbered companies)
(?:(?!et)[^,])*
Match the following any number of times (tempered greedy token)
    (?!et)
    Negative lookahead ensuring what follows does not match et literally
    [^,]
    Match any character except comma , literally
inc\.
Match inc. literally

For the second pattern (Option 2):
(?:et|,)
Match either et or comma , literally
[^,]*?
Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))
Capture the following into capture group 1
    [\p{Lu}\p{N}]
    Match any Unicode uppercase character or Unicode number (for numbered companies)
    [^,]*?
    Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
    \s
    Match a whitespace character
    (?:inc\.|sons|ltd\.)
    Match either of the following
        inc\.
        Match inc. literally
        sons
        Match sons literally
        ltd\.
        Match ltd. literally

Using the regex module allows us to use Unicode character classes such as \p{Lu} to ensure we also catch company names beginning with uppercase Unicode characters such as É.
The regular expression links (under Code) include an additional string to test against:
, Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons ltd.
With this additional line added, only the following strings should be caught (valid company names according to the OP's specifications):
Étoile Simpsons et sons
Étoile Simpsons inc.
Étoile et Simpsons ltd.
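A rough way to check these expectations with the standard library is sketched below. It makes two assumptions: the test string is modelled on the additional line (with the last name ending in ltd., matching the expected results), and the hypothetical UPPER_OR_DIGIT class approximates [\p{Lu}\p{N}], which the built-in re module does not support.

```python
import re

# Assumption: a stand-in for [\p{Lu}\p{N}], which the built-in re module lacks.
UPPER_OR_DIGIT = "[A-ZÀ-ÖØ-Þ0-9]"

pattern = re.compile(
    r"(?:et|,)[^,]*?(" + UPPER_OR_DIGIT + r"[^,]*?\s(?:inc\.|sons|ltd\.))"
)

# Names that should be caught, including "sons" inside "Simpsons"
# and "et" inside a company name.
extra = ", Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons ltd."
matches = pattern.findall(extra)

# A sentence that has all the ingredients but is not a company name.
non_company = "Their sons et sons, les sons."
false_hits = pattern.findall(non_company)

print(matches)
print(false_hits)
```

The second call should come back empty: nothing after et or the comma in that sentence starts with an uppercase letter or a digit.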
This presents a few challenges, including:
- Company names can begin with an accented uppercase character such as É.
  [A-Z] therefore cannot be used to ensure a name begins with an uppercase character.
- A company name can end with sons but also include sons earlier on (so matching cannot stop at the first match for sons).
  Étoile Simpsons et sons, for example.
- sons also occurs inside Simpsons. A natural instinct (at least in regex) might be to use \b to assert a word boundary. As much as this might be the preferred method, it doesn't work reliably in this case. Take the French word blésons as an example. In an engine whose \b is not Unicode-aware, \bsons matches inside blésons because é is not treated as a word character, and many regex engines get this wrong even with the u flag enabled (this is why I use \s instead).
- sons can appear after the company name ends (as in the sentence Their sons et sons, les sons.). The match must not extend past the end of the company name.
  Hence the lazy quantifier, as in .*?. Making it lazy allows it to stop at the first match instead of incorrectly matching the whole sentence.
- Their sons et sons, les sons. contains all the parts of a valid company name (a word starting with an uppercase character, followed by the word sons), but it should not match because it is not a company name.
  Since there is a , before each company name, I use this to determine what is and is not a company name.
Upvotes: 6
Reputation: 184071
Bit late to the party since an answer has already been accepted, but anyway, here's a solution that uses Python's built-in re module rather than the third-party regex module.
Your attempt correctly anchors the end of the company name on inc., but you need some way to capture the start of the name. Let's define a company name as:
1. a word beginning with a capital letter or a number
2. optional one or more words
3. inc.
Further, we'll define a word as a string of letters and/or numbers possibly containing one or more hyphens. Normally we would use \w to represent a word character, but that doesn't include hyphens, so we'll need to match those separately.
So:
1. [A-Z0-9](?:\w|-)*
2. (?:\w|-)+
3. inc\.
Words are separated by white space, which we will denote as \s+. So for #2's "optional one or more words" we must create a group that includes one or more word characters (including hyphen) followed by one or more space characters, and repeat that group zero or more times: (?:(?:\w|-)+\s+)*
So, putting it all together and adding \b at the start to make sure the match starts on a whole word:
re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*inc\.", text)
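To see what changes, here is a small sketch that runs the question's pattern and the pattern above on the sample text. The variable names are only illustrative. The question's pattern uses up the inc. that ends one name in order to anchor the next, so every other name gets skipped. Running the sketch also suggests one caveat for the pattern above: a word as defined here cannot contain a period, so the initial in Gestion Jean M. Machado inc. appears to cut that name down to Machado inc.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact "
    "Technologie Environnementale inc., Les entreprises Luc Clément inc. et "
    "Transport Vrac Globe International inc."
)

# Pattern from the question: matches cannot overlap, and each one starts by
# consuming the previous name's "inc.", so alternate names are lost.
question_hits = re.findall(r"inc\.,? (.*?inc\.)", text)

# Pattern from this answer: start on a capitalised or numeric word instead.
answer_hits = re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*inc\.", text)

print(len(question_hits), question_hits)
print(len(answer_hits), answer_hits)
```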
To extend this so you can also catch names ending with Ltd. or Sons and to also catch capitalized Inc. and make the period optional:
re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*(?:[Ii]nc|[Ll]td|[Ss]ons)(?:\.|\b)", text)
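Since these patterns lean on \b, it may help to see how Python 3's re treats word boundaries next to accented letters. The sketch below is only an illustration, with blésons as an arbitrary test word. For str patterns, \w and \b are Unicode-aware by default, and the re.ASCII flag switches that off. Other regex engines may behave differently.

```python
import re

word = "bl\u00e9sons"  # "blésons", with é as a single code point

# Default for str patterns in Python 3: "é" counts as a word character,
# so there is no word boundary between "é" and "sons".
unicode_hit = re.search(r"\bsons", word)

# With re.ASCII, "é" is not a word character, so a boundary appears
# right after it and the pattern matches inside the word.
ascii_hit = re.search(r"\bsons", word, flags=re.ASCII)

# Requiring whitespace before the suffix avoids the issue either way.
space_hit = re.search(r"\ssons", word)

print(unicode_hit, ascii_hit, space_hit)
```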
Upvotes: 0
Reputation: 7349
This pattern appears to do the trick:
>>> import re
>>> string = """En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc."""
>>> pattern = r'((?:[A-Z0-9\-]\.?\w*\s?(?:[a-z0-9\-]\w*\s?)?)+ inc\.)'
>>> m = re.findall(pattern, string)
>>> print('\n'.join(m))
Asphalte Vrac Transport inc.
9163-6704 Québec inc.
Entreprise Denis Dupré inc.
Gestion Jean M. Machado inc.
Impact Technologie Environnementale inc.
Les entreprises Luc Clément inc.
Transport Vrac Globe International inc.
Explanation:
[A-Z0-9\-] # match an uppercase letter or number or dash
\.? # match optional dot
\w* # match alpha-numeric chars 0 or more times
\s? # match optional white-space
(?:[a-z0-9\-]\w*\s?)? # same again except with lowercase letters
# the ? means 0 or 1 times
inc\. # match ' inc.'
(?: ... ) # non-capturing group
( ... ) # capturing group (whole thing)
x? # match x optional
x* # in this case match x 0 or more times
x+ # match x 1 or more times
Upvotes: 1
Reputation: 129
aa = [s.strip() for s in text.split(',') if s.lower().endswith(' inc.')]
Upvotes: 0
Reputation: 1580
In this case, you can avoid using a regex; instead, try:
text.split(",")
and then iterate through the resulting list and look for entries ending in "inc.".
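A minimal sketch of that loop is below. The function name extract_companies and its suffix parameter are hypothetical. The sketch also makes two assumptions that go beyond a plain split: " et " is treated like a comma (so a name that itself contains " et " would be broken up), and leading lowercase words such as "dont" are dropped.

```python
def extract_companies(text, suffix="inc."):
    """Split on commas and keep the pieces that end with the given suffix."""
    names = []
    # Assumption: " et " only ever separates the last two names in the list.
    for part in text.replace(" et ", ", ").split(","):
        words = part.split()
        # Drop leading words (e.g. "dont") until one starts with an
        # uppercase letter or a digit.
        while words and not (words[0][0].isupper() or words[0][0].isdigit()):
            words.pop(0)
        candidate = " ".join(words)
        if candidate.lower().endswith(" " + suffix):
            names.append(candidate)
    return names


text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact "
    "Technologie Environnementale inc., Les entreprises Luc Clément inc. et "
    "Transport Vrac Globe International inc."
)

companies = extract_companies(text)
for name in companies:
    print(name)
```

This stays regex-free, at the cost of relying on the list being comma-separated with a single " et " before the last item.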
Upvotes: 0