Reputation: 33

Python 2.X : Regex to find all company names ending in ".inc"

I'm trying to extract the company names from press releases. As an example, below is a snippet (in French) of a press release listing seven companies whose names end in inc.

En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.

I'm trying to extract all the names using the following code:

aa = re.findall('inc\.,? (.*?inc\.)', text)

I do manage to capture quite a few, but for some reason I can't figure out, I can't extract them all. It seems trivial but it has stumped me for a few hours...

Any help is appreciated !

Upvotes: 1

Answers (5)

ctwheels

Reputation: 22837

Brief

Using the regex module (instead of re) you can use this solution.

Code

Option 1

This is the original regex and only matches names ending in inc. It also doesn't allow company names that contain et. See Option 2 for a more comprehensive regular expression.

See regex in use here

[\p{Lu}\p{N}](?:(?!et)[^,])*inc\.

Option 2

For a more comprehensive regular expression that also checks for other company entities such as ltd. or sons, you can use the following regex.

See regex in use here

(?:et|,)[^,]*?([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))

Note: In some flavours of regex you can use the \K token. This token resets the starting point of the reported match (any previously consumed characters are no longer included in the final match). If your regex engine supports the \K token (and doesn't convert it to a literal K), you can use the following (effectively eliminating the need for capture groups).

See regex in use here

(?:et|,)[^,]*?\K[\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)
              ^^

Results

Input

En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.

Output

Asphalte Vrac Transport inc.
9163-6704 Québec inc.
Entreprise Denis Dupré inc.
Gestion Jean M. Machado inc.
Impact Technologie Environnementale inc.
Les entreprises Luc Clément inc.
Transport Vrac Globe International inc.

Explanation

Option 1

[\p{Lu}\p{N}] Match any character in the set: \p{Lu} is any uppercase letter in any language (which covers accented French capitals) and \p{N} is any Unicode number (for numbered companies)
(?:(?!et)[^,])* Match the following any number of times (tempered greedy token)
- (?!et) Negative lookahead ensuring what follows does not match et literally
- [^,] Match any character except comma , literally
inc\. Match inc. literally
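
A rough way to try Option 1 with only the standard library is sketched below. It assumes Python 3 and swaps [\p{Lu}\p{N}] for a narrower hand-written class, [A-Z0-9À-ÖØ-Þ], because the built-in re module does not understand \p{...}; with the regex module the pattern from the Code section can be used unchanged. The variable names are only illustrative.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

# Assumption: stand-in for [\p{Lu}\p{N}] covering ASCII and Latin-1
# uppercase letters plus ASCII digits only.
START = "[A-Z0-9À-ÖØ-Þ]"

# Tempered greedy token: consume any non-comma character,
# but never step onto the two letters "et".
option1 = re.compile(START + r"(?:(?!et)[^,])*inc\.")

names = option1.findall(text)
print("\n".join(names))
```

The stand-in class misses uppercase letters outside Latin-1, which is the gap the regex module's \p{Lu} closes.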

Option 2

(?:et|,) Match either et or comma , literally
[^,]*? Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)) Capture the following into capture group 1
- [\p{Lu}\p{N}] Match any Unicode uppercase character or Unicode number (for number companies)
- [^,]*? Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
- \s Match a whitespace character
- (?:inc\.|sons|ltd\.) Match either of the following
  - inc\. Match inc. literally
  - sons Match sons literally
  - ltd\. Match ltd. literally
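
As a minimal sketch of how the capture group in Option 2 behaves, the snippet below runs it over a shortened excerpt of the input. It assumes Python 3's built-in re and, since re has no \p{...} support, uses the hand-written class [A-Z0-9À-ÖØ-Þ] as a narrower stand-in for [\p{Lu}\p{N}]. The names sample, START and option2 are made up for the example.

```python
import re

# Shortened excerpt of the press release sentence.
sample = (
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Les entreprises Luc Clément inc. et Transport Vrac Globe International inc."
)

START = "[A-Z0-9À-ÖØ-Þ]"  # assumption: stand-in for [\p{Lu}\p{N}]

option2 = re.compile(
    r"(?:et|,)[^,]*?"                                   # delimiter, then lazily skip filler
    r"(" + START + r"[^,]*?\s(?:inc\.|sons|ltd\.))"     # group 1: the company name
)

# With exactly one capture group, findall returns only the group contents,
# so the leading "et" or "," never shows up in the results.
names = option2.findall(sample)
print("\n".join(names))
```

Because the delimiter is consumed outside the group, the filler word dont and the joining et are skipped without any extra clean-up.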

Notes

Regex module vs re

Using regex module allows us to use Unicode character classes such as \p{Lu} to ensure we also catch the possibility of company names beginning with uppercase Unicode characters such as É.

Catching Special Cases

The regular expression links (under Code) include an additional string to test against:

, Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons inc.

With this additional line added only the following strings should be caught (valid company name according to the OP's specifications):

Étoile Simpsons et sons
Étoile Simpsons inc.
Étoile et Simpsons inc.

This presents a few challenges including:

Company name begins with uppercase Unicode character É.
- This means we must ensure compatibility with Unicode uppercase letters, so something like [A-Z] cannot be used to check that a name begins with an uppercase character.
Company ends with sons, but also includes sons (cannot stop at first match for sons).
- Take the case of Étoile Simpsons et sons for example.
  - This should not end at sons in Simpsons. A natural instinct (at least in regex) might be to use \b to assert a word boundary. As much as this might be the preferred method, it doesn't work in this case. Take the French word blésons as an example. Using \b will actually match in blésons since regex engines very seldom match \b correctly with Unicode characters even with u flag enabled (this is why I use \s instead).
The word sons appears after the company name ends (in the sentence Their sons et sons, les sons.). It must not extend past the company name's ending.
- This is a great case for using lazy quantifiers i.e. .*?. Making it lazy will allow it to stop at the first match instead of matching the whole sentence incorrectly.
The string Their sons et sons, les sons. contains all the parts of a valid company name (a word starting with an uppercase character, followed by the word sons), but this should not match as it's not a company name.
- Since the OP specified a , before each company name, I use this to determine what is and is not a company name.
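
The special cases above can be checked with a short sketch. It assumes Python 3's built-in re, again with [A-Z0-9À-ÖØ-Þ] standing in for [\p{Lu}\p{N}]. How \b treats blésons depends on the engine: Python 3's re is Unicode-aware for str patterns, so the sketch uses re.ASCII to imitate an engine whose \b only knows ASCII letters.

```python
import re

extra = ", Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons inc."

START = "[A-Z0-9À-ÖØ-Þ]"  # assumption: stand-in for [\p{Lu}\p{N}]
option2 = re.compile(r"(?:et|,)[^,]*?(" + START + r"[^,]*?\s(?:inc\.|sons|ltd\.))")

# \s before "sons" keeps the match from stopping inside "Simpsons".
names = option2.findall(extra)
print(names)

# \b only protects "blésons" when the engine treats "é" as a word character.
unicode_aware = re.search(r"\bsons", "blésons")         # no boundary between "é" and "s"
ascii_only = re.search(r"\bsons", "blésons", re.ASCII)  # "é" is a non-word character here
print(unicode_aware, ascii_only)
```

In the ASCII-only case \b finds a boundary right after é and sons matches inside the word, which is the failure mode that the \s in Option 2 avoids regardless of engine.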

Upvotes: 6

kindall

Reputation: 184365

Bit late to the party since an answer has already been accepted, but anyway, here's a solution that uses Python's built-in re module rather than the third-party regex module.

Your attempt correctly anchors the end of the company name on inc. but you need some way to capture the start of the name. Let's define a company name as:

1. A word starting with a capital letter or a number, followed by,
2. Optionally one or more additional words (optional because a firm may have a one-word name). These need not start with an uppercase letter. Then, finally,
3. inc.

Further, we'll define a word as a string of letters and/or numbers possibly containing one or more hyphens. Normally we would use \w to represent a word character, but that doesn't include hyphens, so we'll need to match that separately.

So:

1. A word starting with a capital letter or a number: [A-Z0-9](?:\w|-)*
2. Zero or more additional words, each denoted as: (?:\w|-)+
3. inc\.

Words are separated by white space, which we will denote as \s+. So for #2's "optional one or more words" we must create a group that includes one or more word characters (including hyphen) followed by one or more space characters, and repeat that group zero or more times: (?:(?:\w|-)+\s+)*

So, putting it all together and adding \b at the start to make sure the match begins on a whole word:

re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*inc\.", text)

To extend this so you can also catch names ending with Ltd. or Sons and to also catch capitalized Inc. and make the period optional:

re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*(?:[Ii]nc|[Ll]td|[Ss]ons)\b\.?", text)
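
For anyone who wants to try the inc. pattern built up above, here is a small sketch that assembles it from its pieces and runs it on the sample sentence. It assumes Python 3, where \w already matches accented letters in str patterns; in Python 2 the pattern and text would need to be unicode strings used with re.UNICODE. The constant names and the with_initials variant are hypothetical additions, not part of the answer: as written, a word cannot contain a dot, so the initial in "Jean M. Machado" cuts that name short.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

FIRST_WORD = r"[A-Z0-9](?:\w|-)*"  # capital letter or digit, then word characters or hyphens
WORD = r"(?:\w|-)+"                # any later word

# The pattern exactly as assembled in the walkthrough.
as_written = re.compile(r"\b" + FIRST_WORD + r"\s+(?:" + WORD + r"\s+)*inc\.")

# Hypothetical tweak: also accept a one-letter initial such as "M."
with_initials = re.compile(
    r"\b" + FIRST_WORD + r"\s+(?:(?:" + WORD + r"|[A-Z]\.)\s+)*inc\."
)

print(as_written.findall(text)[3])     # Machado inc.
print(with_initials.findall(text)[3])  # Gestion Jean M. Machado inc.
```

The other six names come out the same from both patterns; only the one containing an initial differs.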

Upvotes: 0

Totem

Reputation: 7369

This pattern appears to do the trick:

   >>> string = """En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc."""
   >>> pattern = r'((?:[A-Z0-9\-]\.?\w*\s?(?:[a-z0-9\-]\w*\s?)?)+ inc\.)'
   >>> m = re.findall(pattern, string)
   >>> print('\n'.join(m))

   Asphalte Vrac Transport inc.
   9163-6704 Québec inc.
   Entreprise Denis Dupré inc.
   Gestion Jean M. Machado inc.
   Impact Technologie Environnementale inc.
   Les entreprises Luc Clément inc.
   Transport Vrac Globe International inc.

Explanation:

   [A-Z0-9\-] # match an uppercase letter or number or dash
   \.?        # match optional dot
   \w*        # match alpha-numeric chars 0 or more times
   \s?        # match optional white-space

   (?:[a-z0-9\-]\w*\s?)? # same again except with lowercase letters
                         # the ? means 0 or 1 times

    inc\.     # match ' inc.'
   (?: ... )  # non-capturing group
   ( ... )    # capturing group (whole thing)
   x?          # match x optional
   x*          # in this case match x 0 or more times
   x+          # match x 1 or more times

Upvotes: 1

Gamaliel

Reputation: 129

aa = [s.strip() for s in text.split(',') if s.lower().endswith(' inc.')]

Upvotes: 0

usernamenotfound

Reputation: 1580

In this case, you can avoid using a regex, instead try:

text.split(",")

and then iterate through the resulting list and look for entries ending in "inc.".
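
A minimal sketch of that idea follows, assuming Python 3. Two details of the sample sentence need extra handling that a plain comma split does not give: the last two names are joined by " et " rather than a comma, and the first name is preceded by the word dont. The function name and both workarounds are my own assumptions for illustration.

```python
def company_names(text):
    """Collect comma-separated pieces that end in ' inc.'."""
    names = []
    for chunk in text.split(","):
        # The last two names in the sample are joined by " et " instead of a comma.
        for piece in chunk.split(" et "):
            piece = piece.strip()
            if not piece.lower().endswith(" inc."):
                continue
            words = piece.split()
            # Drop leading filler such as "dont": skip words until one
            # starts with an uppercase letter or a digit.
            while words and not (words[0][0].isupper() or words[0][0].isdigit()):
                words.pop(0)
            if words:
                names.append(" ".join(words))
    return names


text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)
names = company_names(text)
print("\n".join(names))
```

Splitting on " et " would break a name that itself contains et, so this only suits text shaped like the sample.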

Upvotes: 0
