taga

Reputation: 3885

Regex for detecting company names in Python

I want to detect company names with a regex in Python.

This is my idea:

  1. The company name should have between 1 and 3 words
  2. The first word of the company name should be capitalized
  3. One of the words in the company name can contain .com or .co (Amazon.com Inc)
  4. The word after the name (at most the fourth word overall) should be a suffix such as Inc., Ltd, GmbH, AG, Group, Holding, etc.
  5. Between the last word of the name and the suffix (Inc., Ltd, GmbH, AG) there can sometimes be ',' or ', '

I have tried something like this but it does not work:

import re

address_1 = 'I work in Amazon.com Inc.'
address_2 = 'Company named Swiss Medic Holding invested in vaccine'
address_3 = 'what do you think about Abercrombie & Fitch Co. ?'
address_4 = 'do you work in Delta Group?'
address_5 = 'I have worked in CocaCola Gmbh'

# Raw string so that \w reaches the regex engine unchanged
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'
found = re.search(regex_company, address_1)  # same call for address_2 ... address_5

I want to print the detected companies. I have used the same regex logic to find street addresses and it works well, but for company names it does not. This is the regex that I used for street addresses:

regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

Regex logic: number + 1-3 words + street/st/road/rd
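
To make "does not work" concrete, here is a minimal sketch that runs the company pattern over the five sample strings. The `samples` list and the loop are illustrative additions, not part of the original attempt. As far as I can tell, only 'Delta Group' is found: the other names either end in a suffix that is missing from the alternation (Holding, Co, Gmbh) or contain a '.' where the pattern requires a space or hyphen (Amazon.com).

```python
import re

# The company pattern from above, written as a raw string
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

# Illustrative list holding the five sample sentences
samples = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

results = []
for text in samples:
    found = re.search(regex_company, text)
    # Keep the matched text, or None when the pattern finds nothing
    results.append(found.group(0) if found else None)

print(results)
# [None, None, None, 'Delta Group', None]
```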

Upvotes: 0

Views: 2495

Answers (2)

Ryszard Czech

Reputation: 18611

Use

\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    m?                       'm' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (between 0 and 2
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    [ -]+                    any character of: ' ', '-' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      &                        '&'
--------------------------------------------------------------------------------
      [ -]+                    any character of: ' ', '-' (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      co                       'co'
--------------------------------------------------------------------------------
      m?                       'm' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
  ){0,2}                   end of grouping
--------------------------------------------------------------------------------
  [,\s]+                   any character of: ',', whitespace (\n, \r,
                           \t, \f, and " ") (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (?i:                     group, but do not capture (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally):
--------------------------------------------------------------------------------
    ltd                      'ltd'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    llc                      'llc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    inc                      'inc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    plc                      'plc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      rp                       'rp'
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    group                    'group'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    holding                  'holding'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    gmbh                     'gmbh'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char

Python code:

import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

test_str = ("I work in Amazon.com Inc.\n"
    "Company named Swiss Medic Holding invested in vaccine\n"
    "what do you think about Abercrombie & Fitch Co. ?\n"
    "do you work in Delta Group?\n"
    "I have worked in CocaCola Gmbh")

print(re.findall(regex, test_str))

Results: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
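
If the list of suffixes is expected to grow, one option is to build the alternation from a Python list instead of editing the pattern by hand. This is only a sketch under my own assumptions: the names `SUFFIXES`, `WORD` and `COMPANY_RE` are made up for illustration, and the list simply mirrors the alternation used in the pattern above. Note that the scoped flag group `(?i:...)` needs Python 3.6 or later.

```python
import re

# Hypothetical, extendable suffix list; mirrors ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh
SUFFIXES = ["ltd", "llc", "inc", "plc", "co", "corp", "group", "holding", "gmbh"]

# Escape each entry so that a suffix containing regex metacharacters stays literal
suffix_alt = "|".join(re.escape(s) for s in SUFFIXES)

# One capitalized word, optionally followed by ".co" or ".com"
WORD = r"[A-Z]\w+(?:\.com?)?"

# {{0,2}} is an escaped quantifier inside the f-string and becomes {0,2}
COMPANY_RE = re.compile(
    rf"\b{WORD}(?:[ -]+(?:&[ -]+)?{WORD}){{0,2}}[,\s]+(?i:{suffix_alt})\b"
)

samples = [
    "I work in Amazon.com Inc.",
    "Company named Swiss Medic Holding invested in vaccine",
    "what do you think about Abercrombie & Fitch Co. ?",
    "do you work in Delta Group?",
    "I have worked in CocaCola Gmbh",
]

# The pattern has no capturing groups, so findall returns the full matches
companies = [name for text in samples for name in COMPANY_RE.findall(text)]
print(companies)
# ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
```

The trailing `\b` means the order of "co" and "corp" in the list does not matter: on "Corp" the shorter alternative fails at the word boundary and the engine falls back to the longer one.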

Upvotes: 2

Hammurabi

Reputation: 1169

Use https://regex101.com for testing out regexes; it's great. For your specific example, here is a regex that does what you want. I don't see the need to test for the optional .com in this example.

regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

for address in [address_1, address_2, address_3, address_4, address_5]:
    found = re.search(regex_company, address)
    if found:
        print(found)

# prints:
# <regex.Match object; span=(10, 25), match='Amazon.com Inc.'>
# <regex.Match object; span=(14, 33), match='Swiss Medic Holding'>
# <regex.Match object; span=(24, 47), match='Abercrombie & Fitch Co.'>
# <regex.Match object; span=(15, 26), match='Delta Group'>
# <regex.Match object; span=(17, 30), match='CocaCola Gmbh'>
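
To get just the company names rather than the match objects, `found.group(0)` returns the matched text. A small sketch, with the sample strings repeated so that it runs on its own; the `samples` and `names` variables are my own naming, and the standard `re` module is assumed.

```python
import re

regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

samples = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

names = []
for text in samples:
    found = re.search(regex_company, text)
    if found:
        # group(0) is the whole match, including the trailing '.' of 'Inc.' and 'Co.'
        names.append(found.group(0))

print(names)
# ['Amazon.com Inc.', 'Swiss Medic Holding', 'Abercrombie & Fitch Co.', 'Delta Group', 'CocaCola Gmbh']
```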

Upvotes: 0
