Python - Parsing company name from Domain and Page Title

Question

I've been struggling with parsing a company name from the domain and page title in HTML. Let's say my domain is:

http://thisismycompany.com

and the page title is:

This is an example page title | My Company

My hypothesis is that if I lowercase both strings, remove all non-alphanumeric characters, and take their longest common substring, the result is very likely to be the company name.

So a longest common substring (Link to python 3 code) would return mycompany. How would I go about matching this substring back to the original page title so that I can recover the correct positions of the whitespace and uppercase characters?
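For concreteness, the normalization and longest-common-substring step could be sketched roughly like this. This is not the linked code: it swaps in the standard library's difflib.SequenceMatcher, and the helper names normalize and longest_common_substring are made up for illustration. It also assumes that only the hostname of the URL is compared, with the TLD left in.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def longest_common_substring(a, b):
    # autojunk=False turns off difflib's heuristic of ignoring very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Assumption: compare the hostname only, not the scheme or path.
host = urlparse("http://thisismycompany.com").hostname
title = "This is an example page title | My Company"
print(longest_common_substring(normalize(host), normalize(title)))  # mycompany
```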

CBeltz · Accepted Answer

I considered whether this would be doable using regex, but I figured it would be easier to just use normal string manipulation / comparison, especially because this doesn't seem like a time-sensitive task.

def find_name(normalized_name, full_name_container):
  # Nothing to look for
  if not normalized_name:
    return ''

  n = 0
  full_name = ''
  for ch in full_name_container:
    # If the characters at the current position in both
    # strings match, add the proper case to the final string
    # and move onto the next character
    if normalized_name[n].upper() == ch.upper():
      full_name += ch
      n += 1
      # The whole name has been matched
      if n == len(normalized_name):
        return full_name

    # If the name is interrupted by a separator, add that to the result
    # (separators seen before the name has started are skipped)
    elif ch in ['-', '_', '.', ' ']:
      if n > 0:
        full_name += ch

    # If a character is encountered that is definitely not part of the name,
    # re-start the search; the current character may itself start the name
    elif normalized_name[0].upper() == ch.upper():
      n = 1
      full_name = ch
    else:
      n = 0
      full_name = ''

  # The end of the string was reached without a complete match
  return ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))

This should print out "My Company". The hard-coded list of separators (spaces, hyphens, underscores and periods) that could interrupt the normalized name is probably something you'll have to improve.
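If you do want to try the regex route after all, here is a minimal sketch that also avoids the hard-coded separator list. It assumes that any run of non-alphanumeric characters may sit between two letters of the name, which mirrors the normalization in the question; the function name find_name_regex is made up for this example.

```python
import re


def find_name_regex(normalized_name, full_name_container):
    # Allow any run of non-alphanumeric characters (\W plus underscore)
    # between consecutive characters of the normalized name.
    pattern = r"[\W_]*".join(re.escape(ch) for ch in normalized_name)
    match = re.search(pattern, full_name_container, flags=re.IGNORECASE)
    # Return the matched slice with its original case and spacing,
    # or None if the name does not occur.
    return match.group(0) if match else None


print(find_name_regex('mycompany', 'This is an example page title | My Company'))
```

Because the regex engine tries every start position, this also copes with inputs where a failed partial match overlaps the real one. The trade-off is that it accepts any punctuation inside the name, so something like "My | Company" would be returned whole.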