LexMulier

Reputation: 273

Python - Parsing company name from Domain and Page Title

I've been struggling with parsing a company name from the domain and page title in HTML. Let's say my domain is:

http://thisismycompany.com

and the page title is:

This is an example page title | My Company

My hypothesis is that the longest common substring of the two, after lowercasing both and removing everything except alphanumeric characters, is very likely to be the company name.

So a longest common substring function (Link to python 3 code) would return mycompany. How would I go about matching this substring back to the original page title so that I can recover the correct positions of whitespace and uppercase characters?
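To make the hypothesis concrete, here is a minimal sketch of the normalize-then-compare step. It uses the standard library's difflib.SequenceMatcher rather than the linked code, and the helper names normalize and longest_common_substring are only illustrative.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase, then drop everything except ASCII letters and digits.
    return re.sub(r'[^0-9a-z]+', '', text.lower())

def longest_common_substring(a, b):
    # autojunk=False switches off difflib's heuristic for long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

domain = normalize('http://thisismycompany.com')
title = normalize('This is an example page title | My Company')
print(longest_common_substring(domain, title))  # mycompany
```

Note that the scheme and the TLD are normalized along with the rest of the URL here, so stripping them first would probably reduce accidental matches such as "com".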

Upvotes: 1

Views: 1321

Answers (2)

LexMulier

Reputation: 273

I have solved it by generating all possible substrings of the title, normalizing each one the same way, and comparing it with the match I got from the longest common substring function.

import re

def get_all_substrings(input_string):
    length = len(input_string)
    return set([input_string[i:j+1] for i in range(length) for j in range(i,length)])

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'

# Several substrings normalize to the same text (e.g. '| My Company'),
# so check the shortest candidates first to avoid stray separators.
match = None
for substring in sorted(get_all_substrings(page_title), key=len):
    if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower():
        match = substring
        break

print(match)

Edit: source used
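Generating every substring is quadratic in the title length. As a possible alternative, the sketch below remembers where each kept character came from while normalizing, so the match can be mapped straight back to the original title. The function name map_back is hypothetical, and it assumes the same normalization as above (ASCII letters and digits only, lowercased).

```python
def map_back(original, normalized_match):
    # Keep (original index, lowercased char) for every ASCII alphanumeric char.
    kept = [(i, ch.lower()) for i, ch in enumerate(original)
            if ch.isascii() and ch.isalnum()]
    normalized = ''.join(ch for _, ch in kept)

    pos = normalized.find(normalized_match.lower())
    if not normalized_match or pos == -1:
        return None

    # Translate the match boundaries back to indices in the original string.
    start = kept[pos][0]
    end = kept[pos + len(normalized_match) - 1][0]
    return original[start:end + 1]

page_title = 'This is an example page title | My Company'
print(map_back(page_title, 'mycompany'))  # My Company
```

Because the slice starts and ends on matched characters, the result never carries leading or trailing separators.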

Upvotes: 1

CBeltz

Reputation: 47

I considered whether this would be doable using regex, but I figured it would be easier to just use normal string manipulation / comparison, especially because this doesn't seem like a time-sensitive task.

def find_name(normalized_name, full_name_container):
  if not normalized_name:
    return ''

  separators = ['-', '_', '.', ' ']
  n = 0
  full_name = ''
  for char in full_name_container:
    # If the characters at the current position in both
    # strings match, add the proper case to the final string
    # and move onto the next character
    if normalized_name[n].upper() == char.upper():
      full_name += char
      n += 1
      # Stop as soon as the whole name has been matched
      if n == len(normalized_name):
        return full_name

    # If the name is interrupted by a separator, add that to the result
    # (only once a match has started, to avoid a leading separator)
    elif char in separators:
      if n > 0:
        full_name += char

    # If a character is encountered that is definitely not part of the name
    # Re-start the search, letting this character begin a new match
    else:
      if normalized_name[0].upper() == char.upper():
        n, full_name = 1, char
      else:
        n, full_name = 0, ''

  # The complete name was not found
  return ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))

This should print out "My Company". The hard-coded list of separators (hyphens, underscores, periods and spaces) that may interrupt the normalized name is probably something you'll have to extend.
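For comparison, the regex route could look roughly like the sketch below: allow an optional run of separator characters between consecutive characters of the normalized name and match case-insensitively. The name find_name_regex and the separators parameter are illustrative, not part of the code above.

```python
import re

def find_name_regex(normalized_name, full_name_container, separators='-_. '):
    if not normalized_name:
        return None
    # Zero or more separator characters may sit between two name characters.
    gap = '[' + re.escape(separators) + ']*'
    pattern = gap.join(re.escape(ch) for ch in normalized_name)
    found = re.search(pattern, full_name_container, flags=re.IGNORECASE)
    return found.group(0) if found else None

print(find_name_regex('mycompany', 'Some stuff My Company Some Stuff'))  # My Company
```

Passing a different separators string is one way to extend the list without touching the function body.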

Upvotes: 1
