Reputation: 273
I've been struggling with parsing a company name from the domain and page title in HTML. Let's say my domain is:
http://thisismycompany.com
and the page title is:
This is an example page title | My Company
My hypothesis is that when I take the longest common substring of these two, after lowercasing them and removing everything except alphanumeric characters, the result is very likely to be the company name.
So a longest common substring function (link to Python 3 code) would return mycompany. How would I go about matching this substring back to the original page title so that I can retrieve the correct locations of the whitespace and uppercase characters?
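For reference, the normalization and longest-common-substring step described above could look roughly like this. It is only a sketch: the linked code is not reproduced here, so this version swaps in `difflib.SequenceMatcher` from the standard library, and the helper names `normalize` and `longest_common_substring` are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and drop everything that is not an ASCII letter or digit
    return re.sub(r'[^0-9a-zA-Z]+', '', text).lower()


def longest_common_substring(a, b):
    # autojunk=False disables difflib's heuristic for long inputs,
    # which could otherwise ignore frequently occurring characters
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


domain = normalize('thisismycompany')
title = normalize('This is an example page title | My Company')
print(longest_common_substring(domain, title))
```

Note that the domain is passed in without the scheme and TLD here; stripping those from a full URL is assumed to happen beforehand.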
Upvotes: 1
Views: 1321
Reputation: 273
I solved it by generating the set of all possible substrings of the title, normalizing each one the same way as before, and comparing it with the result of the longest common substring function.
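import re

def get_all_substrings(input_string):
    length = len(input_string)
    # Every contiguous slice of the input; the set drops duplicates
    return set([input_string[i:j+1] for i in range(length) for j in range(i,length)])

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'
match = None  # stays None if no substring normalizes to the target
for substring in get_all_substrings(page_title):
    # Normalize the candidate the same way the match was normalized
    if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower():
        match = substring
        break
# Note: set order is arbitrary, so the result may include leading separators
print(match)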
Edit: source used
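Enumerating every substring grows quadratically with the title length, and because several substrings normalize to the same text (for example with a leading " | "), the first hit is not always the tightest one. A possible alternative, sketched here under the assumption that normalization only lowercases and drops non-alphanumeric ASCII characters, is to remember where each kept character came from and slice the original title. The function names are hypothetical.

```python
def normalize_with_positions(text):
    """Return the normalized text and, for each kept character, its index in `text`."""
    chars, positions = [], []
    for index, char in enumerate(text):
        if char.isascii() and char.isalnum():
            chars.append(char.lower())
            positions.append(index)
    return ''.join(chars), positions


def recover_original(normalized_match, text):
    """Map a normalized match back to the corresponding slice of `text`."""
    if not normalized_match:
        return None
    normalized_text, positions = normalize_with_positions(text)
    start = normalized_text.find(normalized_match)
    if start == -1:
        return None
    end = start + len(normalized_match) - 1
    # Slice from the first matched character to the last one, inclusive
    return text[positions[start]:positions[end] + 1]


print(recover_original('mycompany', 'This is an example page title | My Company'))
```

Since the slice starts and ends on matched characters, separators inside the name are kept and separators around it are not.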
Upvotes: 1
Reputation: 47
I considered whether this would be doable using regex, but I figured it would be easier to just use normal string manipulation / comparison, especially because this doesn't seem like a time-sensitive task.
def find_name(normalized_name, full_name_container):
    n = 0
    full_name = ''
    for char in full_name_container:
        if n == len(normalized_name):
            break
        # If the characters at the current position in both
        # strings match, add the proper case to the final string
        # and move onto the next character
        if normalized_name[n].upper() == char.upper():
            full_name += char
            n += 1
        # If the name is interrupted by a separator, add that to the result
        # (only once the name has started, to avoid a leading separator)
        elif n > 0 and char in ['-', '_', '.', ' ']:
            full_name += char
        # If a character is encountered that is definitely not part of the name
        # Re-start the search
        else:
            n = 0
            full_name = ''
    # Only a complete match counts; a partial one yields an empty string
    return full_name if n == len(normalized_name) else ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))
This should print out "My Company". Hard coding a list of possible items like spaces and commas that could interrupt the normalized name is probably something you'll have to improve.
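One way to avoid the hard-coded separator list might be a regex after all: allow any run of non-alphanumeric characters between consecutive characters of the normalized name. This is only a sketch, and `find_name_regex` is a hypothetical name, not part of the code above.

```python
import re


def find_name_regex(normalized_name, full_name_container):
    if not normalized_name:
        return None
    # Between every pair of characters, permit zero or more
    # characters that are not ASCII letters or digits
    pattern = r'[^0-9a-zA-Z]*'.join(re.escape(c) for c in normalized_name)
    found = re.search(pattern, full_name_container, flags=re.IGNORECASE)
    return found.group(0) if found else None


print(find_name_regex('mycompany', 'Some stuff My Company Some Stuff'))
```

Because the pattern begins and ends with a character of the name, the match carries no leading or trailing separators.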
Upvotes: 1