Reputation: 65
What I'm looking to do is group post titles from a fiction website by the series they belong to. The titles generally look something like this:
titles = ['Series Name: Part 1 - This is the chapter name',
'[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
"[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
'{OC} Another cool story - Part I - This is the chapter name',
'{OC} another cool story: part II: another post title',
'{OC} another cool story part III but the author forgot delimiters',
"this is a one-off story, so it doesn't have any friends"]
Delimiters etc. aren't always there, and there can be some variation.
I'd start by normalizing the string to just alphanumeric characters.
import re
from pprint import pprint as pp
titles = [] # from above
normalized = []
for title in titles:
    title = re.sub(r'\bOC\b', '', title)
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)
pp(normalized)
which gives
['Series Name Part 1 This is the chapter name',
'Series Name Part 2 Another name with the word chapter and extra oc at the start',
"Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
'Another cool story Part I This is the chapter name',
'another cool story part II another post title',
'another cool story part III but the author forgot delimiters',
"this is a one off story so it doesn't have any friends"]
The output I'm hoping for is:
['Series Name',
'Another cool story',
"this is a one-off story, so it doesn't have any friends"] # last element optional
I know of a few different ways to compare strings...
difflib.SequenceMatcher.ratio()
I've also heard of Jaro-Winkler and FuzzyWuzzy.
But all that really matters is that we can get a number showing the similarity between the strings.
I'm thinking I need to come up with (most of) a 2D matrix comparing each string to each other. But once I've got that, I can't wrap my head around how to actually separate them into groups.
I found another post that seems to have done the first part... but then I'm not sure how to continue from there.
scipy.cluster looked promising at first... but then I was in way over my head.
Another thought was somehow combining itertools.combinations() with functools.reduce() with one of the above distance metrics.
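For what it's worth, the matrix idea can be sketched with only the standard library. This is a minimal sketch, not a tuned solution: `similarity`, `group_by_similarity`, and the `threshold` parameter are hypothetical names, and it assumes that any two titles scoring at or above the threshold belong together, transitively (so the groups are the connected components of the "similar enough" graph).

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive difflib ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_by_similarity(strings, threshold):
    """Group strings whose pairwise similarity reaches `threshold`.

    Assumption: similarity is treated as transitive, so two strings end up
    in the same group if a chain of similar pairs connects them.
    """
    # Union-find: parent[i] points towards the representative of i's group.
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every unordered pair once (the upper triangle of the matrix).
    for i, j in combinations(range(len(strings)), 2):
        if similarity(strings[i], strings[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect the strings under their group representative, in input order.
    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

Calling `group_by_similarity(normalized, threshold)` on the list above returns a list of lists, one per group. The threshold has to be found by experiment, and because the chapter names differ, comparing only the first few words of each title may separate the series more cleanly than comparing whole titles.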
Am I way overthinking things? It seems like this should be simple, but it's just not making sense in my head.
Upvotes: 4
Views: 2669
Reputation: 169284
This is an implementation of the ideas put forth in CKM's answer: https://stackoverflow.com/a/61671971/42346
First take out the punctuation -- it's not important to your purpose -- using this answer: https://stackoverflow.com/a/15555162/42346
Then we'll use one of the techniques described here: https://blog.eduonix.com/artificial-intelligence/clustering-similar-sentences-together-using-machine-learning/ to cluster similar sentences.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # only alphanumeric characters
lol_tokenized = []
for title in titles:
    lol_tokenized.append(tokenizer.tokenize(title))
Then get a numeric representation of your titles:
import numpy as np
from gensim.models import Word2Vec
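# gensim >= 4.0 calls the dimension `vector_size` (older releases used `size`)
m = Word2Vec(lol_tokenized, vector_size=50, min_count=1, cbow_mean=1)

def vectorizer(sent, m):
    """Average the word vectors of all tokens in `sent`."""
    vec = []
    numw = 0
    for w in sent:
        try:
            # gensim >= 4.0: vectors live on m.wv (older releases allowed m[w])
            if numw == 0:
                vec = m.wv[w]
            else:
                vec = np.add(vec, m.wv[w])
            numw += 1
        except Exception as e:
            print(e)
    return np.asarray(vec) / numw

l = []
for i in lol_tokenized:
    l.append(vectorizer(i, m))
X = np.array(l)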
Whoo-boy that was a lot.
Now you have to do the clustering.
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=2,init='k-means++',n_init=100,random_state=0)
labels = clf.fit_predict(X)
print(labels)
for index, sentence in enumerate(lol_tokenized):
    print(str(labels[index]) + ":" + str(sentence))
[1 1 0 1 0 0 0]
1:['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
1:['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
0:['OC', 'Series', 'Name', 'part', '3', 'punctuation', 'could', 'be', 'not', 'matching', 'so', 'we', 'can', 't', 'always', 'trust', 'common', 'substrings']
1:['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
0:['OC', 'another', 'cool', 'story', 'part', 'II', 'another', 'post', 'title']
0:['OC', 'another', 'cool', 'story', 'part', 'III', 'but', 'the', 'author', 'forgot', 'delimiters']
0:['this', 'is', 'a', 'one', 'off', 'story', 'so', 'it', 'doesn', 't', 'have', 'any', 'friends']
Then you can pull out the ones with label == 1:
for index, sentence in enumerate(lol_tokenized):
    if labels[index] == 1:
        print(sentence)
['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
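If you want every cluster at once rather than one label at a time, something along these lines should work. It is only a sketch: `group_by_label` is a hypothetical helper name, and it assumes `labels` and the items are in the same order, as they are above.

```python
from collections import defaultdict


def group_by_label(labels, items):
    """Map each cluster label to the list of items that received it."""
    groups = defaultdict(list)
    for label, item in zip(labels, items):
        # int() turns NumPy integer labels into plain Python ints for the keys.
        groups[int(label)].append(item)
    return dict(groups)
```

With the variables above, `group_by_label(labels, lol_tokenized)` would return a dict with one entry per cluster label.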
Upvotes: 4
Reputation: 1971
Your task falls into what is known as semantic similarity. I propose you proceed as follows:
Upvotes: 1