Reputation: 13
I need to find recurring patterns in strings within a list, and then remove those patterns from the strings.
The point is to strip website names from document titles, so that "Amet urna tincidunt efficitur - The Guardian" becomes just "Amet urna tincidunt efficitur".
Using regex to do this is simple when the pattern is known. The problem is that the specific pattern is not known beforehand, only that it keeps recurring.
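(For a single known suffix it would be just one re.sub call; the pattern below is only the "- The Guardian" example written out for illustration.)
import re

title = "Amet urna tincidunt efficitur - The Guardian"
# Strip a known, fixed suffix; this only works because the pattern is known beforehand.
print(re.sub(r"\s*-\s*The Guardian$", "", title))  # Amet urna tincidunt efficitur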
Example data:
data = ["Amet urna tincidunt efficitur - The Guardian",
"Yltricies hendrerit eu a nisi - The Guardian",
"Faucibus pharetra id quis arck - The Guardian",
"Net tristique facilisis | New York Times",
"Quis finibus lacinia | New York Times",
"My blog: Net tristique facilisis",
"My blog: Quis finibus lacinia"]
We can easily see that the substrings "- The Guardian", "| New York Times" and "My blog:" keep recurring. How do I identify these recurring patterns dynamically, and then remove them?
The expected output:
data = ["Amet urna tincidunt efficitur",
"Yltricies hendrerit eu a nisi",
"Faucibus pharetra id quis arck",
"Net tristique facilisis",
"Quis finibus lacinia",
"Net tristique facilisis",
"Quis finibus lacinia"]
Upvotes: 1
Views: 117
Reputation: 2135
You could iteratively look for commonly occurring patterns and build a list of the most common ones to remove. It sounds like your data set is large enough that this is unlikely to be 100% correct, though.
Since you mentioned patterns only occur at the beginning or end, you could do something like this:
from collections import Counter

data = [
    "Amet urna tincidunt efficitur - The Guardian",
    "Yltricies hendrerit eu a nisi - The Guardian",
    "Faucibus pharetra id quis arck - The Guardian",
    "Net tristique facilisis | New York Times",
    "Quis finibus lacinia | New York Times",
    "My blog: Net tristique facilisis",
    "My blog: Quis finibus lacinia",
]

def find_common(data, num_phrases=50):
    # Count every 2- to 5-word phrase at the start and at the end of each title.
    phrases = Counter()
    for sentence in data:
        words = sentence.split()
        for n in range(2, 6):
            phrases[" ".join(words[:n])] += 1
            phrases[" ".join(words[-n:])] += 1
    return phrases.most_common(num_phrases)
find_common(data, 8)
Out[145]:
[('The Guardian', 3),
('- The Guardian', 3),
('York Times', 2),
('Net tristique facilisis', 2),
('New York Times', 2),
('| New York Times', 2),
('Quis finibus lacinia', 2),
('My blog:', 2)]
From there, you could pick out that '- The Guardian', '| New York Times', and 'My blog:' are common web page name patterns. You could then remove those from your data and run it again, iterating over it until you feel like you got most of them.
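A minimal sketch of that removal step, assuming you have hand-picked the patterns from find_common's output (the strip_patterns helper is just for illustration, not part of the code above):
patterns = ["- The Guardian", "| New York Times", "My blog:"]

def strip_patterns(data, patterns):
    # Remove every hand-picked pattern from every title, then trim whitespace.
    cleaned = []
    for sentence in data:
        for p in patterns:
            sentence = sentence.replace(p, "")
        cleaned.append(sentence.strip())
    return cleaned

strip_patterns(data, patterns)
# ['Amet urna tincidunt efficitur', 'Yltricies hendrerit eu a nisi', ...]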
Upvotes: 1
Reputation: 326
Basically, you want something that filters out the words that occur most frequently across a set of documents? You can simply use CountVectorizer from sklearn with the cut-off you want, via the max_df parameter. According to the documentation (CountVectorizer Documentation), max_df does the following:
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
With this, words above a chosen document frequency are ignored. Then just do the reverse: rebuild each title keeping only the tokens that remain in the vocabulary, which eliminates the words that exceed the limit you set.
Example:
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

data = ["Amet urna tincidunt efficitur - The Guardian",
        "Yltricies hendrerit eu a nisi - The Guardian",
        "Faucibus pharetra id quis arck - The Guardian",
        "Net tristique facilisis | New York Times",
        "Quis finibus lacinia | New York Times"]

# Terms appearing in more than 30% of the documents are dropped from the vocabulary.
# Punctuation and single-character tokens (e.g. "a") never enter it either,
# because the default token_pattern only keeps tokens of two or more word characters.
vectorizer = CountVectorizer(max_df=0.3, lowercase=False, strip_accents=None)
X = vectorizer.fit_transform(data)
vocab = vectorizer.vocabulary_

# Rebuild each title, keeping only the tokens that survived the max_df cut-off.
new_data = []
for text in data:
    tokens = word_tokenize(text)
    new_text = []
    for tok_ in tokens:
        if tok_ in vocab:
            new_text.append(tok_)
    new_data.append(" ".join(new_text))
Result:
>>> new_data
['Amet urna tincidunt efficitur',
'Yltricies hendrerit eu nisi',
'Faucibus pharetra id quis arck',
'Net tristique facilisis',
'Quis finibus lacinia']
Upvotes: 1