Reputation: 13
I need to find recurring patterns in strings within a list, and then remove those patterns from the strings.
The point is to strip website names from document titles, so that "Amet urna tincidunt efficitur - The Guardian" becomes just "Amet urna tincidunt efficitur".
Using regex to do this is simple when the pattern is known. The problem is that the specific pattern is not known beforehand, only that it keeps recurring.
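(For a single known suffix it would be just one re.sub call; the pattern below is only the "- The Guardian" example written out for illustration.)
import re

title = "Amet urna tincidunt efficitur - The Guardian"
# Strip a known, fixed suffix; this only works because the pattern is known beforehand.
print(re.sub(r"\s*-\s*The Guardian$", "", title))  # Amet urna tincidunt efficitur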
Example data:
data = ["Amet urna tincidunt efficitur - The Guardian",
"Yltricies hendrerit eu a nisi - The Guardian",
"Faucibus pharetra id quis arck - The Guardian",
"Net tristique facilisis | New York Times",
"Quis finibus lacinia | New York Times",
"My blog: Net tristique facilisis",
"My blog: Quis finibus lacinia"]
We can easily see that the substrings "- The Guardian", "| New York Times" and "My blog:" keep recurring. How do I identify these recurring patterns dynamically, and then remove them?
The expected output:
data = ["Amet urna tincidunt efficitur",
"Yltricies hendrerit eu a nisi",
"Faucibus pharetra id quis arck",
"Net tristique facilisis",
"Quis finibus lacinia",
"Net tristique facilisis",
"Quis finibus lacinia"]
Upvotes: 1
Views: 117
Reputation: 2135
You could iteratively look for commonly occurring patterns and build a list of the most common ones to remove. It sounds like your data set is large enough that this is unlikely to be 100% correct, though.
Since you mentioned patterns only occur at the beginning or end, you could do something like this:
from collections import Counter

data = [
    "Amet urna tincidunt efficitur - The Guardian",
    "Yltricies hendrerit eu a nisi - The Guardian",
    "Faucibus pharetra id quis arck - The Guardian",
    "Net tristique facilisis | New York Times",
    "Quis finibus lacinia | New York Times",
    "My blog: Net tristique facilisis",
    "My blog: Quis finibus lacinia",
]

def find_common(data, num_phrases=50):
    # Count every 2- to 5-word phrase at the start and at the end of each title.
    phrases = Counter()
    for sentence in data:
        words = sentence.split()
        for n in range(2, 6):
            phrases[" ".join(words[:n])] += 1
            phrases[" ".join(words[-n:])] += 1
    return phrases.most_common(num_phrases)
find_common(data, 8)
Out[145]:
[('The Guardian', 3),
('- The Guardian', 3),
('York Times', 2),
('Net tristique facilisis', 2),
('New York Times', 2),
('| New York Times', 2),
('Quis finibus lacinia', 2),
('My blog:', 2)]
From there, you could pick out that '- The Guardian', '| New York Times', and 'My blog:' are common web page name patterns. You could then remove those from your data and run it again, iterating over it until you feel like you got most of them.
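A minimal sketch of that removal step, assuming you have hand-picked the patterns from find_common's output (the strip_patterns helper is just for illustration, not part of the code above):
patterns = ["- The Guardian", "| New York Times", "My blog:"]

def strip_patterns(data, patterns):
    # Remove every hand-picked pattern from every title, then trim whitespace.
    cleaned = []
    for sentence in data:
        for p in patterns:
            sentence = sentence.replace(p, "")
        cleaned.append(sentence.strip())
    return cleaned

strip_patterns(data, patterns)
# ['Amet urna tincidunt efficitur', 'Yltricies hendrerit eu a nisi', ...]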
Upvotes: 1
Reputation: 326
Basically, you want something that filters out the words that occur most frequently across a set of documents? You can simply use CountVectorizer from sklearn with the cut-off you want, via the max_df parameter. According to the documentation (CountVectorizer Documentation), max_df does the following:
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
With this, words above a chosen document frequency are ignored. Then just do the reverse: rebuild each title keeping only the tokens that remain in the vocabulary, which eliminates the words that exceed the limit you set.
Example:
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

data = ["Amet urna tincidunt efficitur - The Guardian",
        "Yltricies hendrerit eu a nisi - The Guardian",
        "Faucibus pharetra id quis arck - The Guardian",
        "Net tristique facilisis | New York Times",
        "Quis finibus lacinia | New York Times"]

# Terms appearing in more than 30% of the documents are dropped from the vocabulary.
# Punctuation and single-character tokens (e.g. "a") never enter it either,
# because the default token_pattern only keeps tokens of two or more word characters.
vectorizer = CountVectorizer(max_df=0.3, lowercase=False, strip_accents=None)
X = vectorizer.fit_transform(data)
vocab = vectorizer.vocabulary_

# Rebuild each title, keeping only the tokens that survived the max_df cut-off.
new_data = []
for text in data:
    tokens = word_tokenize(text)
    new_text = []
    for tok_ in tokens:
        if tok_ in vocab:
            new_text.append(tok_)
    new_data.append(" ".join(new_text))
Result:
>>> new_data
['Amet urna tincidunt efficitur',
'Yltricies hendrerit eu nisi',
'Faucibus pharetra id quis arck',
'Net tristique facilisis',
'Quis finibus lacinia']
Upvotes: 1