glouis

Reputation: 561

Trying to mimic scikit-learn's ngrams with gensim

I'm trying to mimic the ngram_range parameter of CountVectorizer() with gensim. My goal is to be able to run LDA with either scikit-learn or gensim and to obtain very similar bigrams.

For example, scikit-learn finds bigrams such as "abc computer" and "binary unordered", while gensim finds "A survey", "Graph minors"...

I have attached my code below, which compares the unigrams/bigrams produced by gensim and scikit-learn.

Thanks for your help

documents = [["Human" ,"machine" ,"interface" ,"for" ,"lab", "abc" ,"computer" ,"applications"],
      ["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
      ["The", "EPS", "user", "interface", "management", "system"],
      ["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
      ["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
      ["The", "generation", "of", "random", "binary", "unordered", "trees"],
      ["The", "intersection", "graph", "of", "paths", "in", "trees"],
      ["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
      ["Graph", "minors", "A", "survey"]]

With the gensim model we find 48 unique tokens; we can print the unigrams/bigrams with print(dictionary.token2id):

# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add detected bigrams to the docs (min_count=1, so no pair is dropped for frequency alone).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            documents[idx].append(token)

# Replace the underscore separator so bigrams read like "response time".
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)

dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)

And with scikit-learn we find 96 unique tokens; we can print scikit-learn's vocabulary with print(vocab):

# 2. Scikit
import re

from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that splits text on the token pattern, keeping only terms of at
    least a certain length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if len(x) >= min_term_length and x[0].isalpha()]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list containing text documents stored as strings.
    docs: list of strings (not tokenized)
    """
    # Build the count-based vector space model (no TF-IDF weighting or normalization here).
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()  # get_feature_names_out() on newer scikit-learn versions

    return (X, vocab)

docs_join = list()

for i in documents:
    docs_join.append(' '.join(i))

(X, vocab) = preprocess(docs_join, ngram_range = (1,2))

print(vocab)

Upvotes: 0

Views: 1535

Answers (1)

arthur

Reputation: 2399

gensim's Phrases class is designed to "Automatically detect common phrases (multiword expressions) from a stream of sentences." So it only gives you bigrams that "appear more frequently than expected". That's why with the gensim package you only get a few bigrams such as 'response time', 'Graph minors', 'A survey'.
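
How aggressively Phrases merges pairs is controlled by its threshold parameter (default 10.0); lowering it lets weaker pairs through. A minimal sketch, assuming the original documents list from your question (before bigrams were appended to it in place):

from gensim.models import Phrases

# Default threshold (10.0): only strongly associated pairs get joined.
bigram_strict = Phrases(documents, min_count=1)
# A much lower threshold accepts weaker pairs; the exact set depends on the scorer.
bigram_loose = Phrases(documents, min_count=1, threshold=0.1)

print([t for t in bigram_strict[documents[1]] if '_' in t])
print([t for t in bigram_loose[documents[1]] if '_' in t])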

If you look at bigram.vocab you'll see that these bigrams appear 2 times, whereas all other bigrams appear only once.
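
A quick way to check (a sketch; in older gensim versions the vocab keys are bytes, in newer ones plain strings, so decode defensively):

# Print every candidate bigram that was seen more than once.
for key, count in bigram.vocab.items():
    token = key.decode('utf-8') if isinstance(key, bytes) else key
    if '_' in token and count > 1:
        print(token, count)
# e.g. A_survey 2, response_time 2, Graph_minors 2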

scikit-learn's CountVectorizer class gives you all bigrams.
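
If the goal is to reproduce ngram_range=(1, 2) exactly on the gensim side, one option is to skip Phrases and append every adjacent pair yourself before building the Dictionary. A sketch, assuming the original documents list; the counts will still differ slightly from CountVectorizer, which lowercases and drops one-character tokens:

from gensim import corpora

# Append every adjacent pair of tokens to each document, like ngram_range=(1, 2).
docs_with_bigrams = [
    doc + [f"{a} {b}" for a, b in zip(doc, doc[1:])]
    for doc in documents
]

dictionary = corpora.Dictionary(docs_with_bigrams)
print(len(dictionary))  # compare with scikit-learn's 96 terms (exact counts differ: case, 1-char tokens)
print(dictionary.token2id)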

Upvotes: 1
