glouis

Reputation: 561

Trying to mimic scikit-learn's ngrams with gensim

I'm trying to mimic the ngram_range parameter of CountVectorizer() with gensim. My goal is to be able to run LDA with either scikit-learn or gensim and to obtain very similar bigrams.

For example, scikit-learn finds bigrams such as "abc computer" and "binary unordered", while gensim finds "A survey", "Graph minors"...

I have attached my code below, which compares the unigrams/bigrams produced by gensim and scikit-learn.

Thanks for your help

documents = [["Human" ,"machine" ,"interface" ,"for" ,"lab", "abc" ,"computer" ,"applications"],
      ["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
      ["The", "EPS", "user", "interface", "management", "system"],
      ["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
      ["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
      ["The", "generation", "of", "random", "binary", "unordered", "trees"],
      ["The", "intersection", "graph", "of", "paths", "in", "trees"],
      ["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
      ["Graph", "minors", "A", "survey"]]

With the gensim model we find 48 unique tokens; we can print the unigrams/bigrams with print(dictionary.token2id):

# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add detected bigrams to the docs (min_count=1, so no pair is dropped for frequency alone).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            documents[idx].append(token)

# Replace the underscore separator so bigrams read like "response time".
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)

dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)

And with scikit-learn we find 96 unique tokens; we can print scikit-learn's vocabulary with print(vocab):

# 2. Scikit
import re

from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that splits text on the token pattern, keeping only terms of at
    least a certain length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if len(x) >= min_term_length and x[0].isalpha()]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list containing text documents stored as strings.
    docs: list of strings (not tokenized)
    """
    # Build the count-based vector space model (no TF-IDF weighting or normalization here).
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()  # get_feature_names_out() on newer scikit-learn versions

    return (X, vocab)

docs_join = list()

for i in documents:
    docs_join.append(' '.join(i))

(X, vocab) = preprocess(docs_join, ngram_range = (1,2))

print(vocab)

Upvotes: 0

Views: 1535

Answers (1)

arthur

Reputation: 2399

gensim's Phrases class is designed to "Automatically detect common phrases (multiword expressions) from a stream of sentences." So it only gives you bigrams that "appear more frequently than expected". That's why with the gensim package you only get a few bigrams such as 'response time', 'Graph minors', 'A survey'.
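
How aggressively Phrases merges pairs is controlled by its threshold parameter (default 10.0); lowering it lets weaker pairs through. A minimal sketch, assuming the original documents list from your question (before bigrams were appended to it in place):

from gensim.models import Phrases

# Default threshold (10.0): only strongly associated pairs get joined.
bigram_strict = Phrases(documents, min_count=1)
# A much lower threshold accepts weaker pairs; the exact set depends on the scorer.
bigram_loose = Phrases(documents, min_count=1, threshold=0.1)

print([t for t in bigram_strict[documents[1]] if '_' in t])
print([t for t in bigram_loose[documents[1]] if '_' in t])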

If you look at bigram.vocab you'll see that these bigrams appear 2 times, whereas all other bigrams appear only once.
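
A quick way to check (a sketch; in older gensim versions the vocab keys are bytes, in newer ones plain strings, so decode defensively):

# Print every candidate bigram that was seen more than once.
for key, count in bigram.vocab.items():
    token = key.decode('utf-8') if isinstance(key, bytes) else key
    if '_' in token and count > 1:
        print(token, count)
# e.g. A_survey 2, response_time 2, Graph_minors 2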

scikit-learn's CountVectorizer class gives you all bigrams.
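
If the goal is to reproduce ngram_range=(1, 2) exactly on the gensim side, one option is to skip Phrases and append every adjacent pair yourself before building the Dictionary. A sketch, assuming the original documents list; the counts will still differ slightly from CountVectorizer, which lowercases and drops one-character tokens:

from gensim import corpora

# Append every adjacent pair of tokens to each document, like ngram_range=(1, 2).
docs_with_bigrams = [
    doc + [f"{a} {b}" for a, b in zip(doc, doc[1:])]
    for doc in documents
]

dictionary = corpora.Dictionary(docs_with_bigrams)
print(len(dictionary))  # compare with scikit-learn's 96 terms (exact counts differ: case, 1-char tokens)
print(dictionary.token2id)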

Upvotes: 1
