Reputation: 561
I'm trying to mimic the ngram_range parameter of CountVectorizer() with gensim. My goal is to be able to use LDA with either scikit-learn or gensim and to find very similar bigrams.
For example, with scikit-learn we can find bigrams such as "abc computer" and "binary unordered", while with gensim we get "A survey", "Graph minors"...
I have attached my code below to compare gensim and scikit-learn in terms of bigrams/unigrams.
Thanks for your help
documents = [["Human" ,"machine" ,"interface" ,"for" ,"lab", "abc" ,"computer" ,"applications"],
["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
["The", "EPS", "user", "interface", "management", "system"],
["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
["The", "generation", "of", "random", "binary", "unordered", "trees"],
["The", "intersection", "graph", "of", "paths", "in", "trees"],
["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
["Graph", "minors", "A", "survey"]]
With the gensim model we find 48 unique tokens; we can print the unigrams/bigrams with print(dictionary.token2id)
# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add bigrams to the documents (min_count=1 keeps any bigram that occurs at least once).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            documents[idx].append(token)
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)

dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)
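As a quick check (a sketch using the dictionary and documents variables from the snippet above), you can confirm the 48 unique tokens and list the bigrams that Phrases actually added:

# Sanity check: dictionary size and the bigrams Phrases detected
print(len(dictionary))  # 48 unique tokens (unigrams + detected bigrams)
bigrams_found = sorted({token for doc in documents for token in doc if " " in token})
print(bigrams_found)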
And with scikit-learn we find 96 unique tokens; we can print scikit-learn's vocabulary with print(vocab)
# 2. Scikit
import re
from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that extracts word tokens, keeping only terms of at least a certain
    length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if (len(x) >= min_term_length and x[0].isalpha())]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list containing text documents stored as strings.
    docs : list of strings (not tokenized)
    """
    # Build the Vector Space Model with raw term counts (CountVectorizer does not apply TF-IDF)
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()
    return (X, vocab)

docs_join = list()
for i in documents:
    docs_join.append(' '.join(i))

(X, vocab) = preprocess(docs_join, ngram_range=(1, 2))
print(vocab)
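Similarly, a quick check on the scikit-learn side (a sketch using the X and vocab returned above) shows the 96 unique terms and which of them are bigrams:

# Sanity check: vocabulary size and the bigram entries (terms containing a space)
print(len(vocab))  # 96 unique terms
print([term for term in vocab if " " in term])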
Upvotes: 0
Views: 1535
Reputation: 2399
gensim's Phrases class is designed to "Automatically detect common phrases (multiword expressions) from a stream of sentences."
So it only gives you bigrams that "appear more frequently than expected". That's why with the gensim package you only get a few bigrams such as 'response time', 'Graph minors' and 'A survey'.
If you look at bigram.vocab you'll see that these bigrams appear 2 times, whereas all other bigrams appear only once.
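As an illustration (a sketch based on the bigram model from the question; depending on the gensim version the keys of bigram.vocab may be bytes or str), you can inspect those raw counts directly:

# Print the raw counts Phrases collected; bigram keys contain the '_' delimiter
for key, count in bigram.vocab.items():
    name = key.decode("utf-8") if isinstance(key, bytes) else key
    if "_" in name:
        print(name, count)  # e.g. 'response_time' occurs 2 times, most other pairs only once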
scikit-learn's CountVectorizer class, on the other hand, gives you all bigrams.
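If the goal is to reproduce that exhaustive behaviour on the gensim side, one option (a sketch, not part of the original answer) is to skip Phrases' statistical filtering and enumerate every adjacent word pair yourself before building the Dictionary, starting from the original tokenized documents:

from gensim import corpora

# Mimic ngram_range=(1, 2): keep every unigram plus every adjacent word pair
docs_with_bigrams = [doc + [" ".join(pair) for pair in zip(doc, doc[1:])]
                     for doc in documents]
dictionary = corpora.Dictionary(docs_with_bigrams)
print(len(dictionary))  # every bigram is now kept, as with CountVectorizer

Note that CountVectorizer also lowercases by default (and the question's custom tokenizer lowercases too), so the two vocabularies will only line up exactly if the gensim tokens are lowercased the same way.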
Upvotes: 1