Reputation: 275
I want to classify two groups of documents using n-grams. One approach is to extract the important words of each document using tf-idf and then build a CSV file like the one below:
document, ngram1, ngram2, ngram3, ..., label
1, 0.0, 0.0, 0.0, ..., 0
2, 0.0, 0.0, 0.0, ..., 1
...
But given the number of documents, the file will be huge and sparse. The other approach is to merge all documents in each group and extract the n-grams from the merged text; then I can count the occurrences of each n-gram in each document, but I'm not sure this is the best way. What approach would you suggest?
Upvotes: 0
Views: 244
Reputation: 359
I propose you use sklearn's TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It supports n-grams and is memory-efficient because it produces a sparse matrix rather than a dense table. You can pass its output straight to any sklearn classifier to build the classification model.
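For example, a minimal sketch of that idea (the toy documents, labels, and the LogisticRegression classifier below are placeholder choices, not something from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: replace with your own documents and their 0/1 group labels.
docs = ["first document text", "second document text", "another example"]
labels = [0, 1, 0]

# ngram_range=(1, 2) extracts unigrams and bigrams; the vectorizer
# outputs a sparse matrix, so memory stays manageable.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(docs, labels)
print(model.predict(["a new document to classify"]))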
Upvotes: 0
Reputation: 386
There's no point in concatenating the documents in each group before extracting the n-grams - the only new n-grams produced that way would straddle document boundaries and so would not exist in any individual document.
As you rightly note, whatever tokenization method you use will result in a large, sparse matrix. This isn't necessarily a problem - the library you intend to use for classification probably comes with an efficient representation for storing sparse matrices, and will usually compute the tf-idf matrix for you as well.
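For instance, a quick sketch (with an illustrative corpus) showing that scikit-learn already hands you a compressed sparse matrix rather than a dense array:

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some example text", "more example text", "yet another document"]

# fit_transform returns a scipy sparse matrix; only non-zero entries are stored.
X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)

print(sparse.issparse(X))  # True
print(X.shape, X.nnz)      # nnz counts only the stored non-zero tf-idf values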
You might also want to use only a subset of your n-grams as features, selecting relevant n-grams using some combination of n-gram frequency and n-gram length (the number of 'grams' in a given n-gram).
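One way to do that pruning is with TfidfVectorizer's own parameters (the thresholds below are arbitrary placeholders, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["example document one", "example document two", "a third document"]

# ngram_range caps n-gram length at bigrams; min_df/max_df drop n-grams that
# are too rare or too common across documents; max_features caps the vocabulary.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.9,
    max_features=50000,
)
X = vectorizer.fit_transform(docs)
print(len(vectorizer.get_feature_names_out()))  # number of n-grams kept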
Alternatively, you could use a primitive form of Latent Semantic Analysis: compute the tf-idf matrix, then reduce the number of features with Principal Component Analysis (or a truncated Singular Value Decomposition if the number of n-grams and documents is so large that computing their covariance matrix would be space-prohibitive).
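A minimal sketch of the SVD variant, using scikit-learn's TruncatedSVD, which performs LSA directly on a sparse tf-idf matrix (the corpus and component count here are placeholders):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "first example document",
    "second example document",
    "a different text",
    "yet more text here",
]

# Sparse tf-idf matrix over unigrams and bigrams.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Truncated SVD never forms the covariance matrix, so it handles sparse
# input; n_components must be smaller than the number of n-gram features.
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (n_documents, 2)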
Upvotes: 2