Reputation: 25
I am trying to perform topic modelling and sentimental analysis on text data over SparkNLP. I have done all the pre-processing steps on the dataset but getting an error in LDA.
Program is:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.clustering import LDA
from pyspark.sql.functions import col, lit, concat, regexp_replace
from pyspark.sql.utils import AnalysisException
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
dataframe_new = spark.read.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true') \
.load('/home/[email protected]/Gourav/chap3/abcnews-date-text.csv')
get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)
remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)
counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)
idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)
lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)
Schema of DataFrame after the IDF :
Upvotes: 2
Views: 2157
Reputation: 60319
According to the documentation, LDA includes a featuresCol
argument, with default value featuresCol='features'
, i.e. the name of the column that holds the actual features; according to your shown schema, such a column is not present in your dataframe, hence the expected error.
It is not exactly clear which column contains the features in your dataframe - get_features
or get_idf_feature
(they look identical in the sample you show); assuming it is get_idf_feature
, you should change the LDA call to:
lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
Spark (including pyspark) ML API has a quite distinct and different logic than, say, scikit-learn and similar frameworks; one of the differences is indeed that the features have to be all in a single column of the respective dataframe. For a general demonstration of the idea, see own answer in KMeans clustering in PySpark (it is about K-Means, but the logic is identical).
Upvotes: 3