Reputation: 3082
I am trying to create a Spark ML KMeans model with the code below, passing in a DataFrame to get the clusters:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql.functions import col

def pre_process_data_for_kmean(dataframe):
    train_data = dataframe.select(
        col("custid"),
        col("amount").cast("double").alias("amnt"),
        col("trantype"),
        col("trantime").cast("double").alias("date_time"))
    cat1Indexer = StringIndexer(inputCol="custid", outputCol="indexedCat1", handleInvalid="skip")
    cat2Indexer = StringIndexer(inputCol="trantype", outputCol="indexedCat2", handleInvalid="skip")
    cat1Encoder = OneHotEncoder(inputCol="indexedCat1", outputCol="CatVector1")
    cat2Encoder = OneHotEncoder(inputCol="indexedCat2", outputCol="CatVector2")
    cat3Encoder = OneHotEncoder(inputCol="date_time", outputCol="CatVector3")
    fAssembler = VectorAssembler(
        inputCols=["CatVector1", "CatVector2", "CatVector3", "amnt"],
        outputCol="C5")
    cluster_model = KMeans(k=10, seed=1, featuresCol="C5")
    cluster_pipeline = Pipeline(stages=[cat1Indexer, cat1Encoder, cat2Indexer, cat2Encoder, cat3Encoder, fAssembler])
    cluster_model = cluster_pipeline.fit(train_data)
    return cluster_model
I am passing the DataFrame as follows:
train_df = (raw_train_df
    .select(
        col("dSc").alias("custid"),
        col("TranAmount").alias("amount"),
        col("TranDescription").alias("trantype"),
        func.dayofmonth(col("BusinessDate")).alias("trantime"))
    .na.fill({'trantype': 'new_tran_type', 'custid': '-99999', 'amount': 0, 'trantime': 1})
    .dropna())
cluster_model = pre_process_data_for_kmean(train_df)
I understand that OneHotEncoder does not accept empty strings, and I have already taken measures to counter that, as you can see, but I am still facing this error.
Please assist.
Upvotes: 2
Views: 2728
Reputation: 330073
An empty string is literally an empty string, not NULL. Neither na.fill nor dropna will help. You can use na.replace, but as far as I know it has no column-wise equivalent, so you'll have to call it for each column:
replacements = {
    'some_col': 'some_replacement', 'another_col': 'another_replacement',
    'numeric_column_wont_be_replaced': 1.0
}
for k, v in replacements.items():
    # We can replace string only if target is string
    # In Python 2 str -> basestring
    if isinstance(v, str):
        df = df.na.replace("", v, [k])
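If it helps, the same loop can be wrapped in a small helper so it can be applied right after the na.fill call from the question. This is only a sketch: the function name replace_empty_strings is made up for illustration, and it assumes the object passed in exposes na.replace(to_replace, value, subset) the way a PySpark DataFrame does.

```python
def replace_empty_strings(df, replacements):
    """Replace "" with a per-column value, one na.replace call per column.

    `replace_empty_strings` is a hypothetical helper name, not a Spark API.
    `df` is assumed to behave like a PySpark DataFrame, i.e. it exposes
    `df.na.replace(to_replace, value, subset)`.
    """
    for column, value in replacements.items():
        # Only string replacements are applied; a non-string value
        # (e.g. 1.0 for a numeric column) is skipped.
        if isinstance(value, str):
            df = df.na.replace("", value, [column])
    return df
```

With the columns from the question, a call might look like train_df = replace_empty_strings(train_df, {'trantype': 'new_tran_type', 'custid': '-99999'}), assuming both columns are string-typed.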
Upvotes: 3