Reputation: 2299
I am trying to use KMeans on geospatial data stored in MongoDB database using Apache Spark. The data has following format,
DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]
The code is as follows, where inputdf
is the dataframe.
vecAssembler = VectorAssembler(
inputCols=["decimalLatitude", "decimalLongitude"],
outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k = 10, seed = 123)
model = kmeans.fit(inputdf.select("features"))
There seems to be some empty strings in the dataset, as I get following error,
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
I tried to find such rows using,
issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()
But I get the same type conversion error as above. I also tried df.replace
, but I got the same error. How do I remove all rows where such value is present?
Upvotes: 0
Views: 692
Reputation: 2299
This issue can be solved by providing data types when loading the data as follows,
inputdf = my_spark.read.format("mongo").load(schema=StructType(
[StructField("decimalLatitude", DoubleType(), True),
StructField("decimalLongitude", DoubleType(), True)]))
This ensures that all values are of DoubleType
. Now empty values can be removed using inputdf.dropna()
Upvotes: 1