Registered User

Reputation: 2299

Remove rows where value is string in pyspark dataframe

I am trying to use KMeans on geospatial data stored in a MongoDB database using Apache Spark. The data has the following format:

DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]

The code is as follows, where inputdf is the DataFrame.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

vecAssembler = VectorAssembler(
                inputCols=["decimalLatitude", "decimalLongitude"],
                outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))

There seem to be some empty strings in the dataset, as I get the following error:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I tried to find such rows using:

issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()

But I get the same type conversion error as above. I also tried df.replace, with the same result. How do I remove all rows where such a value is present?

Upvotes: 0

Views: 692

Answers (1)

Registered User

Reputation: 2299

This issue can be solved by supplying an explicit schema when loading the data, so the connector casts values instead of inferring their types:

from pyspark.sql.types import StructType, StructField, DoubleType

inputdf = my_spark.read.format("mongo").load(schema=StructType(
    [StructField("decimalLatitude", DoubleType(), True),
     StructField("decimalLongitude", DoubleType(), True)]))

This ensures that all values are loaded as DoubleType; values that cannot be cast, such as empty strings, become nulls. The null rows can then be removed with inputdf.dropna().

Upvotes: 1
