Stefano Sorgente

Reputation: 93

PySpark reads data as strings, but in Mongo they are doubles

I have a collection named "Vendite", with several fields, in my MongoDB database. To keep things simple, I'll take one field (named "ven_lordo") as an example, but I have this problem with many of them.

If I run the following command in mongosh, I get "double" as the fieldType for every document:

db.Vendite.aggregate([{ "$project": { "fieldType": {  "$type": "$ven_lordo"  }}}])
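As a sanity check, I also counted which BSON types actually occur for the field, since a single non-double value among the documents could plausibly make the connector infer a string. A minimal pymongo sketch (it assumes pymongo is installed and reuses the same uri and db as in the Spark code below):

# Count how many documents store "ven_lordo" under each BSON type.
# uri and db are placeholders matching the Spark options used later.
from pymongo import MongoClient

client = MongoClient(uri)
pipeline = [
    {"$group": {"_id": {"$type": "$ven_lordo"}, "count": {"$sum": 1}}}
]
for doc in client[db]["Vendite"].aggregate(pipeline):
    print(doc)  # e.g. {'_id': 'double', 'count': 12345}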

When I read the data from MongoDB using PySpark, I use the following code:

df = sparkSession.read.format("mongodb")\
        .option("spark.mongodb.read.database", db)\
        .option("spark.mongodb.read.collection", collection)\
        .option("spark.mongodb.read.connection.uri", uri)\
        .load()
print(df.schema["ven_lordo"].dataType)  # before creating the temp view
df.createOrReplaceTempView("df")
print(df.schema["ven_lordo"].dataType)  # after creating the temp view

Both prints show "StringType" (before and after creating the temp view).

How can I read the data using the same types they have in Mongo?
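In the meantime I can work around it by casting after load, or by passing an explicit schema so nothing is inferred. A minimal sketch (the StructType below covers only the example field and is my assumption about the real layout):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, DoubleType

# Option 1: cast the already-loaded string column to double
df_cast = df.withColumn("ven_lordo", col("ven_lordo").cast(DoubleType()))

# Option 2: skip inference entirely by supplying a schema on read
schema = StructType([StructField("ven_lordo", DoubleType(), True)])
df2 = sparkSession.read.format("mongodb")\
        .option("spark.mongodb.read.database", db)\
        .option("spark.mongodb.read.collection", collection)\
        .option("spark.mongodb.read.connection.uri", uri)\
        .schema(schema)\
        .load()

But I'd prefer the read to pick up the correct types on its own.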

One more note: if I sum two columns that have this problem, .show() displays a valid result, i.e. a number (I don't know whether it's a string or a double), and the value is correct too. Is it normal that the sum works on two strings?
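A minimal, Mongo-free sketch of what I mean:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Two numeric values stored as strings, like my columns
demo = spark.createDataFrame([("1.5", "2.5")], ["a", "b"])
result = demo.select((col("a") + col("b")).alias("total"))
result.printSchema()  # total: double -- the "+" operator implicitly casts
result.show()         # prints 4.0, not a concatenated string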

Upvotes: 0

Views: 26

Answers (0)
