Stefano Sorgente

Reputation: 93

PySpark reads data as strings, but in Mongo they are doubles

I have a collection named "Vendite", with several fields, in my MongoDB database. To keep things simple, I'll take one field (named "ven_lordo") as an example, but I have this problem with many of them.

If I run the following command in mongosh, I get "double" as the fieldType for every document:

db.Vendite.aggregate([{ "$project": { "fieldType": {  "$type": "$ven_lordo"  }}}])
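As a sanity check, I also counted which BSON types actually occur for the field, since a single non-double value among the documents could plausibly make the connector infer a string. A minimal pymongo sketch (it assumes pymongo is installed and reuses the same uri and db as in the Spark code below):

# Count how many documents store "ven_lordo" under each BSON type.
# uri and db are placeholders matching the Spark options used later.
from pymongo import MongoClient

client = MongoClient(uri)
pipeline = [
    {"$group": {"_id": {"$type": "$ven_lordo"}, "count": {"$sum": 1}}}
]
for doc in client[db]["Vendite"].aggregate(pipeline):
    print(doc)  # e.g. {'_id': 'double', 'count': 12345}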

When I read the data from MongoDB using PySpark, I use the following code:

df = sparkSession.read.format("mongodb")\
        .option("spark.mongodb.read.database", db)\
        .option("spark.mongodb.read.collection", collection)\
        .option("spark.mongodb.read.connection.uri", uri)\
        .load()
print(df.schema["ven_lordo"].dataType)  # before creating the temp view
df.createOrReplaceTempView("df")
print(df.schema["ven_lordo"].dataType)  # after creating the temp view

Both prints show "StringType" (before and after creating the temp view).

How can I read the data using the same types they have in Mongo?
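In the meantime I can work around it by casting after load, or by passing an explicit schema so nothing is inferred. A minimal sketch (the StructType below covers only the example field and is my assumption about the real layout):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, DoubleType

# Option 1: cast the already-loaded string column to double
df_cast = df.withColumn("ven_lordo", col("ven_lordo").cast(DoubleType()))

# Option 2: skip inference entirely by supplying a schema on read
schema = StructType([StructField("ven_lordo", DoubleType(), True)])
df2 = sparkSession.read.format("mongodb")\
        .option("spark.mongodb.read.database", db)\
        .option("spark.mongodb.read.collection", collection)\
        .option("spark.mongodb.read.connection.uri", uri)\
        .schema(schema)\
        .load()

But I'd prefer the read to pick up the correct types on its own.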

One more note: if I sum two columns that have this problem, .show() displays a valid result, i.e. a number (I don't know whether it's a string or a double), and the value is correct too. Is it normal that the sum works on two strings?
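A minimal, Mongo-free sketch of what I mean:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Two numeric values stored as strings, like my columns
demo = spark.createDataFrame([("1.5", "2.5")], ["a", "b"])
result = demo.select((col("a") + col("b")).alias("total"))
result.printSchema()  # total: double -- the "+" operator implicitly casts
result.show()         # prints 4.0, not a concatenated string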

Upvotes: 0

Views: 26

Answers (0)
