user8676770

Importing data with Pyspark : Wrong datatype

I have a problem with PySpark: when I import my dataset with PySpark, all of my columns are read as strings, even the ones that are numeric.

I don't have this problem when I import the data with Pandas.

I'm using the Dataiku platform for development. The data are already on the platform, and I import them with this code:

import dataiku
import dataiku.spark as dkuspark

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("Extracts___Retail_Master_Data___Product_Hierarchy_HDFS")
# And read it as a Spark DataFrame
df = dkuspark.get_dataframe(sqlContext, mydataset)

I can't find a way to import my data with the correct types.

Thanks.

Upvotes: 0

Views: 274

Answers (1)

andreybavt
andreybavt

Reputation: 1321

In Dataiku there are two concepts: a storage type and a meaning. When you explore your dataset you'll see both below each column name (the type in grey, the meaning in blue).


A meaning is the type that Dataiku thinks best fits the values stored in that column.

In your case, go to the Extracts___Retail_Master_Data___Product_Hierarchy_HDFS dataset's settings -> Schema -> set the correct column types -> Save.

For more details, see the documentation page on schemas:

https://doc.dataiku.com/dss/latest/schemas/index.html

Upvotes: 1
