Reputation:
I have a problem with PySpark: when I import my dataset with PySpark, all my columns are treated as strings, even though some of them are numeric.
I don't have this problem when I import the data with Pandas.
I'm using the Dataiku platform for development. The data are already on the platform, and I import them with this code:
# Example: read the descriptor of a Dataiku dataset
# (imports added for completeness)
import dataiku
import dataiku.spark as dkuspark

mydataset = dataiku.Dataset("Extracts___Retail_Master_Data___Product_Hierarchy_HDFS")

# And read it as a Spark DataFrame
df = dkuspark.get_dataframe(sqlContext, mydataset)
I can't find a way to import my data into the correct format.
Thanks.
Upvotes: 0
Views: 274
Reputation: 1321
In Dataiku there are two concepts: a storage type and a meaning. When you explore your dataset you'll see both below each column name (the type in grey, the meaning in blue).
A meaning is the type Dataiku thinks best fits what is stored in that column.
In your case, go to your Extracts___Retail_Master_Data___Product_Hierarchy_HDFS dataset settings -> Schema -> set the correct column types -> Save.
If you want more details, there's a doc page:
https://doc.dataiku.com/dss/latest/schemas/index.html
Upvotes: 1