Reputation: 44
I am using PySpark to read a relatively large CSV file (~10GB):
ddf = spark.read.csv('directory/my_file.csv')
All the columns have the datatype string. After changing the datatype of, for example, column_a, I can see that the datatype has changed to integer. If I write the ddf to a parquet file and read the parquet file back, I notice that all columns have the datatype string again.

Question: How can I make sure the parquet file contains the correct datatypes, so that I do not have to change the datatypes again when reading the parquet file?
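For reference, the datatype change is done with a cast along these lines (a minimal sketch; casting column_a to integer is just the example from above):

from pyspark.sql import functions as F

ddf = ddf.withColumn('column_a', F.col('column_a').cast('int'))
ddf.printSchema()  # column_a now shows up as integer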
Notes:
I write the ddf to a parquet file as follows:
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')
I use version 2.0.0.2.
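This is roughly how I read the parquet file back and check the schema (a sketch; ddf2 is just a name for the re-read DataFrame, and printSchema is only there to verify the types):

ddf2 = spark.read.parquet('directory/my_parquet_file')
ddf2.printSchema()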
Upvotes: 0
Views: 229
Reputation: 189
I read my large files with pandas and do not have this problem. Try using pandas: http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html
In[1]: import pandas as pd
In[2]: df = pd.read_csv('directory/my_file.csv')
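If the datatypes also need to be preserved, a rough sketch (continuing the session above) would be to pass dtype to read_csv and then write parquet from pandas; to_parquet needs pyarrow or fastparquet installed, and column_a is just the example column from the question:

In[3]: df = pd.read_csv('directory/my_file.csv', dtype={'column_a': 'int64'})
In[4]: df.to_parquet('directory/my_parquet_file.parquet')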
Upvotes: 0