Reputation: 1973
I have a sample Spark dataframe that I create from a pandas dataframe -
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from pyspark.sql.types import *
import pandas as pd
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# create a pandas dataframe first and then create the spark dataframe from it
pdf = pd.DataFrame([[1, "hello world. lets shine and spread happiness"], [2, "not so sure"],
                    [2, "cool i like it"], [2, "cool i like it"], [2, "cool i like it"]],
                   columns=['input1', 'input2'])
df = spark.createDataFrame(pdf) # this is spark df
Now, I have the data types as:
df.printSchema()
root
|-- input1: long (nullable = true)
|-- input2: string (nullable = true)
If I convert this Spark dataframe back to pandas using -
pandas_df = df.toPandas()
and then print the data types, I get back object type for the second column instead of string type.
pandas_df.dtypes
input1 int64
input2 object
dtype: object
How do I correctly convert this Spark string type to a string type in pandas?
Upvotes: 0
Views: 1869
Reputation: 21709
Note that object is pandas' default dtype for string data, so this is expected. To convert to pandas' dedicated string dtype, you can use StringDtype:
pandas_df["input_2"] = pandas_df["input_2"].astype(pd.StringDtype())
Upvotes: 1