Regressor

Reputation: 1973

How to convert String type column in spark dataframe to String type column in Pandas dataframe

I have a sample Spark dataframe that I create from a pandas dataframe:

from pyspark.sql import SparkSession

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from pyspark.sql.types import *

import pandas as pd

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# create a pandas dataframe first, then create a Spark dataframe from it
pdf = pd.DataFrame([[1,"hello world. lets shine and spread happiness"],[2,"not so sure"],[2,"cool i like it"],[2,"cool i like it"],[2,"cool i like it"]]
                   , columns = ['input1','input2'])
df = spark.createDataFrame(pdf) # this is spark df

Now, the data types are:

df.printSchema()

root
 |-- input1: long (nullable = true)
 |-- input2: string (nullable = true)

If I convert this Spark dataframe back to pandas using:

pandas_df = df.toPandas() 

and then print the data types, I get back `object` for the second column instead of a string type:

pandas_df.dtypes
input1     int64
input2    object
dtype: object

How do I correctly convert this Spark string column to a string-typed column in pandas?

Upvotes: 0

Views: 1869

Answers (1)

YOLO

Reputation: 21709

To convert to the dedicated string dtype (available in pandas >= 1.0), you can use StringDtype:

pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())
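As a minimal pandas-only sketch (no Spark session needed): `toPandas()` hands back Python strings in an `object` column because that is pandas' default container for text, so `pandas_df` below is a stand-in for the frame the question's `df.toPandas()` would return:

```python
import pandas as pd

# Stand-in for df.toPandas(): string data lands in an "object" column by default.
pandas_df = pd.DataFrame(
    {"input1": [1, 2, 2], "input2": ["hello world", "not so sure", "cool i like it"]}
)
assert pandas_df["input2"].dtype == object

# Opt in to the dedicated string dtype (pandas >= 1.0).
pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())
print(pandas_df.dtypes)
# input1     int64
# input2    string
```

`astype("string")` is an equivalent shorthand for `astype(pd.StringDtype())`.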

Upvotes: 1
