rrryok
rrryok

Reputation: 65

How to change data type in pyspark dataframe automatically

I have data from csv file, and use it in jupyter notebook with pysaprk. I have many columns and all of them have string data type. I know how to change data type manually, but there is any way to do it automatically?

Upvotes: 0

Views: 1173

Answers (2)

fskj
fskj

Reputation: 964

You can use the inferSchema option when you load your csv file, to let spark try to infer the schema. With the following example csv file, you can get two different schemas depending on whether you set inferSchema to true or not:

seq,date
1,13/10/1942
2,12/02/2013
3,01/02/1959
4,06/04/1939
5,23/10/2053
6,13/03/2059
7,10/12/1983
8,28/10/1952
9,07/04/2033
10,29/11/2035

Example code:

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "false") # default option
      .load(path))
df.printSchema()
df2 = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(path))
df2.printSchema()

Output:

root
 |-- seq: string (nullable = true)
 |-- date: string (nullable = true)

root
 |-- seq: integer (nullable = true)
 |-- date: string (nullable = true)

Upvotes: 1

Luiz Viola
Luiz Viola

Reputation: 2436

You would need to define the schema before reading the file:

from pyspark.sql import functions as F
from pyspark.sql.types import *

data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True) \
  ])
  
df = spark.createDataFrame(data=data2,schema=schema)
df.show()
df.printSchema()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

Upvotes: 1

Related Questions