ishwar
ishwar

Reputation: 298

How to clean the data from CSV file

The Sample name.csv data :

Name, ,Age, ,Class,
Diwakar,, ,25,, ,12,
 , , , , ,
Prabhat, ,27, ,15,
Zyan, ,30, ,17,
Jack, ,35, ,21,

reading the csv file:

names = spark.read.csv("name.csv", header="true", inferSchema="true")
names.show()

getting this as a output and we are loosing some data:

+-------+----+---+---+-----+----+
|   Name|   1|Age|  3|Class| _c5|
+-------+----+---+---+-----+----+
|Diwakar|null|   | 25| null|    |
|       |    |   |   |     |null|
|Prabhat|    | 27|   |   15|null|
|   Zyan|    | 30|   |   17|null|
|   Jack|    | 35|   |   21|null|
+-------+----+---+---+-----+----+

I want to have an output like given below:

+-------+---+---+---+-----+----+
|   Name|  1|Age|  3|Class| _c5|
+-------+---+---+---+-----+----+
|Diwakar|   | 25|   |   12|null|
|       |   |   |   |     |null|
|Prabhat|   | 27|   |   15|null|
|   Zyan|   | 30|   |   17|null|
|   Jack|   | 35|   |   21|null|
+-------+---+---+---+-----+----+

Upvotes: 1

Views: 360

Answers (1)

notNull
notNull

Reputation: 31530

We can read all the fields by defining schema and then use the schema while reading CSV file then use When Otherwise we can get the data for Age,Class columns.

Example:

from pyspark.sql.functions import *
from pyspark.sql.types import *

#define schema with same number of columns in csv file
sch=StructType([
    StructField("Name", StringType(), True),
    StructField("1", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("3", StringType(), True),
    StructField("Class", StringType(), True),
    StructField("_c5", StringType(), True),
    StructField("_c6", StringType(), True)
])

#reading csv file with schema
df=spark.read.schema(sch).option("header",True).csv("name.csv")

df.withColumn('Age', when(length(trim(col('Age'))) == 0, col('3')).otherwise(col('Age'))).\
withColumn('1',lit("")).\
withColumn('3',lit("")).\
withColumn('Class',when((col('Class').isNull())|(lower(col('Class')) == 'null'), col('_c6')).when(length(trim(col('Class'))) == 0, lit("null")).otherwise(col('Class'))).\
withColumn('_c5',lit("null")).\
drop("_c6").\
show()

#+-------+---+---+---+-----+----+
#|   Name|  1|Age|  3|Class| _c5|
#+-------+---+---+---+-----+----+
#|Diwakar|   | 25|   |   12|null|
#|       |   |   |   | null|null|
#|Prabhat|   | 27|   |   15|null|
#|   Zyan|   | 30|   |   17|null|
#|   Jack|   | 35|   |   21|null|
#+-------+---+---+---+-----+----+

Upvotes: 1

Related Questions