Jeevan

Reputation: 740

Spark DataFrame loads all nulls from a CSV file

I have a file with the following data:

####$ cat products.csv 
1,tv,sony,hd,699
2,tv,sony,uhd,799
3,tv,samsung,hd,599
4,tv,samsung,uhd,799
5,phone,iphone,x,999
6,phone,iphone,11,999
7,phone,samsung,10,899
8,phone,samsung,10note,999
9,phone,pixel,4,799
10,phone,pixel,3,699

I'm trying to load this into a Spark DataFrame. It gives me no errors, but it loads all nulls.

scala> val productSchema = StructType((Array(StructField("productId",IntegerType,true),StructField("productType",IntegerType,true),StructField("company",IntegerType,true),StructField("model",IntegerType,true),StructField("price",IntegerType,true))))
productSchema: org.apache.spark.sql.types.StructType = StructType(StructField(productId,IntegerType,true), StructField(productType,IntegerType,true), StructField(company,IntegerType,true), StructField(model,IntegerType,true), StructField(price,IntegerType,true))

scala> val df = spark.read.format("csv").option("header", "false").schema(productSchema).load("/path/products_js/products.csv")
df: org.apache.spark.sql.DataFrame = [productId: int, productType: int ... 3 more fields]

scala> df.show
+---------+-----------+-------+-----+-----+
|productId|productType|company|model|price|
+---------+-----------+-------+-----+-----+
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
+---------+-----------+-------+-----+-----+

Then I tried a different way to load the data, and it worked:

scala> val temp = spark.read.csv("/path/products_js/products.csv")
temp: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]

scala> temp.show
+---+-----+-------+------+---+
|_c0|  _c1|    _c2|   _c3|_c4|
+---+-----+-------+------+---+
|  1|   tv|   sony|    hd|699|
|  2|   tv|   sony|   uhd|799|
|  3|   tv|samsung|    hd|599|
|  4|   tv|samsung|   uhd|799|
|  5|phone| iphone|     x|999|
|  6|phone| iphone|    11|999|
|  7|phone|samsung|    10|899|
|  8|phone|samsung|10note|999|
|  9|phone|  pixel|     4|799|
| 10|phone|  pixel|     3|699|
+---+-----+-------+------+---+

The second approach loads the data, but I cannot attach a schema to the DataFrame. What is the difference between the two methods of loading data, and why does the first approach load nulls? Can anyone help me?

Upvotes: 3

Views: 2707

Answers (1)

Lamanus

Reputation: 13541

The first problem is that you declared the string columns (`productType`, `company`, `model`) as `IntegerType`. In Spark's default `PERMISSIVE` parse mode, any CSV value that cannot be converted to the declared type becomes null instead of raising an error, which is why every column came back null. This works:

import org.apache.spark.sql.types.{StructType, IntegerType, StringType}

val productSchema = new StructType()
                        .add("productId", "int")
                        .add("productType", "string")
                        .add("company", "string")
                        .add("model", "string")
                        .add("price", "int")

val df = spark.read.format("csv")
            .option("header", "false")
            .schema(productSchema)
            .load("test.csv")

df.show()

The result is:

+---------+-----------+-------+------+-----+
|productId|productType|company| model|price|
+---------+-----------+-------+------+-----+
|        1|         tv|   sony|    hd|  699|
|        2|         tv|   sony|   uhd|  799|
|        3|         tv|samsung|    hd|  599|
|        4|         tv|samsung|   uhd|  799|
|        5|      phone| iphone|     x|  999|
|        6|      phone| iphone|    11|  999|
|        7|      phone|samsung|    10|  899|
|        8|      phone|samsung|10note|  999|
|        9|      phone|  pixel|     4|  799|
|       10|      phone|  pixel|     3|  699|
+---------+-----------+-------+------+-----+
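If you would rather be told about such type mismatches than get silent nulls, you can switch the reader's parse mode to `FAILFAST`, which throws on the first malformed row. A sketch (the schema can also be given as a DDL string; the file path is the one from the question):

```scala
// Same schema as above, expressed as a DDL string for brevity.
val ddlSchema = "productId INT, productType STRING, company STRING, model STRING, price INT"

// FAILFAST makes Spark throw as soon as a value does not match the
// declared type, instead of silently replacing it with null
// (the default PERMISSIVE behaviour that produced the all-null frame).
val strict = spark.read
  .schema(ddlSchema)
  .option("header", "false")
  .option("mode", "FAILFAST")
  .csv("/path/products_js/products.csv")

strict.show()
```

With the original all-`IntegerType` schema, this same read would fail loudly on the first row instead of returning ten rows of nulls, making the mistake obvious.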

Upvotes: 4
