Jeevan

Reputation: 740

Spark DataFrame loads all nulls from a CSV file

I have a file with the following data:

####$ cat products.csv 
1,tv,sony,hd,699
2,tv,sony,uhd,799
3,tv,samsung,hd,599
4,tv,samsung,uhd,799
5,phone,iphone,x,999
6,phone,iphone,11,999
7,phone,samsung,10,899
8,phone,samsung,10note,999
9,phone,pixel,4,799
10,phone,pixel,3,699

I'm trying to load this into a Spark DataFrame. It gives me no errors, but it loads all nulls.

scala> val productSchema = StructType((Array(StructField("productId",IntegerType,true),StructField("productType",IntegerType,true),StructField("company",IntegerType,true),StructField("model",IntegerType,true),StructField("price",IntegerType,true))))
productSchema: org.apache.spark.sql.types.StructType = StructType(StructField(productId,IntegerType,true), StructField(productType,IntegerType,true), StructField(company,IntegerType,true), StructField(model,IntegerType,true), StructField(price,IntegerType,true))

scala> val df = spark.read.format("csv").option("header", "false").schema(productSchema).load("/path/products_js/products.csv")
df: org.apache.spark.sql.DataFrame = [productId: int, productType: int ... 3 more fields]

scala> df.show
+---------+-----------+-------+-----+-----+
|productId|productType|company|model|price|
+---------+-----------+-------+-----+-----+
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
|     null|       null|   null| null| null|
+---------+-----------+-------+-----+-----+

Then I tried a different way to load the data, and it worked:

scala> val temp = spark.read.csv("/path/products_js/products.csv")
temp: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]

scala> temp.show
+---+-----+-------+------+---+
|_c0|  _c1|    _c2|   _c3|_c4|
+---+-----+-------+------+---+
|  1|   tv|   sony|    hd|699|
|  2|   tv|   sony|   uhd|799|
|  3|   tv|samsung|    hd|599|
|  4|   tv|samsung|   uhd|799|
|  5|phone| iphone|     x|999|
|  6|phone| iphone|    11|999|
|  7|phone|samsung|    10|899|
|  8|phone|samsung|10note|999|
|  9|phone|  pixel|     4|799|
| 10|phone|  pixel|     3|699|
+---+-----+-------+------+---+

The second approach loads the data, but I cannot attach a schema to the DataFrame. What is the difference between the two methods of loading data, and why does the first approach load nulls? Can anyone help me?

Upvotes: 3

Views: 2707

Answers (1)

Lamanus

Reputation: 13541

The first problem is that you declared the string columns (`productType`, `company`, `model`) as `IntegerType`. In Spark's default `PERMISSIVE` parse mode, any CSV value that cannot be converted to the declared type becomes null instead of raising an error, which is why every column came back null. This works:

import org.apache.spark.sql.types.{StructType, IntegerType, StringType}

val productSchema = new StructType()
                        .add("productId", "int")
                        .add("productType", "string")
                        .add("company", "string")
                        .add("model", "string")
                        .add("price", "int")

val df = spark.read.format("csv")
            .option("header", "false")
            .schema(productSchema)
            .load("test.csv")

df.show()

The result is:

+---------+-----------+-------+------+-----+
|productId|productType|company| model|price|
+---------+-----------+-------+------+-----+
|        1|         tv|   sony|    hd|  699|
|        2|         tv|   sony|   uhd|  799|
|        3|         tv|samsung|    hd|  599|
|        4|         tv|samsung|   uhd|  799|
|        5|      phone| iphone|     x|  999|
|        6|      phone| iphone|    11|  999|
|        7|      phone|samsung|    10|  899|
|        8|      phone|samsung|10note|  999|
|        9|      phone|  pixel|     4|  799|
|       10|      phone|  pixel|     3|  699|
+---------+-----------+-------+------+-----+
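If you would rather be told about such type mismatches than get silent nulls, you can switch the reader's parse mode to `FAILFAST`, which throws on the first malformed row. A sketch (the schema can also be given as a DDL string; the file path is the one from the question):

```scala
// Same schema as above, expressed as a DDL string for brevity.
val ddlSchema = "productId INT, productType STRING, company STRING, model STRING, price INT"

// FAILFAST makes Spark throw as soon as a value does not match the
// declared type, instead of silently replacing it with null
// (the default PERMISSIVE behaviour that produced the all-null frame).
val strict = spark.read
  .schema(ddlSchema)
  .option("header", "false")
  .option("mode", "FAILFAST")
  .csv("/path/products_js/products.csv")

strict.show()
```

With the original all-`IntegerType` schema, this same read would fail loudly on the first row instead of returning ten rows of nulls, making the mistake obvious.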

Upvotes: 4
