Reputation: 740
I have a file with following data
$ cat products.csv
1,tv,sony,hd,699
2,tv,sony,uhd,799
3,tv,samsung,hd,599
4,tv,samsung,uhd,799
5,phone,iphone,x,999
6,phone,iphone,11,999
7,phone,samsung,10,899
8,phone,samsung,10note,999
9,phone,pixel,4,799
10,phone,pixel,3,699
I'm trying to load this into a Spark DataFrame. It gives me no errors, but it loads all nulls.
scala> val productSchema = StructType((Array(StructField("productId",IntegerType,true),StructField("productType",IntegerType,true),StructField("company",IntegerType,true),StructField("model",IntegerType,true),StructField("price",IntegerType,true))))
productSchema: org.apache.spark.sql.types.StructType = StructType(StructField(productId,IntegerType,true), StructField(productType,IntegerType,true), StructField(company,IntegerType,true), StructField(model,IntegerType,true), StructField(price,IntegerType,true))
scala> val df = spark.read.format("csv").option("header", "false").schema(productSchema).load("/path/products_js/products.csv")
df: org.apache.spark.sql.DataFrame = [productId: int, productType: int ... 3 more fields]
scala> df.show
+---------+-----------+-------+-----+-----+
|productId|productType|company|model|price|
+---------+-----------+-------+-----+-----+
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
+---------+-----------+-------+-----+-----+
Then I tried a different way to load the data, and it worked:
scala> val temp = spark.read.csv("/path/products_js/products.csv")
temp: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]
scala> temp.show
+---+-----+-------+------+---+
|_c0| _c1| _c2| _c3|_c4|
+---+-----+-------+------+---+
| 1| tv| sony| hd|699|
| 2| tv| sony| uhd|799|
| 3| tv|samsung| hd|599|
| 4| tv|samsung| uhd|799|
| 5|phone| iphone| x|999|
| 6|phone| iphone| 11|999|
| 7|phone|samsung| 10|899|
| 8|phone|samsung|10note|999|
| 9|phone| pixel| 4|799|
| 10|phone| pixel| 3|699|
+---+-----+-------+------+---+
The second approach loads the data, but I cannot attach my schema to the DataFrame. What is the difference between the two methods of loading data, and why does the first approach load nulls? Can anyone help me?
Upvotes: 3
Views: 2707
Reputation: 13541
You defined the string columns (`productType`, `company`, `model`) as `IntegerType`, which is wrong. When Spark's CSV reader cannot cast a value like `tv` to an integer, the default `PERMISSIVE` mode treats the record as malformed and returns null for the whole row, which is why every column came back null. With the correct types it works:
import org.apache.spark.sql.types.StructType
val productSchema = new StructType()
.add("productId", "int")
.add("productType", "string")
.add("company", "string")
.add("model", "string")
.add("price", "int")
val df = spark.read.format("csv")
.option("header", "false")
.schema(productSchema)
.load("test.csv")
df.show()
the result is
+---------+-----------+-------+------+-----+
|productId|productType|company| model|price|
+---------+-----------+-------+------+-----+
| 1| tv| sony| hd| 699|
| 2| tv| sony| uhd| 799|
| 3| tv|samsung| hd| 599|
| 4| tv|samsung| uhd| 799|
| 5| phone| iphone| x| 999|
| 6| phone| iphone| 11| 999|
| 7| phone|samsung| 10| 899|
| 8| phone|samsung|10note| 999|
| 9| phone| pixel| 4| 799|
| 10| phone| pixel| 3| 699|
+---------+-----------+-------+------+-----+
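For reference, the explicit `StructType(Array(StructField(...)))` style from the question also works once the text columns use `StringType`. Adding `mode=FAILFAST` (an extra option, not part of the code above) makes Spark throw on a bad cast instead of silently producing all-null rows, which would have surfaced the original problem immediately. A sketch:

```scala
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// Same schema in the StructField style used in the question,
// with StringType for the non-numeric columns.
val productSchema = StructType(Array(
  StructField("productId", IntegerType, true),
  StructField("productType", StringType, true),
  StructField("company", StringType, true),
  StructField("model", StringType, true),
  StructField("price", IntegerType, true)
))

// FAILFAST raises an exception on any malformed record;
// the default PERMISSIVE mode nulls out rows it cannot parse.
val df = spark.read.format("csv")
  .option("header", "false")
  .option("mode", "FAILFAST")
  .schema(productSchema)
  .load("test.csv")
df.show()
```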
Upvotes: 4