Reputation: 176
Below is the content of my CSV file:
A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2
So, there are 5 columns but only 3 values in the first row.
I read it using the following command:
val csvDF : DataFrame = spark.read
.option("header", "false")
.option("delimiter", ",")
.option("inferSchema", "false")
.csv("file.csv")
And the following is what I get from csvDF.show():
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+
How can I read all the data in all the columns?
Upvotes: 0
Views: 4940
Reputation: 41957
If the column data types and the number of columns are known, you can define a schema and apply it while reading the CSV file as a DataFrame. Below I have defined all five columns as StringType.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// All five columns declared as nullable strings
val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)  // an explicit schema makes inferSchema unnecessary
  .csv("file.csv")
You should get a DataFrame like this:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1 |B1 |C1 |null|null|
|A2 |B2 |C2 |D1 |null|
|A3 |B3 |C3 |D2 |E1 |
|A4 |B4 |C4 |D3 |null|
|A5 |B5 |C5 |null|E2 |
+----+----+----+----+----+
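As a side note, on Spark 2.3 and later the same all-string schema can also be passed as a DDL-formatted string, which is a bit more compact (a minimal sketch, equivalent to the StructType above):

val csvDF = spark.read
  .option("header", "false")
  .schema("col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING")
  .csv("file.csv")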
Upvotes: 0
Reputation: 13154
Basically your CSV file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row yourself.
As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. That means you first need to establish the maximum number of columns in the data set, which requires two passes over the data.
For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:
import org.apache.spark.sql.Row

val data = spark.read.textFile("file.csv").rdd

// First pass: pair each line with its column count; cache because we traverse twice.
// The -1 limit keeps trailing empty fields that split would otherwise drop.
val rdd = data.map(s => (s, s.split(",", -1).length)).cache
val maxColumns = rdd.map(_._2).max()

// Second pass: pad every row with nulls up to maxColumns
val x = rdd
  .map(row => {
    val rowData = row._1.split(",", -1)
    val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
    Row((rowData ++ extraColumns).toList:_*)
  })
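The result x is an RDD[Row]; to get back to a DataFrame you still need a schema. A minimal sketch, assuming all columns are strings and generated names like col1..colN are acceptable:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// One nullable string column per detected position
val schema = StructType((1 to maxColumns).map(i => StructField(s"col$i", StringType, nullable = true)))
val df = spark.createDataFrame(x, schema)
df.show()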
Hope that helps :)
Upvotes: 1
Reputation: 3059
You can read it as a dataset with only one column, for example by using another delimiter:
var df = spark.read.format("csv").option("delimiter",";").load("test.csv")
df.show()
+--------------+
| _c0|
+--------------+
| A1,B1,C1|
| A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
| A4,B4,C4,D3|
| A5,B5,C5,,E2|
+--------------+
Then you can use this answer to manually split your column into five; this will add null values when the element does not exist:
import org.apache.spark.sql.functions.split
import spark.implicits._

// Split the single column on "," and pick out each element by position;
// getItem returns null when the index is out of range
var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A1| B1| C1|null|null|
| A2| B2| C2| D1|null|
| A3| B3| C3| D2| E1|
| A4| B4| C4| D3|null|
| A5| B5| C5| | E2|
+----+----+----+----+----+
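Note that col4 in the last row holds an empty string rather than null, because the field was present but empty in the input (A5,B5,C5,,E2). If you would rather have null there too, matching the other answers, one possible cleanup (a sketch, not part of the original answer) is:

import org.apache.spark.sql.functions.{col, when}

// Replace empty strings with nulls in every column
val cleaned = csvDF.columns.foldLeft(csvDF) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", null).otherwise(col(c)))
}
cleaned.show()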
Upvotes: 0