Reputation: 176
Below is the content of my CSV file:
A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2
So, there are 5 columns but only 3 values in the first row.
I read it using the following command:
val csvDF : DataFrame = spark.read
.option("header", "false")
.option("delimiter", ",")
.option("inferSchema", "false")
.csv("file.csv")
And the following is what I get from csvDF.show():
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+
How can I read all the data in all the columns?
Upvotes: 0
Views: 4940
Reputation: 41957
If the column data types and the number of columns are known, you can define a schema and apply it while reading the CSV file as a DataFrame. Below I have defined all five columns as StringType.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// All five columns declared as nullable strings
val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)  // an explicit schema makes inferSchema unnecessary
  .csv("file.csv")
You should get a DataFrame like this:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1 |B1 |C1 |null|null|
|A2 |B2 |C2 |D1 |null|
|A3 |B3 |C3 |D2 |E1 |
|A4 |B4 |C4 |D3 |null|
|A5 |B5 |C5 |null|E2 |
+----+----+----+----+----+
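As a side note, on Spark 2.3 and later the same all-string schema can also be passed as a DDL-formatted string, which is a bit more compact (a minimal sketch, equivalent to the StructType above):

val csvDF = spark.read
  .option("header", "false")
  .schema("col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING")
  .csv("file.csv")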
Upvotes: 0
Reputation: 13154
Basically your CSV file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row yourself.
As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. That means you first need to establish the maximum number of columns in the data set, which requires two passes over the data.
For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:
import org.apache.spark.sql.Row

val data = spark.read.textFile("file.csv").rdd

// First pass: pair each line with its column count; cache because we traverse twice.
// The -1 limit keeps trailing empty fields that split would otherwise drop.
val rdd = data.map(s => (s, s.split(",", -1).length)).cache
val maxColumns = rdd.map(_._2).max()

// Second pass: pad every row with nulls up to maxColumns
val x = rdd
  .map(row => {
    val rowData = row._1.split(",", -1)
    val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
    Row((rowData ++ extraColumns).toList:_*)
  })
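The result x is an RDD[Row]; to get back to a DataFrame you still need a schema. A minimal sketch, assuming all columns are strings and generated names like col1..colN are acceptable:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// One nullable string column per detected position
val schema = StructType((1 to maxColumns).map(i => StructField(s"col$i", StringType, nullable = true)))
val df = spark.createDataFrame(x, schema)
df.show()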
Hope that helps :)
Upvotes: 1
Reputation: 3059
You can read it as a dataset with only one column, for example by using another delimiter:
var df = spark.read.format("csv").option("delimiter",";").load("test.csv")
df.show()
+--------------+
| _c0|
+--------------+
| A1,B1,C1|
| A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
| A4,B4,C4,D3|
| A5,B5,C5,,E2|
+--------------+
Then you can use this answer to manually split your column into five; this will add null values when the element does not exist:
import org.apache.spark.sql.functions.split
import spark.implicits._

// Split the single column on "," and pick out each element by position;
// getItem returns null when the index is out of range
var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A1| B1| C1|null|null|
| A2| B2| C2| D1|null|
| A3| B3| C3| D2| E1|
| A4| B4| C4| D3|null|
| A5| B5| C5| | E2|
+----+----+----+----+----+
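Note that col4 in the last row holds an empty string rather than null, because the field was present but empty in the input (A5,B5,C5,,E2). If you would rather have null there too, matching the other answers, one possible cleanup (a sketch, not part of the original answer) is:

import org.apache.spark.sql.functions.{col, when}

// Replace empty strings with nulls in every column
val cleaned = csvDF.columns.foldLeft(csvDF) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", null).otherwise(col(c)))
}
cleaned.show()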
Upvotes: 0