John Thomas

Reputation: 222

How to read a space-delimited text file and save it to Hive?

I have a string like the one below. The first row is the header, and the rest are the column values. I want to create a dataframe (Spark 1.6 and Java 7) from the string, and convert the values under col3 and col4 to DOUBLE.

col1 col2 col3 col4 col5
val1 val2 val3 val4 val5
val6 val7 val8 val9 val10
val11 val12 val13 val14 val15

After creating the dataframe for the above, I need to prepend two fields whose values are constants:

const1 const2 col1 col2 col3 col4 col5
const1 const2 val1 val2 val3 val4 val5
const1 const2 val6 val7 val8 val9 val10
const1 const2 val11 val12 val13 val14 val15

I want to write the above dataframe to a Hive table with the schema (const1 int, const2 int, col1 double, col2 double, col3 double, col4 double, col5 double). I'm using Spark 1.6 and Java 7.

Upvotes: 2

Views: 5125

Answers (2)

philantrovert

Reputation: 10092

Since you have all the required data in a file, and the columns you want to prepend are constants, this can also be done using a simple case class and createDataFrame. Here is a Scala solution that should help you see how to approach the problem:

case class Schema(const1: Int, const2: Int, col1: String, col2: String, col3: String, col4: String, col5: String)

// Skip the header while loading the data: drop the first line of the first partition
val fileRDD = sc.textFile("file")
                .mapPartitionsWithIndex{ (index, iter) => if (index == 0) iter.drop(1) else iter }
                .map(line => line.split(" "))
                .map(x => Schema(1, 2, x(0), x(1), x(2), x(3), x(4)))
val df = sqlContext.createDataFrame(fileRDD)
df.show()
+------+------+-----+-----+-----+-----+-----+
|const1|const2| col1| col2| col3| col4| col5|
+------+------+-----+-----+-----+-----+-----+
|     1|     2| val1| val2| val3| val4| val5|
|     1|     2| val6| val7| val8| val9|val10|
|     1|     2|val11|val12|val13|val14|val15|
+------+------+-----+-----+-----+-----+-----+

And for loading the data into Hive, as Jacek has mentioned:

Once done, saving to Hive is just a matter of using the write() method of your DataFrame:

write().saveAsTable(tableName)

See DataFrame API.
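
Note that the case class above keeps col1 through col5 as strings. The question asks for col3 and col4 as DOUBLE, so a cast before saving may be needed. A minimal sketch in Java (the asker's environment), assuming the dataframe df from above; with the placeholder valN values the cast would yield nulls, so it only makes sense once those columns hold real numbers:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.col;

// Replace the string-typed columns with double-typed versions.
DataFrame typed = df
    .withColumn("col3", col("col3").cast("double"))
    .withColumn("col4", col("col4").cast("double"));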

Upvotes: 0

Jacek Laskowski

Reputation: 74779

I strongly recommend using the latest and greatest Spark 2.1.1, which supports CSV files out of the box.


Unless I'm mistaken, CSV support in Spark 1.6 comes from the external spark-csv package.

Start your Spark environment with the package loaded:

--packages com.databricks:spark-csv_2.11:1.5.0

and that gives you csv format support.

From the package's homepage:

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")  // <-- enables csv support in Spark SQL
    .option("inferSchema", "true")
    .option("header", "true")
    .load("cars.csv");

You'd have to use the delimiter option, since the delimiter defaults to a comma and your file is space-delimited.

delimiter: by default columns are delimited using ,, but delimiter can be set to any character
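
For the space-delimited input in the question, the read could look like the following sketch, where input.txt is a placeholder path and sc is an existing SparkContext:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("header", "true")        // first row is the header
    .option("delimiter", " ")        // columns are separated by a single space
    .option("inferSchema", "true")   // let spark-csv guess the column types
    .load("input.txt");              // placeholder path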

With the file loaded as a DataFrame, you can add the constant columns using the DataFrame.withColumn method:

public DataFrame withColumn(String colName, Column col)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
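
For example, to add the two constant columns from the question (a sketch, assuming the df from the read above; the values 1 and 2 stand in for the real constants):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.lit;

// Append the constant-valued columns, then reorder so they come first,
// matching the desired schema.
DataFrame withConsts = df
    .withColumn("const1", lit(1))
    .withColumn("const2", lit(2))
    .select("const1", "const2", "col1", "col2", "col3", "col4", "col5");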

Once done, saving to Hive is just a matter of using the write() method of your DataFrame:

write().saveAsTable(tableName)

See DataFrame API.
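
Putting it together, the save step could look like this sketch, where my_hive_table is a placeholder name, SaveMode.Overwrite is just one choice, and your SQLContext needs to be a HiveContext for the table to land in the Hive metastore:

import org.apache.spark.sql.SaveMode;

withConsts.write()
    .mode(SaveMode.Overwrite)       // replace the table if it already exists
    .saveAsTable("my_hive_table");  // placeholder table name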

Upvotes: 0
