Nitish Kumar

Reputation: 4870

Convert CSV to Avro file in Java or Scala

Is there any library for converting CSV to Avro files in Java or Scala?

I tried to Google it, but was not able to find any library for it.

Upvotes: 2

Views: 11042

Answers (4)

Badr

Reputation: 131

You can use Spark for this purpose, or spark-shell (with the option --packages org.apache.spark:spark-avro...) if it's an ad-hoc usage.

A sample code example:

val df = spark.read.csv("example.csv")
df.write.format("com.databricks.spark.avro").save("example.avro")

Upvotes: 1

Bala

Reputation: 11244

You could try it this way (Spark 1.6).

people.csv

Michael, 29
Andy, 30
Justin, 19

PySpark

file = sc.textFile("people.csv")
df = file.map(lambda line: line.split(',')).toDF(['name','age'])

>>> df.show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

df.write.format("com.databricks.spark.avro").save("peopleavro")

Peopleavro

{u'age': u' 29', u'name': u'Michael'}
{u'age': u' 30', u'name': u'Andy'}
{u'age': u' 19', u'name': u'Justin'}
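The leading spaces in those ages come straight from the CSV, where each value after a comma begins with a space. Whatever tool performs the conversion, stripping and casting first is what lets the Avro schema declare "int" rather than "string"; a minimal plain-Python sketch of that cleanup step (standalone, not Spark):

```python
import csv
import io

# Same sample data as people.csv above.
raw = "Michael, 29\nAndy, 30\nJustin, 19\n"

# Strip whitespace and cast the age column to int so a downstream
# Avro schema can declare "int" instead of "string".
rows = [
    (name.strip(), int(age))
    for name, age in csv.reader(io.StringIO(raw))
]
print(rows)  # [('Michael', 29), ('Andy', 30), ('Justin', 19)]
```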

Should you need to maintain data types, create a schema and pass it. Note that you must also cast the values, since everything read from the text file arrives as a string.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])

df = file.map(lambda line: line.split(',')).map(lambda p: (p[0], int(p[1]))).toDF(schema)
>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Now your Avro file has the schema:

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "name",
    "type" : [ "string", "null" ]
  }, {
    "name" : "age",
    "type" : [ "int", "null" ]
  } ]
}

Upvotes: 2

BluEOS

Reputation: 606

You can do it easily by:

Upvotes: 0

pedrorijo91

Reputation: 7845

By Googling, I found this article: https://dzone.com/articles/convert-csv-data-avro-data

Quoting:

To convert csv data to Avro data using Hive we need to follow the steps below:

  1. Create a Hive table stored as textfile and specify your csv delimiter also.
  2. Load csv file to above table using "load data" command.
  3. Create another Hive table using AvroSerDe.
  4. Insert data from former table to new Avro Hive table using "insert overwrite" command.

Example: using a csv (student_id, subject_id, marks)

--1. Create a Hive table stored as textfile
USE test;
CREATE TABLE csv_table (
student_id INT,
subject_id INT,
marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

--2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table;

--3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
    'avro.schema.literal'='{
      "namespace": "com.rishav.avro",
      "name": "student_marks",
      "type": "record",
      "fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
    }');

--4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;
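One easy thing to get wrong in step 3 is the avro.schema.literal value: it is embedded as a string in the DDL but must itself be valid standalone JSON, otherwise AvroSerDe cannot parse the schema. A quick sanity check of the literal outside Hive (plain Python, assuming only that you paste the literal verbatim):

```python
import json

# The schema literal from step 3, pasted as a plain string.
literal = """{
  "namespace": "com.rishav.avro",
  "name": "student_marks",
  "type": "record",
  "fields": [ {"name": "student_id", "type": "int"},
              {"name": "subject_id", "type": "int"},
              {"name": "marks", "type": "int"} ]
}"""

schema = json.loads(literal)  # raises ValueError if the JSON is malformed
print([f["name"] for f in schema["fields"]])  # ['student_id', 'subject_id', 'marks']
```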

Upvotes: 2
