Reputation: 31526
I am trying to read an HDFS file created by a Hive table. The file is in text format, but when I open it I am surprised to see that the lines don't have any field delimiter.
Hive can read the files, but very, very slowly, so I want to read the content using a Spark job.
To understand the schema of the table I ran
describe extended foo
and got this output:
Detailed Table Information Table(tableName:foo, dbName:bar, owner:me,
createTime:1456445643, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:
[FieldSchema(name:some_ts, type:int, comment:null), FieldSchema(name:id,
type:string, comment:null), FieldSchema(name:t_p_ref, type:string,
comment:null) location:hdfs://nameservice1/user/hive/bar.db/ft,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:
{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[],
parameters:{numFiles=79, COLUMN_STATS_ACCURATE=true,
transient_lastDdlTime=1456446229, totalSize=8992777753, numRows=20776467,
rawDataSize=8972001286}, viewOriginalText:null, viewExpandedText:null,
tableType:MANAGED_TABLE)
So the output does not mention a delimiter at all. How do I read this file? Some of the fields are URLs, so it's quite hard to treat it as a fixed-width type of file.
Upvotes: 1
Views: 346
Reputation: 63022
Why not read the data via Spark SQL, which is quite happy to read Hive tables using a HiveContext? That way the datatypes are also properly set up in the resulting DataFrame.
So something like:
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val df = hc.sql("select * from foo limit 10")
// perform operations on your dataframe ..
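If you do want to read the files directly, note that LazySimpleSerDe with `serialization.format=1` uses the non-printing Ctrl-A character (`\u0001`) as the field delimiter by default, which is why the lines appear to have no separator. A minimal sketch, using the location and column order from the `describe extended` output above and assuming an existing SparkContext `sc`:

```scala
// Hypothetical row type matching the FieldSchema list above.
case class Foo(someTs: Int, id: String, tPRef: String)

// Read the raw Hive text files and split each line on the
// default LazySimpleSerDe delimiter, Ctrl-A (\u0001).
val raw = sc.textFile("hdfs://nameservice1/user/hive/bar.db/ft")

val rows = raw.map { line =>
  val fields = line.split('\u0001') // Ctrl-A, not a printable character
  Foo(fields(0).toInt, fields(1), fields(2))
}
```

This skips the metastore entirely, so you lose the type information Hive carries; the HiveContext approach above is usually the safer choice.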
Upvotes: 2