Reputation: 403
I am reading in some data from a JSON file and converting it to a string that I use to send the data to Hive. The data arrives fine in Hive, but it gets distributed into the wrong columns. I have made a small example
in Hive:
Table name = TestTable, Column1 = test1, Column2 = test2
My code:
# expected: 'hej' in test1 and 'med' in test2
data = hiveContext.sql("select 'hej' as test1, 'med' as test2")
data.write.mode("append").saveAsTable("TestTable")
# aliases swapped: expected 'hej' in test2 and 'med' in test1
data = hiveContext.sql("select 'hej' as test2, 'med' as test1")
data.write.mode("append").saveAsTable("TestTable")
This results in "hej" showing up in test1 both times and "med" showing up in test2 both times, instead of the values swapping columns on the second insert. The values always seem to land in the columns in the order they are written, not in the columns I name with the 'as' keyword.
Anyone have any ideas?
Upvotes: 3
Views: 12179
Reputation: 9067
It always just seems to show up in the order written...
You are right. Spark works just like any SQL database would. The column names in the input dataset do not make any difference.
And since you do not explicitly map the output columns to the input columns, Spark has to assume that the mapping is done by position.
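One way around this (a minimal sketch, untested, using the same HiveContext API as your example and assuming TestTable's physical column order is test1, test2) is to re-select the DataFrame columns by name in the table's order before appending, so that position and name agree:
data = hiveContext.sql("select 'hej' as test2, 'med' as test1")
# re-selecting by name puts the columns in the table's physical order,
# so the positional append now matches the aliases
data.select("test1", "test2").write.mode("append").saveAsTable("TestTable")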
Just meditate over the following test case...
hiveContext.sql("create temporary table TestTable (RunId string, Test1 string, Test2 string)")
hiveContext.sql("insert into table TestTable select 'A', 'x1', 'y1'")
hiveContext.sql("insert into table TestTable (RunId, Test1, Test2) select 'B', 'x2' as Blurb, 'y2' as Test1")
hiveContext.sql("insert into table TestTable (RunId, Test2, Test1) select 'C', 'x3' as Blurb, 'y3' as Test1")
data = hiveContext.sql("select 'xxx' as Test1, 'yyy' as Test2"))
data.registerTempTable("Dummy")
hiveContext.sql("insert into table TestTable(Test1, RunId, Test2) select Test1, 'D', Test2 from Dummy")
hiveContext.sql("insert into table TestTable select Test1, 'E', Test2 from Dummy")
hiveContext.sql("select * from TestTable").show(20)
Disclaimer - I did not actually test these commands; there are probably a couple of typos and syntax issues inside (especially since you do not mention your Hive and Spark versions), but you should see the point.
Upvotes: 5