Reputation: 3
I have a dataset in the form of a userID × itemID ratings matrix that I am trying to convert to the form {userID, itemID, rating} for use with the Mahout item-based recommender, as described here: https://mahout.apache.org/users/recommender/userbased-5-minutes.html#dataset.
In other words, I want to convert something like this, where the header row holds the itemIDs and the first column holds the userIDs:
     1    2    3
1  1.0  2.0  3.0
2  4.0  5.0  6.0
3  7.0  8.0  9.0
Into something like this:
1,1,1.0
1,2,2.0
1,3,3.0
2,1,4.0
2,2,5.0
2,3,6.0
3,1,7.0
3,2,8.0
3,3,9.0
Is there a way to accomplish this using Apache Hadoop tools (Pig, Hive, etc.)?
Upvotes: 0
Views: 88
Reputation: 1491
You can use explode (in Hive). If your input table looks like this:
userID  item1  item2  item3
---------------------------
1       1.0    2.0    3.0
2       4.0    5.0    6.0
3       7.0    8.0    9.0
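For reference, a minimal DDL sketch of such a table; the table and column names here are assumptions that match the query below, and the rating columns are assumed to be strings so that concat_ws accepts them directly:

CREATE TABLE in_table (
  userID INT,     -- user identifier (first column of the matrix)
  item1  STRING,  -- rating for item 1
  item2  STRING,  -- rating for item 2
  item3  STRING   -- rating for item 3
);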
Then your query can be:
SELECT userID,
       split(item_val, '_')[0] AS item,
       split(item_val, '_')[1] AS val
FROM (
  SELECT userID,
         array(concat_ws('_', 'item1', item1),
               concat_ws('_', 'item2', item2),
               concat_ws('_', 'item3', item3)) AS arr
  FROM in_table
) a
LATERAL VIEW explode(arr) exp AS item_val;
Explanation: The inner query generates this output:
userID  arr
---------------------------------------
1       (item1_1.0, item2_2.0, item3_3.0)
2       (item1_4.0, item2_5.0, item3_6.0)
3       (item1_7.0, item2_8.0, item3_9.0)
Then, after the explode, each row holds the userID together with one item_val string; the outer SELECT only needs to split that string into the itemID and the value.
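For the sample data above, the exploded rows look like this before the outer split:

userID  item_val
-----------------
1       item1_1.0
1       item2_2.0
1       item3_3.0
...     ...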
Also, if the table's item columns (item1, item2, item3) are defined as double, you need to CAST(item1 AS STRING) (and likewise for the others) before passing them into concat_ws, which only accepts string arguments.
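Putting it together, here is a sketch of the same query that assumes the item columns are DOUBLE and also emits the numeric itemID directly, using the labels '1', '2', '3' instead of 'item1' etc., so the output matches the desired 1,1,1.0 format without post-processing:

SELECT userID,
       split(item_val, '_')[0] AS itemID,
       split(item_val, '_')[1] AS rating
FROM (
  SELECT userID,
         -- CAST is needed because concat_ws only accepts strings
         array(concat_ws('_', '1', CAST(item1 AS STRING)),
               concat_ws('_', '2', CAST(item2 AS STRING)),
               concat_ws('_', '3', CAST(item3 AS STRING))) AS arr
  FROM in_table
) a
LATERAL VIEW explode(arr) exp AS item_val;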
Upvotes: 0