Reputation: 85
I have been working on a project that includes a Hive query of roughly this shape:
INSERT OVERWRITE TABLE .... SELECT TRANSFORM (....) USING 'python script.py' AS (....) FROM .... LEFT OUTER JOIN . . . LEFT OUTER JOIN . . . LEFT OUTER JOIN
At the beginning everything worked fine, until we loaded a large amount of dummy data (we just wrote the same records with small variations in some fields). After that, running the query again fails with a Broken pipe error and very little other information: there is no log about the failure, just the IOException: Broken pipe error.
To simplify the script and isolate the error, we reduced it to an identity pass-through:
import sys

for line in sys.stdin.readlines():
    print line
so that nothing can fail at that level. We still get the same error.
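For reference, a slightly more defensive version of that identity script (a sketch; `identity_transform` is just a name chosen here, not anything Hive requires). Iterating over the stream directly avoids `readlines()`, which loads the entire input into memory before the loop starts and can itself fail on very large streams:

```python
import sys

def identity_transform(stream_in, stream_out):
    # Echo each tab-separated row back unchanged, one line at a time,
    # so any remaining failure must come from Hive, not the script.
    for line in stream_in:
        stream_out.write(line)
    stream_out.flush()  # make sure nothing is left sitting in the buffer

if __name__ == "__main__":
    identity_transform(sys.stdin, sys.stdout)
```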
Upvotes: 2
Views: 3938
Reputation: 85
Another workaround is to remove the transform from the big query and split the work in two: first a query that inserts the joined data into another table, then a second query that runs only the transformation over that table. I'm not 100% sure why this helps, since the script itself is correct; I suspect the issue is the very large amount of data being streamed through the script because of the many joins.
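The two-step shape described above might look like this in HiveQL (a hedged sketch only; every table and column name here is hypothetical):

```sql
-- Step 1: materialize the join result, with no transform involved.
INSERT OVERWRITE TABLE joined_staging
SELECT a.id, a.field1, b.field2
FROM source_a a
LEFT OUTER JOIN source_b b ON (a.id = b.id);

-- Step 2: stream only the pre-joined rows through the script.
INSERT OVERWRITE TABLE final_output
SELECT TRANSFORM (id, field1, field2)
USING 'python script.py'
AS (id, result)
FROM joined_staging;
```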
Upvotes: 0
Reputation: 85
The problem seems to be solved by splitting the many joins across different queries and using intermediate tables, then adding a final query with one last join that summarizes all the previous results. As I understand it, this means there was no error at the script level; there was simply too much data for Hive to handle in a single streaming job.
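The staged-join approach can be sketched like this (again hypothetical names throughout; the point is one intermediate table per join stage, then a small final join):

```sql
-- Stage 1: first join, materialized.
INSERT OVERWRITE TABLE stage1
SELECT a.id, a.f1, b.f2
FROM t_a a
LEFT OUTER JOIN t_b b ON (a.id = b.id);

-- Stage 2: next join builds on the previous stage.
INSERT OVERWRITE TABLE stage2
SELECT s.id, s.f1, s.f2, c.f3
FROM stage1 s
LEFT OUTER JOIN t_c c ON (s.id = c.id);

-- Final query: one last join summarizing the previous results.
INSERT OVERWRITE TABLE final_result
SELECT s.id, s.f1, s.f2, s.f3, d.f4
FROM stage2 s
LEFT OUTER JOIN t_d d ON (s.id = d.id);
```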
Upvotes: 1