Reputation: 363
In our env, we have a number of users that issue Hive queries from different tools. Many times the user will give us a query and say it failed around a certain time, and we need to dig through and find the reason. So we need to parse hive logs and find the query.
Finding the query is the easier part, is there a way to figure out easily once the query is found (usually after the words Executing command) in hive logs, how do I easily get the yarn application id associated with it?
I'm trying to automate this process so that I can cut down our analysis time.
Thanks
Upvotes: 2
Views: 5578
Reputation: 1036
I actually found a way to do this. There's a couple of steps
I have a script to tail the log files. Reason for this is to take out the carriage returns in the SQL statement. I'll put it at the bottom but basically just doing tail on /opt/mapr/hive/hive-2.3/logs/mapr/mapr-hiveserver2-*.log and look for "Executing command". (I believe you've already done this). For example, I get this:
2021-01-21T15:43:21,836 INFO [HiveServer2-Background-Pool: Thread-4024376] ql.Driver: Executing command(queryId=mapr_20210121154321_abfc0235-571f-44dd-961c-521064c8d163): SELECT COUNT(*) FROM prod.statsman WHERE dt = '20210101' AND oid like 'qos.8%'
Now grab the queryId and grep the same file for the query id and "Submitting dag to TezSession". From this you can find your application_id
cat /opt/mapr/hive/hive-2.x/logs/mapr/mapr-hiveserver2-*.log | grep mapr_20210121154321_abfc0235-571f-44dd-961c-521064c8d163 | grep "Submitting dag to TezSession"
2021-01-21T15:43:21,952 INFO [HiveServer2-Background-Pool: Thread-4024376] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-bd39c7da-7041-4fff-ab51-9d0bac18178e, applicationId=application_1607034625968_103472, dagName=SELECT COUNT(*) FROM prod.statsma...'qos.8%'(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=mapr_20210121154321_abfc0235-571f-44dd-961c-521064c8d163 }
Note: to tail the log file I create this perl code as stripcrlf.pl
#!/usr/bin/perl
use strict;
use warnings;
while(<>) {
my $line = $_;
$line =~ s/\r\n/ /g;
print $line;
}
Then this to tail the log file
tail -f /opt/mapr/hive/hive-2.x/logs/mapr/mapr-hiveserver2-*.log | ./stripcrlf.pl | grep "Executing command" | grep -viP "show partitions|describe |analyze table|show schemas|create external|desc "
the grep -viP is optional, I add it as we have a process that runs a bunch of these sort of commands.
Upvotes: 0
Reputation: 161
One thing you can do is, you can use the yarn application -list -appStates ALL
command. What it does is, it will give you list of all the applications with their respective application-id's, name, user, state, tracking url etc.. So you can drill down what you need. You can even specify the state, like which state data you need for eg:- yarn application -list -appStates FINISHED
will display all records which have state as finished.
Upvotes: 1
Reputation: 2899
Not a full answer, but it can help get you started.
Instead of looking at the hive logs, you could look at the timeline server rest API, and maybe the resource manager API. From then you should be able to get a list of queries and the app id.
If you are using Hortonworks have a look at the Tez view. It displays queries and app id. Looking at what actually happens with the developer tools of your browser can help ge the flow of API calls. I am not familiar enough with Cloudera or MapR to know where to look at.
The advantage of this option is that it can all be done remotely and handles HA more gracefully as it does not need to run directly on your hive servers(s)
Upvotes: 0