tuxian

Reputation: 178

Parse logs using parallel command asynchronously

I have the following sample log format in the log file

[01/18/2024 23:59:58 job100305.mydomain.com dbmy.db18.import 3823039] {"PREFIX":"ZaoBzgriEBMAIglWABgEZg","LEVEL":"info","MESSAGE":"received\n"}
[01/18/2024 23:59:58 job100129.mydomain.com db.db5.my-route 2423585] {"ELAPSED":"0.036","PREFIX":"ZaoBzgriFH4AJIMBAALvAw","LEVEL":"info","MESSAGE":"finished in 0.036\n","ROUTINGKEY":"mydomain11016.Interface.DistributeManager.InboundDistributor.RemotePutMessageBatch"}
...

I have log files named like this:

mylog-2024-01-19_00001
mylog-2024-01-19_00002
mylog-2024-01-19_00003
....
mylog-2024-01-19_00023

one per hour of the day, all in the same folder. These files are huge (~4 TB each), so I have difficulty reading their content.

So I need to write a shell script that uses the parallel command to parse the contents of these files. I can access the following machines from the current machine:

dev1.mydomain.com
dev2.mydomain.com
....
dev10.mydomain.com

And I have SSH key exchange set up with all these machines, so I can ssh to them without a password. I also have an sshloginfile in my home directory listing all the above domains, with a prefix specifying how many jobs should run on each:

4/ dev1.mydomain.com
4/ dev2.mydomain.com
...
4/ dev10.mydomain.com

Now I need a shell script that asynchronously fires the parallel command to grep for a given text. It should:

  1. run one parallel process per log file
  2. accumulate all results into a single file
  3. use one dev server per day
  4. parse all the logs in a given date range
  5. run 4 jobs at a time, since each dev server has 4 cores

I have tried this script, but I'm not sure what I'm doing wrong; it is not producing any results.

#!/bin/bash

log_directory="/path/to/log/files"   # directory containing the log files
start_date="2024-01-19"
end_date="2024-01-19"
output_file="parsed_logs.txt"

for log_file in $(ls $log_directory/mylog-$start_date*); do
    date_part=$(echo $log_file | grep -oP '\d{4}-\d{2}-\d{2}')

    if [[ "$date_part" >= "$start_date" && "$date_part" <= "$end_date" ]]; then
        dev_server=$(grep "$date_part" ~/sshlogins | awk '{print $2}')

        parallel --sshloginfile ~/sshlogins -S $dev_server -j4 "grep 'your_search_pattern' {} >> $output_file" ::: $log_file
    fi
done

echo "Parsing completed. Results are stored in $output_file."

Upvotes: 0

Views: 112

Answers (2)

Ole Tange

Reputation: 33740

From your description something like this may work:

[... select the dev server...]
[... set $date to 2024-02-21 ...]

ssh $dev parallel grep foo ::: /path/to/log/mylog-$date* > combined_output

If parallel is not installed on $dev and the paths are the same:

[... select the dev server...]
[... set $date to 2024-02-21 ...]

eval parallel -S 4/$dev grep foo ::: /path/to/log/mylog-$date* > combined_output
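The two bracketed steps above are left open in the answer. As a runnable illustration only (the server names, the round-robin selection scheme, and the paths are assumptions, not part of the answer), a dry-run driver that walks a date range and picks one dev server per day might look like:

```shell
#!/bin/bash
# Dry-run sketch: pick one dev server per day (round-robin over an assumed
# server list) and print the command the answer above would run for each day.
# All names and paths here are illustrative, not taken from the question.
start_date="2024-01-19"
end_date="2024-01-21"
servers=(dev1.mydomain.com dev2.mydomain.com dev3.mydomain.com)

d="$start_date"
i=0
# ISO dates compare correctly as strings, so [[ < ]] / [[ == ]] is safe here.
while [[ "$d" < "$end_date" || "$d" == "$end_date" ]]; do
    dev=${servers[i % ${#servers[@]}]}
    echo "ssh $dev \"parallel grep foo ::: /path/to/log/mylog-$d*\" > combined_output.$d"
    d=$(date -d "$d + 1 day" +%F)   # GNU date; BSD date needs -v+1d instead
    i=$((i + 1))
done
```

Replacing the `echo` with the actual `ssh` invocation (and appending `&` plus a final `wait` if the days should run concurrently) turns the dry run into a working driver.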

Upvotes: 0

jkool702

Reputation: 29

One aspect of what you are trying to do that isn't clear: what is on your local machine and what is on the remote server? I suspect this is related to your issue.

From your question, it sounds like everything is on the remote server. I'm not terribly familiar with parallel's "distribute over ssh" features, but I believe your code runs parallel locally, sets up the parallel call based on the log folder structure on your local machine, and then just distributes the already-resolved grep commands to run on the remote server.

For this to work you would need a "skeleton copy" of the remote server's log folder directory on your local machine (i.e., a directory with the same structure and filenames as on the remote server, but where all the log files are empty instead of filled with an absurd 4 TB of data each). Without this you'd get no output, since you would be looping over a directory that doesn't exist on your local machine.

Alternate solutions would be to either:

a) run the entire loop that checks the log folder directory and loops over a date range and runs the parallel command on the remote machine (e.g., via ssh) where the log folder directory actually exists, or

b) set up the parallel call by using ssh to query the log folder directory on the remote server. This would look something like replacing ls /path/to/log/dir with ssh user@server 'ls /path/to/log/dir'.
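A minimal sketch of option (b), with the real ssh listing shown in a comment; since ssh can't run here, a local temp directory stands in for the remote log folder so the listing logic itself is runnable. The pattern, server name, and paths are assumptions:

```shell
#!/bin/bash
# Option (b) sketch: build the file list by querying the machine that actually
# holds the logs, then hand that list to parallel. In real use the listing is:
#   files=$(ssh user@dev1.mydomain.com 'ls /path/to/log/mylog-2024-01-19*')
# Here a local temp directory plays the role of the remote log folder.
log_dir=$(mktemp -d)
touch "$log_dir"/mylog-2024-01-19_0000{1,2,3}

files=$(ls "$log_dir"/mylog-2024-01-19*)   # stand-in for the ssh listing
count=$(wc -l <<< "$files")

# Dry run: print the parallel call instead of executing it on a remote host.
echo "would run: parallel -S 4/dev1.mydomain.com grep 'pattern' ::: $files"
rm -rf "$log_dir"
```

Because the file list comes from the server that owns the logs, the loop over a nonexistent local directory (the failure mode described above) goes away.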

Upvotes: 0
