Reputation: 178
I have the following sample log format in the log file
[01/18/2024 23:59:58 job100305.mydomain.com dbmy.db18.import 3823039] {"PREFIX":"ZaoBzgriEBMAIglWABgEZg","LEVEL":"info","MESSAGE":"received\n"}
[01/18/2024 23:59:58 job100129.mydomain.com db.db5.my-route 2423585] {"ELAPSED":"0.036","PREFIX":"ZaoBzgriFH4AJIMBAALvAw","LEVEL":"info","MESSAGE":"finished in 0.036\n","ROUTINGKEY":"mydomain11016.Interface.DistributeManager.InboundDistributor.RemotePutMessageBatch"}
...
So I have this log file name such as
mylog-2024-01-19_00001
mylog-2024-01-19_00002
mylog-2024-01-19_00003
....
mylog-2024-01-19_00023
one for each hour of the day, all in the same folder. These files are huge (e.g. ~4 TB each), so I have difficulty reading their content.
So I need to write a shell script that uses the parallel command to parse the content of these files. I can access the following machines from the current machine.
dev1.mydomain.com
dev2.mydomain.com
....
dev10.mydomain.com
And I have SSH key exchange set up with all these machines, which means I can ssh to them without a password. I also have an sshloginfile in my home directory listing the above domains, with a prefix configuring how many jobs should run on each:
4/ dev1.mydomain.com
4/ dev2.mydomain.com
...
4/ dev3.mydomain.com
Now I need a shell script that asynchronously fires the parallel command to grep for a given text.
I have tried the script below, but I'm not sure what I'm doing wrong: it is not producing any results.
#!/bin/bash
log_directory="/path/to/log/files"
start_date="2024-01-19"
end_date="2024-01-19"
output_file="parsed_logs.txt"
for log_file in $(ls $log_directory/mylog-$start_date*); do
    date_part=$(echo $log_file | grep -oP '\d{4}-\d{2}-\d{2}')
    if [[ "$date_part" >= "$start_date" && "$date_part" <= "$end_date" ]]; then
        dev_server=$(grep "$date_part" ~/sshlogins | awk '{print $2}')
        parallel --sshloginfile ~/sshlogins -S $dev_server -j4 "grep 'your_search_pattern' {} >> $output_file" ::: $log_file
    fi
done
echo "Parsing completed. Results are stored in $output_file."
Upvotes: 0
Views: 112
Reputation: 33740
From your description, something like this may work:
[... select the dev server...]
[... set $date to 2024-02-21 ...]
ssh $dev parallel grep foo ::: /path/to/log/mylog-$date* > combined_output
If parallel is not installed on $dev and the paths are the same:
[... select the dev server...]
[... set $date to 2024-02-21 ...]
eval parallel -S 4/$dev grep foo ::: /path/to/log/mylog-$date* > combined_output
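Putting the elided steps together, a minimal concrete sketch of the second variant might look like the following. The host list, pattern, and log path are placeholder assumptions, not part of the original answer, and the real call is echoed rather than executed:

```shell
#!/bin/bash
# Sketch of the second variant: GNU parallel runs locally and dispatches
# each grep to $dev over ssh via -S. Hosts, pattern, and paths below are
# placeholder assumptions.

date="2024-02-21"
pattern="foo"
hosts=(dev1.mydomain.com dev2.mydomain.com dev10.mydomain.com)

# Pick a dev host (simplest possible selection: the first one).
dev=${hosts[0]}

# Build the command string. The answer wraps the real call in eval;
# here it is echoed for inspection instead.
cmd="parallel -S 4/$dev grep $pattern ::: /path/to/log/mylog-$date*"

echo "$cmd"
# eval "$cmd" > combined_output   # uncomment to actually run it
```

The `4/` prefix on the host caps the run at 4 concurrent jobs on that machine, matching the sshloginfile convention from the question.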
Upvotes: 0
Reputation: 29
One aspect of what you are trying to do that isn't clear: what is on your local machine and what is on the remote server? I suspect this is related to your issue.
From your question, it sounds like everything is on the remote server. I'm not terribly familiar with using parallel's "distribute over ssh" features, but I believe that your code tries to run parallel locally, sets up the parallel call based on the log folder structure on your local machine, and then just distributes the already set-up/resolved grep commands to run on the remote server.
For this to work you would need a "skeleton copy" of the remote server's log directory on your local machine (i.e., a directory with the same structure and filenames as on the remote server, but where all the log files are empty instead of filled with an absurd 4 TB of data each). Without this you'd get no output, since you would be looping over a directory that doesn't exist on your local machine.
Alternate solutions would be to either:
a) run the entire loop that checks the log folder directory and loops over a date range and runs the parallel command on the remote machine (e.g., via ssh) where the log folder directory actually exists, or
b) set up the parallel call by using ssh to query the log folder directory on the remote server. This would look something like replacing ls /path/to/log/dir with ssh user@server 'ls /path/to/log/dir'.
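A rough sketch of option (b), with the server name, paths, and pattern as placeholder assumptions (the commands are built as strings and echoed, so nothing runs over ssh until you uncomment the last line):

```shell
#!/bin/bash
# Sketch of option (b): the ls runs on the remote server over ssh, and the
# resulting file names are fed to parallel, which also runs each grep on
# that server via -S. Server, paths, and pattern are placeholders.

server="dev1.mydomain.com"
log_dir="/path/to/log/files"
pattern="your_search_pattern"
start_date="2024-01-19"

# Remote listing: the glob expands on $server, where the files exist.
list_cmd="ssh $server 'ls $log_dir/mylog-$start_date*'"

# Remote grep: up to 4 concurrent jobs on $server, one file per job.
grep_cmd="parallel -S 4/$server \"grep '$pattern' {}\""

echo "$list_cmd | $grep_cmd"   # inspect first
# eval "$list_cmd" | eval "$grep_cmd" > parsed_logs.txt   # run for real
```

This keeps the date-range loop logic on the local machine while making sure the file listing and the greps both happen where the 4 TB files actually live.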
Upvotes: 0