Reputation: 385
I have an nginx log file that I want to split into multiple files based on IPs. For instance, I have ips1.txt and ips2.txt. Each file holds half of the unique IPs in the log file. The nginx log file has the following format:
172.0.0.10 - [24/Jun/2018:11:00:00 +0000] url1 GET url2 HTTP/1.1 (200) 0.000 s 2356204 b url3 - - [HIT] - s - Mozilla/5.0 (X11; CrOS x86_64 10452.99.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.203 Safari/537.36
172.0.0.11 - [24/Jun/2018:11:00:00 +0000] url1 GET url2 HTTP/1.1 (200) 0.000 s 307 b url3 - - [HIT] - s - Mozilla/5.0 (X11; CrOS x86_64 10452.99.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.203 Safari/537.36
So, what I am doing to get all lines that start with an IP listed in my pattern file is:
cat log.txt | grep -f ips1.txt > part1.txt
cat log.txt | grep -f ips2.txt > part2.txt
I know that this grep searches the whole line and not just the beginning, which makes the search slower and uses more memory than necessary. I know that with just one pattern to look for I could use awk (e.g. awk '{if($1 == "172.0.0.10")print;}' log.txt), but I don't know how to do this with a pattern file using grep.
So what I want is to use less memory and make the search faster by looking only at the beginning of each line. My log file is many GBs, so if this is possible it will save me a lot of time.
EDIT:
My ips*.txt files are generated based on the number of threads I have. You can see my code below:
NUM_THREADS=8
export LC_ALL=C
unpigz -c log.gz | awk '{print $1;}' | LC_ALL=C sort -S 20% -u > all_ips.txt
lines_arq=$(wc -l all_ips.txt | cut -d' ' -f1)
lines_each_file=$(($lines_arq / $NUM_THREADS + 50))
split --lines=$lines_each_file all_ips.txt 2018/prefixo.
zgrep log.gz -Fwf 2018/prefixo.aa | pigz > file1.gz &
zgrep log.gz -Fwf 2018/prefixo.ab | pigz > file2.gz &
...
zgrep log.gz -Fwf 2018/prefixo.ah | pigz > file8.gz &
wait
unpigz -c file1.gz | pypy script.py -i - -o onOff -s 11:00:00 -m testing -y 2018 | pigz > onOff-file1.gz &
...
unpigz -c file8.gz | pypy script.py -i - -o onOff -s 11:00:00 -m testing -y 2018 | pigz > onOff-file8.gz &
Upvotes: 1
Views: 1080
Reputation: 212178
Use awk for the whole thing. Read your fixed strings first, then split the log, e.g.:
awk '{out[$1] = FILENAME ".out"}   # map each IP to the output file for its list
END {while ((getline < input) > 0) if ($1 in out) print > (out[$1])}
' input=log.txt ips[12].txt
Reading the input file multiple times is going to kill your performance far more than the overhead of awk splitting the lines unnecessarily.
A brief explanation of the code follows. The main rule (the only one besides END) reads the input and builds an array that maps each IP to an output filename. All of the ips*.txt files are given as input, so their lines are what gets read into the array. Ideally, these files are relatively small, so building this array won't take a lot of RAM. After the array is built, awk enters the END clause, where it reads the log file (only once!) and writes each line to the appropriate file.
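To make the two phases concrete, here is a small self-contained sketch of the same idea. The sample log lines and the scratch directory are made up for illustration, and the getline loop is written defensively (it stops on a read error and skips IPs that are in neither list), which is an assumption on my part rather than something the question requires.

```shell
# Hypothetical sample data in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' '172.0.0.10 - [24/Jun/2018:11:00:00 +0000] url1 GET' \
              '172.0.0.11 - [24/Jun/2018:11:00:00 +0000] url1 GET' \
              '172.0.0.10 - [24/Jun/2018:11:00:01 +0000] url1 GET' > "$tmp/log.txt"
echo 172.0.0.10 > "$tmp/ips1.txt"
echo 172.0.0.11 > "$tmp/ips2.txt"

# Phase 1 (main rule): remember which output file each IP belongs to.
# Phase 2 (END): stream the log once and route each line by its first field.
awk -v input="$tmp/log.txt" '
{ out[$1] = FILENAME ".out" }
END {
    while ((getline line < input) > 0) {
        split(line, f, " ")
        if (f[1] in out) print line > (out[f[1]])
    }
}' "$tmp/ips1.txt" "$tmp/ips2.txt"

# Show how many lines landed in each part
wc -l "$tmp/ips1.txt.out" "$tmp/ips2.txt.out"
```

The log is opened exactly once, no matter how many ips*.txt files are passed in.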
It seems that you want to generate the ips*.txt dynamically, and that you just want to distribute the log. In that case, try something like:
awk '! ($1 in out) {out[$1] = (idx++ %10) }
{ outfile= "output." out[$1] ".txt"; print > outfile ; next} ' log.txt
This simply checks whether you've already seen an IP. If you have, the line is written to the same file as the previous log lines for that IP. If not, a counter is incremented (mod 10; pick your modulus depending on how many files you want), the resulting file number is recorded for that IP, and the line is written to that file. Repeat for each line in the log.
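A tiny worked run may help to see the bucketing. This is only a sketch: the four log lines are invented, and the variables n (number of output files) and dir (output directory) are names I introduce here, not part of the command above.

```shell
tmp=$(mktemp -d)
# Hypothetical log: three distinct IPs, the first one seen twice
printf '%s\n' '172.0.0.10 a' '172.0.0.11 b' '172.0.0.12 c' '172.0.0.10 d' > "$tmp/log.txt"

awk -v n=2 -v dir="$tmp" '
!($1 in out) { out[$1] = idx++ % n }          # first sighting: assign the next bucket
{ print > (dir "/output." out[$1] ".txt") }   # every line: write to the bucket of its IP
' "$tmp/log.txt"

wc -l "$tmp/output.0.txt" "$tmp/output.1.txt"
```

With n=2, 172.0.0.10 and 172.0.0.12 go to output.0.txt and 172.0.0.11 goes to output.1.txt, so all lines of one IP stay together. Note that the buckets are balanced by the number of distinct IPs, not by the number of log lines. In the pipeline from the question, the same program could presumably read from unpigz -c log.gz on standard input instead of a plain file.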
The key here is to minimize the number of times you read the log.
Upvotes: 2
Reputation: 27195
Here are some ideas to speed up your commands. Be sure to benchmark them. I was missing the data to benchmark them myself.
Try zgrep file over unpigz -c file | grep.
Set the C locale: LC_ALL=C zgrep ...
Use fixed string search -F together with word regexes -w. Fixed string search should be a bit faster than the default basic regex search. For the fixed string case, word regexes are the closest thing you can get to »search only at the beginning of the line«:
grep -Fwf ip...
Or build a regex from the IP file and add ^ to the beginning to search only at the beginning of lines. Then use either grep -E "$regex" or grep -P "$regex" / pcregrep "$regex". The speed of -E and -P can differ quite a lot, so check both to see which one is faster. In the command below, the trailing | left behind by tr is removed so that the group has no empty alternative (which would match every line), and the space after the group keeps 172.0.0.1 from also matching 172.0.0.10:
regex="$(tr \\n \| < ips1.txt | sed 's/|$//;s/\./\\./g;s/^/^(/;s/$/) /')"
zgrep -E "$regex" yourfile > part1.txt
zgrep -Ev "$regex" yourfile > part2.txt
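The beginning-of-line idea can also be combined with a pattern file, which avoids passing one very long regex on the command line. The following is a minimal sketch, assuming GNU grep and sed; the sample log and the file name pat1.txt are made up for illustration, and I have not measured whether this is faster than -Fwf on a large log, so benchmark it as well.

```shell
tmp=$(mktemp -d)
# Hypothetical log: the third line starts with a different IP but
# contains 172.0.0.10 later in the line
printf '%s\n' '172.0.0.10 - [24/Jun/2018:11:00:00 +0000] url1' \
              '172.0.0.11 - [24/Jun/2018:11:00:00 +0000] url1' \
              '172.0.0.100 - [24/Jun/2018:11:00:00 +0000] 172.0.0.10' > "$tmp/log.txt"
echo 172.0.0.10 > "$tmp/ips1.txt"

# Turn each IP into an anchored basic regex:
# escape the dots, prepend ^, append the space that follows the IP in the log
sed 's/\./\\./g; s/^/^/; s/$/ /' "$tmp/ips1.txt" > "$tmp/pat1.txt"

LC_ALL=C grep -f "$tmp/pat1.txt" "$tmp/log.txt" > "$tmp/part1.txt"
LC_ALL=C grep -v -f "$tmp/pat1.txt" "$tmp/log.txt" > "$tmp/part2.txt"
```

Here part1.txt receives only the first line. A plain grep -Fwf "$tmp/ips1.txt" would also pick up the third line, because the IP appears there as a separate word further along the line.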
Upvotes: 2