axel_ande

Reputation: 449

Separating lines of a huge file into two files depending on the date

I'm gathering tons of data from a stream on an Ubuntu machine. The data is stored in daily files, where each day's file is somewhere between 1 and 5 GB. I'm not an experienced Linux/bash/awk user, but the data looks something like this (all lines start with a date):

2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way

Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows. It works, but it takes several hours to run, and I think there must be a faster way of doing this operation.

#!/bin/bash

dateDiff (){
    line_str="$1"
    dte1="2020-09-01"
    dte2=${line_str:0:10}
    if [[ "$dte1" == "$dte2" ]]; then 
        echo $line_str >> correct_date.txt; 
    else 
        echo $line_str >> wrong_date.txt; 
    fi
}

IFS=$'\n'
for line in $(cat massive_file.txt)
do
    dateDiff "$line"
done
unset IFS

Upvotes: 0

Views: 81

Answers (3)

rcwnd_cz

Reputation: 1016

Using this awk script I'm able to process a 10 GB file in approximately 1 minute on my machine.

awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" }  }' input_file_name.txt

Each line is checked against a regular expression containing your date, then the whole line is printed to one file or the other based on the regexp match.
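
If the target date changes from file to file, a possible variation (just a sketch; the variable and file names are placeholders) is to pass the date in with -v instead of hardcoding it in the regex:

# Pass the wanted date as an awk variable and compare the first 10 characters.
awk -v d="2020-08-31" 'substr($0, 1, 10) == d { print > "correct.txt"; next }
                                              { print > "wrong.txt" }' input_file_name.txt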

Upvotes: 3

thanasisp

Reputation: 5975

Some notes:

  • Using a bash loop to read logs would be very slow.

  • Using awk, sed, grep or similar is very good, but you will still have to read and write the whole file line by line, and this has a performance ceiling.

  • For your specific case, you could instead identify only the split points, of which there can be 3, not just 2 (logs from the previous, current and next day can co-exist in one file), with something like grep -nm1 "^$day", and then split the log file with a combination of head and tail, as sketched below. Then append or prepend the pieces to the existing files. This would be a very fast solution because you would write the files in bulk, not line by line.
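
A minimal sketch of that idea, assuming at most one split point per file (the file and day names are placeholders):

#!/bin/bash
# Locate the first line of the next day once, then cut the file in two with
# head/tail instead of filtering every line.
f="massive_file.txt"
curr_day="2020-08-31"
next_day="2020-09-01"

n=$(grep -nm1 "^$next_day" "$f" | cut -d: -f1)             # line number of the split point

if [[ -n "$n" ]]; then
    head -n "$((n - 1))" "$f" >> "${curr_day}_file.log"    # everything before the split
    tail -n "+$n" "$f" >> "${next_day}_file.log"           # everything from the split onwards
else
    cat "$f" >> "${curr_day}_file.log"                     # no split point in this file
fi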


Here is a simple solution with grep, as you only need to test the first 10 characters of each log line, and for this job grep is faster than awk.

Assuming that you store logs in a destination directory, every incoming file should pass through something like this script. The order of processing is important: you have to follow the date order of the files, e.g. you can see that I append to an existing file. This is just a demo solution for guidance.

#!/bin/bash

[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }

dest_dir=archive/
suffix="_file.log"

curr=${f:0:10}                                  # date taken from the first 10 characters of the file name
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )    # the day before
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )    # the day after

for d in $prev $curr $next; do
    grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"    # append that day's lines to its archive file
done
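
For example, it could be driven like this (a hypothetical usage sketch; the script name and file pattern are placeholders):

# The script reads the date from the first 10 characters of its argument,
# so run this from the directory holding the incoming files; the glob
# expands in lexicographic order, which here is also date order.
for f in 20??-??-??_stream.log; do
    bash split_by_day.sh "$f"
done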

Upvotes: 0

jas

Reputation: 10865

Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.

$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way

$ awk -FT '{ print > ($1 ".txt") }' file

$ ls 20*.txt
2020-08-31.txt  2020-09-01.txt

$ cat 2020-09-01.txt 
2020-09-01T00:00:00Z !In a unreadable way

$ cat 2020-08-31.txt 
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
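
If a single file ever contains many distinct dates, some awk implementations limit how many output files can stay open at once; a possible variant (a sketch, assuming the input is sorted by date) closes each file when the date changes:

awk -FT '$1 != prev { if (prev != "") close(prev ".txt"); prev = $1 }
         { print > ($1 ".txt") }' file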

Upvotes: 1
