Eric

Reputation: 2705

Printing only lines with timestamps later than an input timestamp?

I am using Python to process output from the Hadoop filesystem, where each line contains a timestamp and a file name.

Since the output is long, I want to print only the lines that have not been checked yet.

For this, I am going to store a timestamp last_ts, which is the last time the system was checked.

To print the whole output, I am using the command

hadoop fs -ls /path/to/donemarkerfiles/ | sort -k 6 |awk '{print $6" "$7" "$8} '

where $6 is the date, $7 is the time, and $8 is the file path.

I want to compare $6 and $7 with last_ts, and print only the lines whose timestamp is later than last_ts.

How can I do this? I tried to use an if condition in awk, but I kept stumbling and gave up.


Sample output:

2014-06-23 05:45 /user/hdfs/warehouse/donemarkers/20140621_basic.done
2014-06-23 07:13 /user/hdfs/warehouse/donemarkers/20140621_stat.done
2014-06-23 08:08 /user/hdfs/warehouse/donemarkers/20140621_raw.done
2014-06-23 09:30 /user/hdfs/warehouse/donemarkers/20140621_join.done
2014-06-23 09:31 /user/hdfs/warehouse/donemarkers/20140621_upload_file.done
2014-06-23 15:52 /user/hdfs/warehouse/donemarkers/20140622_basic.done
2014-06-23 20:23 /user/hdfs/warehouse/donemarkers/20140622_stat.done
2014-06-23 21:40 /user/hdfs/warehouse/donemarkers/20140622_raw.done
2014-06-23 22:57 /user/hdfs/warehouse/donemarkers/20140622_join.done
2014-06-23 22:58 /user/hdfs/warehouse/donemarkers/20140622_upload_file.done

Upvotes: 0

Views: 182

Answers (1)

drysdam

Reputation: 8637

This one was ridiculously hard to figure out, I guess mainly because I'm not an awk expert. However, this does the heavy lifting:

awk '{cmd = "date -d \"" $1 " " $2 "\" +%s"; cmd | getline secs; close(cmd); print secs, $0}' /tmp/data

This uses awk to call the date utility (GNU date, since -d is a GNU option) to convert each line's timestamp to epoch seconds, and getline assigns the command's output to an awk variable. That command just prints everything out; now let's do the filtering.

awk -v last_ts="$last_ts" 'BEGIN {cmd = "date -d \"" last_ts "\" +%s"; cmd | getline mindate; close(cmd)} {cmd = "date -d \"" $1 " " $2 "\" +%s"; cmd | getline secs; close(cmd); if (secs > mindate) print $0}' /tmp/data

Now we've got two things of the form cmd | getline var in there, which seems unwieldy. I would put that in a script (with a comment), but I'd never type it 'live'. Also, I'm not clear where last_ts lives; here I've assumed it is a shell variable and passed it in with -v. Or is it in the awk script already?
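
A possibly simpler sketch, assuming the timestamps are always in the zero-padded YYYY-MM-DD HH:MM form shown in the sample output: strings in that form sort lexicographically in chronological order, so awk can compare them as plain strings without calling date at all. The function name filter_newer is hypothetical, and last_ts is assumed to be passed in the same "YYYY-MM-DD HH:MM" form.

```shell
# Hypothetical helper: print only the lines whose "date time" prefix
# is strictly later than the timestamp given as the first argument.
# Assumes field 1 is a zero-padded YYYY-MM-DD date and field 2 is HH:MM.
filter_newer() {
    # Concatenating $1 and $2 yields a string, so awk compares it
    # with last_ts as a string (lexicographically), not as a number.
    awk -v last_ts="$1" '($1 " " $2) > last_ts'
}
```

It could then be used on the end of the pipeline from the question, for example: hadoop fs -ls /path/to/donemarkerfiles/ | sort -k 6 | awk '{print $6" "$7" "$8}' | filter_newer "$last_ts". This avoids starting one date process per line, but it would not work if the date or time format ever varied.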

Upvotes: 2
