Reputation: 13
The tab-separated log records from one of my applications look like this:
Time UserId CustomField CityId
2015-06-16-12:36:39 _v0YurN20wyj5h5QNIfoKA st=prefooter300x253;aa=855677;aam=91363629792766391842337900189790343745;kw=4onews;kw=5vo1bw;kw=671l7s;sqt=4 1023191
2015-06-16-12:00:08 7ovC6FHLKjMxJpiZHvlDGQ st=xrailtop300x250;aam=86662686616919269952594761014252363053;kw=240000;kw=240001;kw=240002;kw=240003;kw=240004;kw=240005;kw=240006;kw=240007;kw=240008;px=240002;px=240003;sov=4;sqt=4 1028057
2015-06-16-12:04:41 ZBV9KBZjMmkOcst7j2r8wA st=yrailtop300x250;aam=67657135077785797411906987077419372156;kw=top_of_the_rock_news;rfsh=0;sov=14;sqt=9 1025202
2015-06-16-13:05:42 ABf9KBZjMmkOcst7j2r8w4 st=yrailtop300x250;aam=95657135077785797411906987077419372142;kw=liquid_cow_found_on_Mars;kw=2305;kw=stars_don't_care_about_astronomy;rfsh=0;sov=14;sqt=9 1025202
2015-06-16-13:05:42 1tf9KBZjMmkOcst7j2r8y2 st=yrailtop300x250;kw=liquid_cow_found_on_Mars;rfsh=0;sov=14;sqt=9 1025202
I need to use awk to pre-process these records before ingesting them into a database. I want to keep only Time, UserId, and parts of the CustomField: always the "aam" value when it is present, and a "kw" value only when the string is longer than 16 characters. I can probably leave out the kw part or deal with it later.
Edit: The desired output would look like this:
Time UserId RecordNo NewsItem1 NewsItem2
2015-06-16-12:36:39 _v0YurN20wyj5h5QNIfoKA aam=91363629792766391842337900189790343745 NA NA
2015-06-16-12:00:08 7ovC6FHLKjMxJpiZHvlDGQ aam=86662686616919269952594761014252363053 NA NA
2015-06-16-12:04:41 ZBV9KBZjMmkOcst7j2r8wA aam=67657135077785797411906987077419372156 kw=top_of_the_rock_news NA
2015-06-16-13:05:42 ABf9KBZjMmkOcst7j2r8w4 aam=95657135077785797411906987077419372142 kw=liquid_cow_found_on_Mars kw=stars_don't_care_about_astronomy
2015-06-16-13:05:42 1tf9KBZjMmkOcst7j2r8y2 NA kw=liquid_cow_found_on_Mars NA
Edit2: I accepted the answer. Following Ed's suggestion, I added two more records that were not in the original post, covering unusual cases (no aam value, or multiple legitimate kw values). If multiple kw values are found, only the first two are kept, in NewsItem1 and NewsItem2; the rest are ignored.
Upvotes: 0
Views: 70
Reputation: 203169
$ cat tst.awk
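BEGIN { FS=OFS="\t" }
{
    if (NR==1) {
        # header line: emit the new column names
        aam = "RecordNo"
        kw  = "NewsItem"
    }
    else {
        # default both output fields to NA, then scan CustomField
        aam = kw = "NA"
        split($3,a,/;/)
        for (i=1; i in a; i++) {
            if (a[i] ~ /^aam/) {
                aam = a[i]
            }
            # length() here counts the "kw=" prefix as well
            if ( (a[i] ~ /^kw/) && (length(a[i])>16) ) {
                kw = a[i]
            }
        }
    }
    print $1, $2, aam, kw
}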
$ awk -f tst.awk file
Time UserId RecordNo NewsItem
2015-06-16-12:36:39 _v0YurN20wyj5h5QNIfoKA aam=91363629792766391842337900189790343745 NA
2015-06-16-12:00:08 7ovC6FHLKjMxJpiZHvlDGQ aam=86662686616919269952594761014252363053 NA
2015-06-16-12:04:41 ZBV9KBZjMmkOcst7j2r8wA aam=67657135077785797411906987077419372156 kw=top_of_the_rock_news
You didn't say or show what you want to happen if multiple kw values longer than 16 chars are present or what you want to do if aam is absent. If either of those can happen, edit the sample input/output in your question to show it.
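For the two extra cases described in the question's second edit (no aam value, and more than one long kw value), here is a minimal sketch of one way the script could be extended. It is not the only approach, and it rests on a few assumptions: the length test still counts the "kw=" prefix, as in the script above; only the first two qualifying kw values are kept; and the file names tst2.awk and sample.tsv are hypothetical, chosen just for this example.

```shell
# tst2.awk is a hypothetical file name for this sketch.
cat > tst2.awk <<'EOF'
BEGIN { FS=OFS="\t" }
NR==1 { print $1, $2, "RecordNo", "NewsItem1", "NewsItem2"; next }
{
    # default all three output fields to NA
    aam = kw[1] = kw[2] = "NA"
    numKw = 0
    n = split($3, a, /;/)
    for (i=1; i<=n; i++) {
        if (a[i] ~ /^aam=/) {
            aam = a[i]
        }
        # keep at most the first two kw entries longer than 16 chars
        # (the length includes the "kw=" prefix, as in the original script)
        else if ( (a[i] ~ /^kw=/) && (length(a[i]) > 16) && (numKw < 2) ) {
            kw[++numKw] = a[i]
        }
    }
    print $1, $2, aam, kw[1], kw[2]
}
EOF

# sample.tsv: the header plus three of the records from the question,
# written with real tabs between the four fields.
printf '%s\t%s\t%s\t%s\n' \
    "Time" "UserId" "CustomField" "CityId" \
    "2015-06-16-12:36:39" "_v0YurN20wyj5h5QNIfoKA" "st=prefooter300x253;aa=855677;aam=91363629792766391842337900189790343745;kw=4onews;kw=5vo1bw;kw=671l7s;sqt=4" "1023191" \
    "2015-06-16-13:05:42" "ABf9KBZjMmkOcst7j2r8w4" "st=yrailtop300x250;aam=95657135077785797411906987077419372142;kw=liquid_cow_found_on_Mars;kw=2305;kw=stars_don't_care_about_astronomy;rfsh=0;sov=14;sqt=9" "1025202" \
    "2015-06-16-13:05:42" "1tf9KBZjMmkOcst7j2r8y2" "st=yrailtop300x250;kw=liquid_cow_found_on_Mars;rfsh=0;sov=14;sqt=9" "1025202" \
    > sample.tsv

awk -f tst2.awk sample.tsv
```

Anchoring the patterns with "=" (/^aam=/ and /^kw=/) is a small tightening so that a hypothetical key such as "kwx" would not be picked up by accident. Drop it if your data needs the looser match.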
Upvotes: 2