intruder

Reputation: 417

How to parse contents of a file using sed/awk?

My input file has its content in the following format, where the columns are separated by a single space:

string1<space>string2<space>string3<space>YYYY-mm-dd<space>hh:mm:ss.SSS<space>string4<space>10:1234567890<space>0e:Apple 1.2.3.4<space><space>string5<space>HEX  

There are two spaces after "0e:Apple 1.2.3.4" because the string has only 13 characters, so the 14th character of this field is a space. The whole of "0e:Apple 1.2.3.4<space>" is treated as the value of that single column, and the second space is the column separator.

In the 7th column, 10: represents the count of characters in the following string.

In the 8th column, 0e: is hexadecimal for 14. So the prefix gives the number of characters in the string that follows.

Like:

"0e:Apple 1.2.3.4 " --> this is the actual value of the 8th column, without the quotes
    (the quotes are only there to show that the 14th character is a space)

It's counted as
0e: A  p  p  l  e     1  .  2  .  3  .  4
    |  |  |  |  |  |  |  |  |  |  |  |  |  |
    1  2  3  4  5  6  7  8  9  10 11 12 13 14

Let's consider first row from the input file as:

string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e  

Expected output:

string1,string2,string3,yyyy-mm-dd,23:50:45.999,string4,1234567890,Apple_1.2.3.4,string5,30

Requirements:

  1. Remove the length prefixes from the 7th and 8th columns (10: and 0e:).
  2. Replace the space between Apple and 1.2.3.4 with "_".
  3. Convert the hex value in the last column to decimal.
  4. Replace the space between columns with ",".
  5. I've used a hex value only in the 10th column here. What if hex values appear in several columns? Is there a way to convert only specific columns?

I've tried using this:

$ cat input.txt |sed 's/[a-z0-9].*://g'  

which gives output as:

string1,string2,string3,yyyy-mm-dd,45.999,string4,1234567890,Apple,1.2.3.4,,string5,001e  

Upvotes: 4

Views: 925

Answers (1)

leekaiinthesky

Reputation: 5603

This will do what you want on your example input (the empty field $10 produced by the double space is deliberately left out of the printf):

awk -F "[ ]" '{sub(/.*:/, "", $7); sub(/.*:/, "", $8); printf "%s,%s,%s,%s,%s,%s,%s,%s_%s,%s,%d\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $11, "0x"$12}' input.txt

Explanation of parts:

awk printf allows you to specify an output format, so you can manually specify which fields you want to delimit with , and which you want to delimit with _.

-F "[ ]" forces the field separator to be a single space so that it knows there is an empty field between two single spaces. The default behavior would be to allow multiple spaces to be a single delimiter, which is not what you want according to the question.
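To see why the single-space separator matters, here is a small sketch of the same two splitting behaviours on the example row. It is in Python rather than awk, purely for illustration: str.split(" ") plays the role of -F "[ ]", and str.split() behaves like awk's default separator.

```python
line = ("string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 "
        "10:1234567890 0e:Apple 1.2.3.4  string5 001e")

# Splitting on exactly one space keeps the empty field created by the
# two consecutive spaces (comparable to awk -F "[ ]").
single = line.split(" ")

# Splitting on runs of whitespace collapses them into one delimiter
# (comparable to awk's default field splitting).
collapsed = line.split()

print(len(single), repr(single[9]))  # the 10th field is the empty string
print(len(collapsed))
```

With the single-space split, the hex value ends up in field 12; with the default split it would be field 11.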

The sub function allows you to do regular expression replacement, in this case removing the ..: prefix in fields 7 and 8.

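For field 12, we tell printf to output a number (%d) and pass it the string prefixed with 0x so that awk interprets it as hexadecimal. Note that this depends on the awk implementation: GNU awk does not treat strings as hexadecimal by default, so there you would use strtonum("0x" $12) instead.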
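The question also asks what to do if hex values appear in several columns. One possible approach, sketched here in Python, is to keep a set of column numbers and convert only those. The function name hex_columns_to_decimal and the sample fields (including "ff") are made up for illustration and are not from the question.

```python
def hex_columns_to_decimal(fields, hex_columns):
    """Return a copy of fields with the given 1-based columns converted
    from hexadecimal text to decimal text."""
    converted = list(fields)
    for col in hex_columns:
        # int(text, 16) accepts leading zeros, e.g. "001e".
        converted[col - 1] = str(int(converted[col - 1], 16))
    return converted


# Hypothetical sample: columns 2 and 3 hold hex values.
fields = ["string5", "001e", "ff"]
result = hex_columns_to_decimal(fields, {2, 3})
print(",".join(result))
```

Columns that are not listed are passed through unchanged, so the same function works whether one column or several contain hex.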

Note: If it's not always the case that you want the output to be $8_$9, then you actually need to parse the hexadecimal prefix and count off characters in order to determine where the field ends. If that's the case, I would personally prefer to write the whole thing in something else, e.g. Python.
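For what it's worth, a length-aware version in Python might look roughly like the sketch below. It is not a tested solution for your real data. It assumes that column 7 has a decimal length prefix and column 8 a hexadecimal one (which is what the example row suggests), that column 10 is the only hex column, and that columns are separated by exactly one space. The name parse_line and its parameters are my own.

```python
def parse_line(line, prefixed=None, hex_columns=(10,)):
    """Split one record into fields, honouring length prefixes.

    prefixed maps a 1-based column number to the base of its length
    prefix. The default (decimal in column 7, hex in column 8) is an
    assumption read off the example row in the question.
    """
    if prefixed is None:
        prefixed = {7: 10, 8: 16}
    fields = []
    pos = 0
    col = 0
    while pos < len(line):
        col += 1
        if col in prefixed:
            # Read "<count>:" and then exactly <count> characters.
            colon = line.index(":", pos)
            length = int(line[pos:colon], prefixed[col])
            start = colon + 1
            value = line[start:start + length]
            pos = start + length
            # Drop trailing padding and turn inner spaces into "_".
            value = value.strip().replace(" ", "_")
        else:
            # Ordinary column: read up to the next space (or end of line).
            end = line.find(" ", pos)
            if end == -1:
                end = len(line)
            value = line[pos:end]
            pos = end
        if col in hex_columns:
            value = str(int(value, 16))
        fields.append(value)
        pos += 1  # skip the single-space column separator
    return fields


line = ("string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 "
        "10:1234567890 0e:Apple 1.2.3.4  string5 001e")
print(",".join(parse_line(line)))
```

When reading from a file, strip only the newline (for example with line.rstrip("\n")) before calling it, since other trailing whitespace could be part of a length-prefixed value.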

Upvotes: 2
