SonicProtein
SonicProtein

Reputation: 850

Awk splitting a line by spaces where there are spaces in each field

I've got an R summary table like so:

       employee     salary        startdate
 John Doe  :1   Min.   :21000   Min.   :2007-03-14
 Jolie Hope:1   1st Qu.:22200   1st Qu.:2007-09-18
 Peter Gynn:1   Median :23400   Median :2008-03-25
                Mean   :23733   Mean   :2008-10-02
                3rd Qu.:25100   3rd Qu.:2009-07-13
                Max.   :26800   Max.   :2010-11-01

and I need to produce an output csv file like so:

employee,,salary,,startdate,,
John Doe,1,Min.,21000,Min.,2007-03-14
Jolie Hope,1,1st Qu.,22200,1st Qu.,2007-09-18
Peter Gynn,1,Median,23400,Median,2008-03-25
,,Mean,23733,Mean,2008-10-02
,,3rd Qu.,25100,3rd Qu.,2009-07-13
,,Max.,26800,Max.,2010-11-01

so that in excel it looks something like this:

output in excel

However it doesn't suffice to split the fields by one or more white spaces,

 awk -F "[ ]+" '{ print $3 }'

It works for the header, but not for the remaining lines:

salary
Doe
Hope:1
Gynn:1
:23733
Qu.:25100
:26800

Is this problem solvable using awk (and maybe sed)?

Upvotes: 2

Views: 394

Answers (2)

NeronLeVelu
NeronLeVelu

Reputation: 10039

sed '1 {
   s/^[[:space:]]*\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)/\1,,\2,,\3,/
   b
   }

s/[[:space:]]\{1,\}:/:/g

/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\(.[^[:space:]]*\)/ {
   s//\1,\2,\3,\4,\5,\6/
   b
   }
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)/ {
   s//,,\1,\2,\3,\4/
   b
   }
' YourFile

sed one, just for the fun if you need to adapt a bit in this ArachnoRegEx
awk is lot more interesting in this case mainly for any adaptation to add later but if you only have access to sed ...

Upvotes: 3

Ed Morton
Ed Morton

Reputation: 203169

This uses GNU awk for FIELDWIDTHS, etc. and relies on the first line of input after the header always having all fields populated. It includes the positions that are just :s as output fields, I expect you can figure out how to skip those if you do want to use this solution:

$ cat tst.awk
BEGIN { OFS="," }
NR==1 {
    for (i=1;i<=NF;i++) {
        printf "%s%s", $i, (i<NF?OFS OFS OFS:ORS)
    }
    next
}
NR==2 {
    tail = $0
    while ( match(tail,/([^:]+):(\S+(\s+|$))/,a) ) {
        FIELDWIDTHS = FIELDWIDTHS length(a[1]) " 1 " length(a[2]) " "
        tail = substr(tail,RSTART+RLENGTH)
    }
    $0 = $0
}
{
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)
    }
    print
}

$ awk -f tst.awk file
employee,,,salary,,,startdate
John Doe,:,1,Min.,:,21000,Min.,:,2007-03-14
Jolie Hope,:,1,1st Qu.,:,22200,1st Qu.,:,2007-09-18
Peter Gynn,:,1,Median,:,23400,Median,:,2008-03-25
,,,Mean,:,23733,Mean,:,2008-10-02
,,,3rd Qu.,:,25100,3rd Qu.,:,2009-07-13
,,,Max.,:,26800,Max.,:,2010-11-01

Upvotes: 1

Related Questions