Reputation: 902
My shell is a bit rusty so I would greatly appreciate some help in parsing the following data.
Each row in the input file contains data separated by comma.
[name, record_timestamp, action, field_id, field_name, field_value, number_of_fields]
The rows are instructions to create or update information about persons. So for example the first line says that the person John Smith will be created and that the following 6 rows will contain information about him.
The field_id number always represent the same field.
input.csv
John Smith,2017-03-03 11:56:02,create,,,,6
,,,,1,BIRTH_DATE,1985-02-16,,
,,,,2,BIRTH_CITY,Portland,,
,,,,3,SEX,Male,,
,,,,5,CITY,Seattle,,
,,,,7,EMPLOYER,Microsoft,,
,,,,9,MARRIED,Yes,,
Susan Anderson,2017-03-01 12:09:36,create,,,,8
,,,,1,BIRTH_DATE,1981-09-12,,
,,,,2,BIRTH_CITY,San Diego,,
,,,,3,SEX,Female,,
,,,,5,CITY,Palo Alto,,
,,,,7,EMPLOYER,Facebook,,
,,,,8,SALARY,5612,,
,,,,9,MARRIED,No,,
,,,,10,TELEPHONE,5107586290,,
Brad Bradly,2017-02-29 09:15:12,update,,,,3
,,,,3,SEX,Male,,
,,,,7,EMPLOYER,Walmart,,
,,,,9,MARRIED,No,,
Sarah Wilson,2017-02-28 16:21:39,update,,,,5
,,,,2,BIRTH_CITY,Miami,,
,,,,3,SEX,Female,,
,,,,7,EMPLOYER,Disney,,
,,,,8,SALARY,5110,,
,,,,9,MARRIED,Yes,,
I want to parse each of these persons into comma separated strings that looks like this:
name,birth date,birth city,sex,employer,salary,marrage status,record_timestamp
but we should only output such a string if both the fields birth date and birth city or both the fields employer and salary are available for that person. Otherwise just leave it empty (see example below).
Given our input above the output should then be
John Smith,1985-02-16,Portland,Male,,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,,Female,Disney,5110,Yes,2017-02-28 16:21:39
I've figured out that I should probably do something along the following lines. But then I cannot figure out how to implement an inner loop or if there is some other way to proceed.
#!/bin/bash
IFS=','
cat test.txt | while read -a outer
do
echo ${outer[0]}
#...
done
Thanks in advance for any advice!
Upvotes: 0
Views: 428
Reputation: 203491
A UNIX shell is an environment from which to call UNIX tools (and manipulate files and processes) with a language to sequence those calls. It is NOT a tool to manipulate text.
The standard UNIX tool to manipulate text is awk:
$ cat tst.awk
BEGIN {
numFlds=split("name BIRTH_DATE BIRTH_CITY SEX EMPLOYER SALARY MARRIED timestamp",nr2name)
FS=OFS=","
}
$1 != "" {
prtRec()
rec["name"] = $1
rec["timestamp"] = $2
next
}
{ rec[$6] = $7 }
END { prtRec() }
function prtRec( fldNr) {
if ( ((rec["BIRTH_DATE"] != "") && (rec["BIRTH_CITY"] != "")) ||
((rec["EMPLOYER"] != "") && (rec["SALARY"] != "")) ) {
for (fldNr=1; fldNr<=numFlds; fldNr++) {
printf "%s%s", rec[nr2name[fldNr]], (fldNr<numFlds ? OFS : ORS)
}
}
delete rec
}
$ awk -f tst.awk file
John Smith,1985-02-16,Portland,Male,Microsoft,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,Miami,Female,Disney,5110,Yes,2017-02-28 16:21:39
Any time you have records consisting of name+value data like you do, the approach that results in by far the simplest, clearest, most robust, and easiest to enhance/debug code is to first populate an array (rec[]
above) containing the values indexed by the names. Once you have that array it's trivial to print and/or manipulate the contents by their names.
Upvotes: 2
Reputation: 6079
Avoid IFS hacks like the plague. They are ugly stuff.
Play with the -d option to read to specify the comma as delimiter.
Upvotes: 0
Reputation: 67497
awk
to the rescue!
awk -F, 'function pr(a) {if(!(7 in a && 8 in a)) a[7]=a[8]="";
if(!(1 in a && 2 in a)) a[1]=a[2]="";
for(i=0;i<=10;i++) printf "%s,",a[i];
printf "%s\n", a["ts"]}
NR>1 && $1!="" {pr(a); delete a}
$1!="" {a[0]=$1; a["ts"]=$2}
$1=="" {a[$5]=$7}
END {pr(a)}' file
this should cover the general case, and conditioned fields. You may want to filter out the other fields you don't need.
This will print for your input
John Smith,1985-02-16,Portland,Male,,Seattle,,,,Yes,,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,,Palo Alto,,Facebook,5612,No,5107586290,2017-03-01 12:09:36
Brad Bradly,,,Male,,,,,,No,,2017-02-29 09:15:12
Sarah Wilson,,,Female,,,,Disney,5110,Yes,,2017-02-28 16:21:39
Upvotes: 1
Reputation: 20002
Use awk
or something like
while IFS=, read -r name timestamp action f_id f_name f_value nr_fields; do
if [ -n "${name}" ]; then
# proces startrecord, store the fields you need for the next line
else
# process next record
fi
done < test.txt
Upvotes: 1