Reputation: 937
I have a (very big) csv-file with following format:
id;surname;firstname;aliases
1;Simpson;Homer;Homer Jay Simpson,Homer J. Simpson
2;Simpson;Bart;Bartholomew JoJo Simpson,Bartholomew Simpson
3;Krusty the Clown;;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Simpson;Lisa;
Now I want to transform that into following format:
id;name
1;Homer Simpson
1_1;Homer Jay Simpson
1_2;Homer J. Simpson
2;Bart Simpson
2_1;Bartholomew JoJo Simpson
2_2;Bartholomew Simpson
3;Krusty the Clown
3_1;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Lisa Simpson
For performance reasons I would like to do that with awk
or another UNIX-command line tool.
With awk -F ';' '{print $1, $3, $2}'
I can separate the semicolon separated line. But how do I use awk
in awk
for splitting the comma separated entry again?
Upvotes: 1
Views: 146
Reputation: 189327
Awk has a split
function which lets you split strings into arrays.
awk -F ';' 'BEGIN { OFS=FS }
{ print $1, $3 " " $2
n = split($4, alias, /,/)
for(i=1; i<=n; i++)
print $1 "_" i, alias[i] }' file.csv
The return value from split
tells you how many members there are in the result array.
Upvotes: 0
Reputation: 5056
This will do as you intend in Python 3. Notice that I typed it quickly so many improvements can be done. I believe it might be faster than awk, but I might be wrong. You can test if so using time command in Linux and Mac.
#!/usr/local/bin/python3
import csv
csvr = csv.reader(open('simpsons.csv'), delimiter = ";")
index=0
for row in csvr:
if index == 0:
index = index +1
continue
print("{};{} {}".format(index,row[2],row[1]))
sindex=0
for sitem in row[3].split(','):
if sitem != "" :
sindex = sindex + 1
print("{};{}".format(row[0] + "_" + str(sindex),sitem))
index = index +1
Hope it helps!
I generated a dummy list of 500k rows and tested some of the answers given by users here and it does not seem to be any important difference between Python 3 and awk. ( At least in my poor implementation in Python 3).
$ time awk -f tst.awk fivehundredthousand.txt &> /dev/null
real 0m2.141s
user 0m2.118s
sys 0m0.020s
$ time ./handle_csv.py >/dev/null
real 0m1.750s
user 0m1.722s
sys 0m0.021s
$ time awk -f ravinder.awk fivehundredthousand.txt &> /dev/null
real 0m1.736s
user 0m1.718s
sys 0m0.017s
Upvotes: 0
Reputation: 203229
$ cat tst.awk
BEGIN { FS=OFS=";" }
NR==1 {
print $1, "name"
next
}
{
name = $3 " " $2
gsub(/^ +| +$/,"",name)
print $1, name
n = split($NF,aliases,/,/)
for (i=1; i<=n; i++) {
print $1 "_" i, aliases[i]
}
}
$ awk -f tst.awk file
id;name
1;Homer Simpson
1_1;Homer Jay Simpson
1_2;Homer J. Simpson
2;Bart Simpson
2_1;Bartholomew JoJo Simpson
2_2;Bartholomew Simpson
3;Krusty the Clown
3_1;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Lisa Simpson
Upvotes: 4
Reputation: 133458
Could you please try following (written and tested with shown samples).
awk '
BEGIN{
FS="[;,]"
OFS=";"
print "id;name"
}
FNR>1{
j=$2~/ /?2:3
for(i=j;i<=NF;i++){
if($i==""){
continue
}
if(i==j){
print $1,$3" "$2
}
else{
print $1"_"++c,$i
}
}
c=""
}' Input_file
Output will be as follows.
id;name
1;Homer Simpson
1_1;Homer Jay Simpson
1_2;Homer J. Simpson
2;Bart Simpson
2_1;Bartholomew JoJo Simpson
2_2;Bartholomew Simpson
3; Krusty the Clown
3_1;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Lisa Simpson
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS="[;,]" ##Setting field as either semi-colon OR comma for all lines.
OFS=";" ##Setting output field separator semi-colon.
print "id;name" ##Printing id;name string before reading Input_file.
} ##Closing BLOCK for BEGIN block of this awk program here.
FNR>1{ ##Checking condition if FNR>1 then do following.
j=$2~/ /?2:3
for(i=j;i<=NF;i++){ ##Running a for loop from i=j to till number of fields of line.
if($i==""){ ##Checking condition if current field value is NULL then do following.
continue ##Using continue to take cursor to for loop again here.
}
if(i==j){ ##Checking condition if i==3 then do following.
print $1,$3" "$2 ##Printing first, 3rd,space and 2nd field of line here.
}
else{ ##If above if condition is false then come to this else here.
print $1"_"++c,$i ##Printing first field underscore variable c value, value of current field here.
}
}
c="" ##Nullifying variable c here.
}
' Input_file ##Mentioning Input_file name here.
Upvotes: 2