Reputation: 41
I have a file with a bunch of CSV lines, with values with and without quotes, like so:
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
The fifth column is a text column which may contain escaped or unescaped double quotes. I am trying to get rid of all the stray quotes inside this column so that it looks like this:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
Any ideas how to achieve this using SED or AWK, or any other unix tools? Much appreciated!
Upvotes: 0
Views: 707
Reputation: 625
Try this regex:
,\d{2}\,(.*),\"\S{25}\",\"\w{3}"
It was made based on your examples. The goal is just to capture the fifth column. As @Jerry Jeremiah suggested, the point is to use the date, which will always be 25 characters long. To prevent mismatches I've also taken into account the two digits present before the fifth column and the three letters/digits after the date. Regex101v1
We can also use a "stronger" regex by matching the exact date format:
,\d{2}\,(.*),\"\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}-\d{2}:\d{2}\",\"\w{3}"
With these regexes you'll be able to extract the fifth column using the capture group. To go further, you can do this in bash:
regex='^(.*,[0-9]{2}\,")(.*)(",\"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}\",\"[a-zA-Z]{3}".*$)'

while IFS= read -r line
do
    if [[ $line =~ $regex ]]
    then
        before=${BASH_REMATCH[1]}
        fifth=${BASH_REMATCH[2]}
        after=${BASH_REMATCH[3]}
        reworked_fifth="${fifth//\"}"
        echo "${before}${reworked_fifth}${after}"
    else
        echo "Line didn't match the regex"
    fi
done < /my/file/path
I had to change the regex since my bash didn't accept \d and \w. No need to sed or awk anything with this; Bash can handle it alone.
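That said, since the question also asks about sed: the same date-anchoring idea works as a sed substitution loop that deletes one stray quote per pass until only the enclosing quotes of the fifth column remain. This is only a sketch against the sample lines; the date pattern and the three-character final column are assumptions carried over from the examples:

```shell
# Delete one quote per pass from inside the 5th column. The trailing
# "date","XXX", columns anchor the match; ta loops until no substitution
# happens (i.e. only the enclosing quotes are left).
sed -E ':a
s/^(([^,]*,){4}"[^"]*)"(.*",("[0-9]{4}[^"]*","[[:alnum:]]{3}",))$/\1\3/
ta' <<'EOF'
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
EOF
```

Replace the here-document with your real file to use it; the `.*",` part of the guard keeps the closing quote of the fifth column from being deleted.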
Upvotes: 0
Reputation: 5006
With awk, you can do something like this that avoids a very complex regex. The facts that only the fifth column is broken, that the previous columns do not contain commas, and that there is a known, fixed number of columns make it easy to repair:
Edited to use gsub for portability, as suggested by Ed Morton.
awk '
BEGIN { FS = OFS = "," }
{
    # glue the pieces of the broken 5th column back together
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
}
{
    gsub(/"/, "", $5)
}
{ print $1, $2, $3, $4, "\"" $5 "\"", $(NF-2), $(NF-1), $NF }
' <file>
Output:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
If you want to escape the quotes instead, you can use this:
awk '
BEGIN { FS = OFS = "," }
{
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
}
{
    gsub(/^"|"$/, "", $5)
    gsub(/"/, "\\\"", $5)
    $5 = "\"" $5 "\""
}
{ print $1, $2, $3, $4, $5, $(NF-2), $(NF-1), $NF }
' <file>
Output:
"123","456",,17,"hello,\" how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a \"meeting\", unprepared while trying to be \"awake","2018-05-29T18:58:10-05:00","ACD",
Upvotes: 2
Reputation: 203209
With GNU awk for the 3rd arg to match() and assuming you know how many fields there should be in each line:
$ cat tst.awk
BEGIN {
    numFlds = 8
    badFldNr = 5
}
match($0,"^(([^,]*,){"badFldNr-1"})(.*)((,[^,]*){"numFlds-badFldNr"})",a) {
    gsub(/"/,"",a[3])
    print a[1] "\"" a[3] "\"" a[4]
}
$ awk -f tst.awk file
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
With other awks you can do the same with a couple of calls to match() and variables instead of the array.
Upvotes: 0
Reputation: 26471
Your question is very difficult to answer in a generic way. To give an example:
"a","b","c","d"
How should this be interpreted (if we remove the quotes from the fields of interest)?
"a","b","c","d" (4 fields)
"a,b","c","d" (3 fields, $1 messed up)
"a","b,c","d" (3 fields, $2 messed up)
"a","b","c,d" (3 fields, $3 messed up)
"a,b,c","d" (2 fields, $1 messed up)
"a,b","c,d" (2 fields, $1 and $2 messed up)
"a","b,c,d" (2 fields, $2 messed up)
"a,b,c,d" (1 field , $1 messed up)
The only way this can be solved is by having the following knowledge: which field is broken (n below) and how many fields each record should have (nf below).
The following awk program will help you fix it:
$ awk 'BEGIN{ ere = "[^,]*|\042[^\042]*\042" }
{ head=tail=""; mid=$0 }
# extract the head which is correct
(n>1) {
    ere_h = "^"
    for (i=1; i<n; ++i) ere_h = ere_h (ere_h=="^" ? "" : ",") "(" ere ")"
    match(mid, ere_h); head = substr(mid,RSTART,RLENGTH)
    mid = substr(mid,RLENGTH+1)
}
# extract the tail which is correct
(nf>n) {
    ere_t = "$"
    for (i=n+1; i<=nf; ++i) ere_t = "(" ere ")" (ere_t=="$" ? "" : ",") ere_t
    match(mid, ere_t); tail = substr(mid,RSTART,RLENGTH)
    mid = substr(mid,1,RSTART-1)
}
# correct the mid part
{
    gsub(/\042/,"",mid)
    mid = (mid ~ /^,/) ? ( ",\042" substr(mid,2) ) : ( "\042" mid )
    mid = (mid ~ /,$/) ? ( substr(mid,1,length(mid)-1) "\042," ) : ( mid "\042" )
}
# print the stuff
{ print head mid tail }' n=5 nf=7 file
Upvotes: 0