Reputation: 13686
I have a csv file which contains millions of rows. Now few of the rows contains more data then column data type can accommodate. For e.g. csv file has only two rows as shown below please not ^_ is delimiter
A^_B^_C
AA^_BB^_CC
Now assume each line can accomodate only one character so line 1 row 1 is correct but line 2 is not and I want to find out all these kinds of lines.
So I thought if I get the longest lines from csv file I will be good and I tried to following but is not helping (from longest line)
wc -L file
Please help me find the largest line/column in a csv file. Another problem is I have two delimiter so cant use cut command also.
Upvotes: 0
Views: 2099
Reputation: 2514
Here's an answer that requires defining the column lengths in a single line file using the same delimiter as the data ( assuming different columns can have different acceptable lengths ):
1^_1^_1
Using that file (which I called clengths
) and using split()
in an lazy way to get indexed elements:
awk -F'\\^_' '
NR==FNR {split($0,clen,FS); next} # store the lengths
{
split($0,a,FS); # put the current line in an array
for( i in a )
{ if( length(a[i]) > clen[i] ) print "["FNR","i"] = "a[i] }
}
' clengths data
This outputs array styled indexes for the long data as [row, col]
aka [line #, field #]
starting at [1,1]
:
[2,1] = AA
[2,2] = BB
[2,3] = CC
Everything in the output is "too big" and is indexed to make finding it again easier.
Upvotes: 0
Reputation: 203985
awk -F'\\^_' -v OFS=':' '
{
for (i=1;i<=NF;i++) {
if (length($i) > max) {
max = length($i)
lineNr = NR
line = $0
fldNr = i
fld = $i
}
}
}
END {
print lineNr, line
print fldNr, fld
}
' file
Upvotes: 2
Reputation: 77135
You can try something like:
awk '
{
for(i=1;i<=NF;i++)
if (length($i) == good) { continue }
else {
print "Row "NR" contained data more than " good" in a single field"
next
}
print "Row "NR " is valid"
}' FS='\\^_' good="1" csv
Row 1 is valid
Row 2 contained data more than 1 in a single field
Explanation:
\\^_
(note we need to escape ^ since it is a special character).good
) passed at run time.good
we print our message with row number using NR
.If I misunderstood your question, feel free to leave a comment.
Upvotes: 4