Umesh Kacha
Umesh Kacha

Reputation: 13686

How to find the longest column(hence line) in csv file in linux?

I have a csv file which contains millions of rows. Now few of the rows contains more data then column data type can accommodate. For e.g. csv file has only two rows as shown below please not ^_ is delimiter

A^_B^_C
AA^_BB^_CC

Now assume each line can accomodate only one character so line 1 row 1 is correct but line 2 is not and I want to find out all these kinds of lines.

So I thought if I get the longest lines from csv file I will be good and I tried to following but is not helping (from longest line)

wc -L file

Please help me find the largest line/column in a csv file. Another problem is I have two delimiter so cant use cut command also.

Upvotes: 0

Views: 2099

Answers (3)

n0741337
n0741337

Reputation: 2514

Here's an answer that requires defining the column lengths in a single line file using the same delimiter as the data ( assuming different columns can have different acceptable lengths ):

1^_1^_1

Using that file (which I called clengths) and using split() in an lazy way to get indexed elements:

awk -F'\\^_' '

  NR==FNR {split($0,clen,FS); next} # store the lengths

  {
    split($0,a,FS); # put the current line in an array
    for( i in a )
    { if( length(a[i]) > clen[i] ) print "["FNR","i"] = "a[i] }
  }
' clengths data

This outputs array styled indexes for the long data as [row, col] aka [line #, field #] starting at [1,1]:

 [2,1] = AA
 [2,2] = BB
 [2,3] = CC

Everything in the output is "too big" and is indexed to make finding it again easier.

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203985

awk -F'\\^_' -v OFS=':' '
{
    for (i=1;i<=NF;i++) {
        if (length($i) > max) {
            max = length($i)
            lineNr = NR
            line   = $0
            fldNr  =  i
            fld    = $i
        }
    }
}
END {
    print lineNr, line
    print fldNr, fld
}
' file

Upvotes: 2

jaypal singh
jaypal singh

Reputation: 77135

You can try something like:

awk '
{
for(i=1;i<=NF;i++) 
    if (length($i) == good) { continue } 
    else {
        print "Row "NR" contained data more than " good" in a single field"
        next
    }
    print "Row "NR " is valid"
}' FS='\\^_' good="1" csv
Row 1 is valid
Row 2 contained data more than 1 in a single field

Explanation:

  • We set the Field Separator to \\^_ (note we need to escape ^ since it is a special character).
  • We iterate over each field
  • We check the length of the field with the variable (good) passed at run time.
  • If the field is good, we continue the loop to the next field and continue to do so until the last field of the row.
  • If any field has a size greater than good we print our message with row number using NR.
  • If all the fields are good then we print the message stating row is good.

If I misunderstood your question, feel free to leave a comment.

Upvotes: 4

Related Questions