Googlebot
Googlebot

Reputation: 15683

How to check successive values in a CSV file by awk?

In a two-column CSV, I want to check the successive numbers (the values are integers):

1,1
4,1
5,1
6,1
7,1
10,1
11,1
4,2
5,2
6,2
10,2
5,3
5,7
5,8
5,9

I tried

awk -F, '
{
    if($1==a+1) { n+=1 } 
    else { n = 1 } 
{a=$1}
} {print $1"," n}
' file.txt

to produce

1,1,1
4,1,1
5,1,2
6,1,3
7,1,4
10,1,1
11,1,2
4,2,1
5,2,2
6,2,3
10,2,1
5,3,1
5,7,1
5,8,1
5,9,1

where the third column counts the successive rows with respect to the first column.

  1. Is my approach is correct (particularly performance-wise because it is a huge file)?
  2. What is the best way to calculate the fourth column based on the second column (counting the successive values in the second column)?

This is the expected output including the fourth column.

1,1,1,1
4,1,1,1
5,1,2,1
6,1,3,1
7,1,4,1
10,1,1,1
11,1,2,1
4,2,1,1
5,2,2,1
6,2,3,1
10,2,1,1
5,3,1,1 
5,7,1,1
5,8,1,2
5,9,1,3

My idea is to sort the produced 3-column CSV (sort -t, -k1,1n -k2,2n) and do the same thing with the second column, but I am not sure.

EDIT: I made a mistake in the first dataset. When we count the first column, the second column is constant, and vice versa.

Upvotes: 0

Views: 54

Answers (2)

anubhava
anubhava

Reputation: 785721

This awk should work for you:

awk 'BEGIN {
   FS=OFS=","
}
{
    $3 = ($1 == p1+1 ? $3+1 : 1)
    $4 = ($1 == p1 && $2 == p2+1 ? $4+1 : 1)
}
{
   p1 = $1
   p2 = $2
} 1' file.csv

1,1,1,1
4,1,1,1
5,1,2,1
6,1,3,1
7,1,4,1
10,1,1,1
11,1,2,1
4,2,1,1
5,2,2,1
6,2,3,1
10,2,1,1
5,3,1,1
5,7,1,1
5,8,1,2
5,9,1,3

Upvotes: 3

RavinderSingh13
RavinderSingh13

Reputation: 133680

Could you please try following, written and tested on shown samples(basically re-designed OP's code in more awkish way :) , thanks to OP for letting know in comments).

awk '
BEGIN{
  FS=OFS=","
}
{
  count=$1-prev==1 && prev?++count:"1"
  print $0,count
  prev=$1
}
'  Input_file

Explanation:

  • In this approach we haven't re-assigned current line and simply printing the value of current line along with count value, which should be less time taking.

Output will be as follows.

1,10,1
4,10,1
5,10,2
6,11,3
7,14,4
10,9,1
11,2,2

NOTE: In case you want to sort output after awk's calculations then we could further use | sort to perform sorting on 3rd column IMHO.

Upvotes: 2

Related Questions