Pawan Mude
Pawan Mude

Reputation: 1659

How to count occurence of a string in csv file

I've CSV file

author,host,authority,contents
_angelsuman,http://twitter.com/_angelsuman,5,green tea piyo :( #kicktraileron6thjune
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: grapefruit  watermelon  berries  hot peppers  celery  greek yogurt  eggs  fish  green tea  coffee  water  oatmeal.
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: â´ grapefruit â´ watermelon â´ berries â´ hot peppers â´ celery â´ greek yogurt â´ eggs â´ fish â´ green tea â´ oatmeal
anukshp,http://twitter.com/anukshp,4,rt @_angelsuman dear green tea u suck..:/ but i need to sip uh for myh rsn :( zindagi ka kdwa such :/ :(

I want to identify count of occurrences of first column :"author" in fourth column "contents"

Ex: finding "_angelsuman" in contents.

Kindly suggest; how can i achieve same?

Upvotes: 0

Views: 874

Answers (2)

Artur Siara
Artur Siara

Reputation: 166

You can do it as follows (assume that there is no comma in values as you said).

one-line:

awk -F, 'NR>1 {author[$1]=0; content[NR]=$4} END {for (a in author) {for (c in content) {count[a]+=gsub(a,"",content[c])} print a, count[a]}}' file

expanded:

awk -F, '
    NR>1 {
        author[$1]=0;
        content[NR]=$4
    }
    END {
        for (a in author) {
          for (c in content) {
              count[a] += gsub(a,"",content[c])
          }
          print a, count[a]
        }
    }' file

How it works

  • read file with a comma separator -F, and skip the first line NR>1

    awk -F, 'NR>1

  • store the first column in array author as the key - so every unique value will be stored once. Store the content in array content with the key equal of line number NR - this results of storing content from every line.

    {
    author[$1]=0;
    content[NR]=$4
    }
    
  • at the end iterate by every unique author for (a in author) and foreach author iterate by content for (c in content) and increase number of occurences of author in content for specific author count[a]+=gsub(a,"",content[c]). If it is counted for specific author, then print results print a, count[a].

    END {
        for (a in author) {
          for (c in content) {
            count[a]+=gsub(a,"",content[c])
          }
          print a, count[a]
        }
    }' file
    

Output

_angelsuman 1
anukshp 0

Upvotes: 1

Gilles Quénot
Gilles Quénot

Reputation: 185053

Using :

use Text::CSV;

my $col = 4; // 4th column

my $count = 0;
my @rows;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<:encoding(utf8)", "/tmp/test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    if ($row->[$col -1] eq 'author') {
        $count++;
    }
}
$csv->eof or $csv->error_diag();
close $fh;
print "There's $count occurences of 'author'\n";

Output :

There's 1 occurences of 'author'

Note :

This is a proper parsing with a perl module.

Replace /tmp/test.csv by your own file

Upvotes: 1

Related Questions