Reputation: 475

How can I count the number of entries of a column based on the distinct entries of another column

My data in a file is like below with multiple columns:

A                B             

Tiger         Animal         
Parrot        Bird
Lion          Animal
Elephant      Animal
Crow          Bird
Horse         Animal
Man           Human
Dog           Animal

I want to find the number of entries in column A corresponding to distinct entries in column B. If possible in R or may be a perl script for this.

Output as:

Animal 5
Bird   2
Human  1

Moreover, if possible to find out if the entries in column A has been repeated for the distinct entries in column B like

A              B   
Tiger         Animal         
Tiger         Animal

Upvotes: 1

Answers (10)

Frank

Reputation: 66819

In case anyone else comes by here, here are a couple more approaches that work.

myout   <-  lapply(split(anm,list(anm$B)),function(x)
            list(length(unique(x[,"A"])),x[duplicated(x),"A"])
        )
unlist(sapply(myout,function(x)x[1])) # counts in each category
sapply(myout,function(x)x[-1]) # list of duplicated names

or....

library(data.table)
mydt <- data.table(anm,key="B")
mydt[,.N,by=key(mydt)]
mydt[,.N,by="B,A"][N>1]

where....

anm = read.table(textConnection(
    "Tiger    Animal         
    Parrot    Bird
    Lion      Animal
    Elephant  Animal
    Crow      Bird
    Horse     Animal
    Man       Human
    Dog       Animal
    Tiger     Animal"))
names(anm) <- c("A","B")

EDIT: Edited in response to comment by Matthew Dowle (author of data.table).

Upvotes: 2

Anonymous

Reputation: 21

Not sure I get the full data structure in the file, but if you're on UNIX:

tr -s ' ' | sort -u | awk '{ print $2}' | sort | uniq -c


5 Animal
2 Bird
1 Human

The above works, even if I add this line: "Tiger Animal" at the end, because of the first sort -u.

The tr -s squeezes out multiple blank spaces (so the sort commands act as expected)

Upvotes: 2

Axeman

Reputation: 29854

In Perl (strict and warnings implied.)

my ( %uniq, %count_for );
# here $fh = some input source
while ( <$fh> ) {
    s/^\s+//; # trim left
    s/\s*$//; # trim right (and chomp)
    # This split allows for spaces between words in a single column
    # allows also for tab-delimited record
    my @cols = split /(?:\t|\s{2,})/;
    # Normalize the text and test for uniqueness:
    #
    # By these manipulations: 
    #     Tiger   Animal
    # matches
    #     Tiger      Animal
    # for any column irregularities
    next if $uniq{join('-',@cols)};

    # count occurrence.
    $count_for{$cols[1]}++;
}

Upvotes: 1

joran

Reputation: 173627

If you're more comfortable with SQL, here's a very short solution using the sqldf package in R:

anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
      B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))

library(sqldf)      
sqldf("select B,count(distinct A) tot from anm group by B")

sqldf("select B,A,count(*) num from anm group by B,A HAVING num > 1")

Upvotes: 1

abra

Reputation: 575

#!/usr/bin/env perl

use strict;
use warnings;
use File::Slurp qw(slurp);

exit unless $ARGV[0];

my @data = slurp($ARGV[0]);
my (%h);

for (@data) {
  chomp;
  map { next if /^(A|B)$/; $h{$_}++ } split ' ', $_;
}

map { print $_, ": ", $h{$_}, "\n" } keys %h;

usage:

$ perl script.pl columns.txt

Upvotes: 0

Richie Cotton

Reputation: 121097

tapply from base R will solve this nicely.

with(anm, tapply(A, B, function(x) length(unique(x))))

Upvotes: 5

Ramnath

Reputation: 55695

Here is an approach using the plyr package in R.

mydf = read.table(textConnection(
"Tiger    Animal         
Parrot    Bird
Lion      Animal
Elephant  Animal
Crow      Bird
Horse     Animal
Man       Human
Dog       Animal
Tiger     Animal"))

library(plyr)
ddply(mydf, .(V2), summarize, V3 = length(V1))

    V2    V3
1 Animal  6
2   Bird  2
3  Human  1

ddply(mydf, .(V2, V1), summarize, V3 = length(V1))

    V2         V1  V3
1 Animal      Dog  1
2 Animal Elephant  1
3 Animal    Horse  1
4 Animal     Lion  1
5 Animal    Tiger  2
6   Bird     Crow  1
7   Bird   Parrot  1
8  Human      Man  1

EDIT. Adds the names of animals in each category

 ddply(mydf, .(V2), summarize, 
    V3 = length(V1), 
    V4 = do.call("paste", as.list(unique(V1))))

      V2 V3                            V4
1 Animal  6 Tiger Lion Elephant Horse Dog
2   Bird  2                   Parrot Crow
3  Human  1                           Man

Upvotes: 1

Roman Luštrik

Reputation: 70643

This is a solution done in R. Is this what you were looking for?

> anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
+       B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
> anm
         A      B
1    Tiger Animal
2   Parrot   Bird
3     Lion Animal
4 Elephant Animal
5     Crow   Bird
6    Horse Animal
7      Man  Human
8      Dog Animal
9    Tiger Animal
> (col.anm <- colSums(table(anm)))
Animal   Bird  Human 
     6      2      1 
> table(anm)
          B
A          Animal Bird Human
  Crow          0    1     0
  Dog           1    0     0
  Elephant      1    0     0
  Horse         1    0     0
  Lion          1    0     0
  Man           0    0     1
  Parrot        0    1     0
  Tiger         2    0     0 # you can see how many times entry from A comes up

EDIT

To get the desired output format as noted in the comment, wrap your result in a data.frame.

> data.frame(col.anm)
       col.anm
Animal       6
Bird         2
Human        1

Upvotes: 3

Zsolt Botykai

Reputation: 51633

You can do the first easily with awk:

awk '{ myarray[$2]++ } END { for ( key in myarray ) { print key ": " myarray[key] } }' FILE

The second is a bit trickier... ( http://ideone.com/xdKcs )

awk '{ myarray[$2]++ ; myarray2[$2, $1]++ } 
     END { for ( key in myarray ) { print key ": " myarray[key] } 
           print
           print "Duplicates: "
           for (key in myarray2) { 
               split(key,sep,SUBSEP)
               if (myarray2[sep[1], sep[2]]>1)
                   { print sep[1] ": " sep[2] " " myarray2[sep[1], sep[2]]
     }}}' FILE

Upvotes: 1

Sacha Epskamp

Reputation: 47582

If your data is in R, you can use table() to get what you need. First some example data:

dat <- data.frame(A=c("tiger","parrot","lion","tiger"),B=c("animal","bird","animal","animal"))

Then we can get counts of B with:

table(dat$B)

and counts of co-occurance with:

table(dat)

To get the table you specified we can use the plyr package:

library("plyr")
tab <- ddply(dat,.(A,B),nrow)
tab[tab$V1>1,]
      A      B V1
3 tiger animal  2

Upvotes: 2

How can I count the number of entries of a column based on the distinct entries of another column

Answers (10)

Related Questions