Reputation: 475
My data in a file is like below with multiple columns:
A B
Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
I want to find the number of entries in column A corresponding to distinct entries in column B. If possible in R or may be a perl script for this.
Output as:
Animal 5
Bird 2
Human 1
Moreover, if possible to find out if the entries in column A has been repeated for the distinct entries in column B like
A B
Tiger Animal
Tiger Animal
Upvotes: 1
Views: 762
Reputation: 66819
In case anyone else comes by here, here are a couple more approaches that work.
myout <- lapply(split(anm,list(anm$B)),function(x)
list(length(unique(x[,"A"])),x[duplicated(x),"A"])
)
unlist(sapply(myout,function(x)x[1])) # counts in each category
sapply(myout,function(x)x[-1]) # list of duplicated names
or....
library(data.table)
mydt <- data.table(anm,key="B")
mydt[,.N,by=key(mydt)]
mydt[,.N,by="B,A"][N>1]
where....
anm = read.table(textConnection(
"Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
Tiger Animal"))
names(anm) <- c("A","B")
EDIT: Edited in response to comment by Matthew Dowle (author of data.table).
Upvotes: 2
Reputation: 21
Not sure I get the full data structure in the file, but if you're on UNIX:
tr -s ' ' | sort -u | awk '{ print $2}' | sort | uniq -c
5 Animal
2 Bird
1 Human
The above works, even if I add this line: "Tiger Animal" at the end, because of the first sort -u.
The tr -s squeezes out multiple blank spaces (so the sort commands act as expected)
Upvotes: 2
Reputation: 29854
In Perl (strict
and warnings
implied.)
my ( %uniq, %count_for );
# here $fh = some input source
while ( <$fh> ) {
s/^\s+//; # trim left
s/\s*$//; # trim right (and chomp)
# This split allows for spaces between words in a single column
# allows also for tab-delimited record
my @cols = split /(?:\t|\s{2,})/;
# Normalize the text and test for uniqueness:
#
# By these manipulations:
# Tiger Animal
# matches
# Tiger Animal
# for any column irregularities
next if $uniq{join('-',@cols)};
# count occurrence.
$count_for{$cols[1]}++;
}
Upvotes: 1
Reputation: 173627
If you're more comfortable with SQL, here's a very short solution using the sqldf
package in R:
anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
library(sqldf)
sqldf("select B,count(distinct A) tot from anm group by B")
sqldf("select B,A,count(*) num from anm group by B,A HAVING num > 1")
Upvotes: 1
Reputation: 575
#!/usr/bin/env perl
use strict;
use warnings;
use File::Slurp qw(slurp);
exit unless $ARGV[0];
my @data = slurp($ARGV[0]);
my (%h);
for (@data) {
chomp;
map { next if /^(A|B)$/; $h{$_}++ } split ' ', $_;
}
map { print $_, ": ", $h{$_}, "\n" } keys %h;
usage:
$ perl script.pl columns.txt
Upvotes: 0
Reputation: 121097
tapply
from base R will solve this nicely.
with(anm, tapply(A, B, function(x) length(unique(x))))
Upvotes: 5
Reputation: 55695
Here is an approach using the plyr
package in R.
mydf = read.table(textConnection(
"Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
Tiger Animal"))
library(plyr)
ddply(mydf, .(V2), summarize, V3 = length(V1))
V2 V3
1 Animal 6
2 Bird 2
3 Human 1
ddply(mydf, .(V2, V1), summarize, V3 = length(V1))
V2 V1 V3
1 Animal Dog 1
2 Animal Elephant 1
3 Animal Horse 1
4 Animal Lion 1
5 Animal Tiger 2
6 Bird Crow 1
7 Bird Parrot 1
8 Human Man 1
EDIT. Adds the names of animals in each category
ddply(mydf, .(V2), summarize,
V3 = length(V1),
V4 = do.call("paste", as.list(unique(V1))))
V2 V3 V4
1 Animal 6 Tiger Lion Elephant Horse Dog
2 Bird 2 Parrot Crow
3 Human 1 Man
Upvotes: 1
Reputation: 70643
This is a solution done in R. Is this what you were looking for?
> anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
+ B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
> anm
A B
1 Tiger Animal
2 Parrot Bird
3 Lion Animal
4 Elephant Animal
5 Crow Bird
6 Horse Animal
7 Man Human
8 Dog Animal
9 Tiger Animal
> (col.anm <- colSums(table(anm)))
Animal Bird Human
6 2 1
> table(anm)
B
A Animal Bird Human
Crow 0 1 0
Dog 1 0 0
Elephant 1 0 0
Horse 1 0 0
Lion 1 0 0
Man 0 0 1
Parrot 0 1 0
Tiger 2 0 0 # you can see how many times entry from A comes up
EDIT
To get the desired output format as noted in the comment, wrap your result in a data.frame
.
> data.frame(col.anm)
col.anm
Animal 6
Bird 2
Human 1
Upvotes: 3
Reputation: 51633
You can do the first easily with awk
:
awk '{ myarray[$2]++ } END { for ( key in myarray ) { print key ": " myarray[key] } }' FILE
The second is a bit trickier... ( http://ideone.com/xdKcs )
awk '{ myarray[$2]++ ; myarray2[$2, $1]++ }
END { for ( key in myarray ) { print key ": " myarray[key] }
print
print "Duplicates: "
for (key in myarray2) {
split(key,sep,SUBSEP)
if (myarray2[sep[1], sep[2]]>1)
{ print sep[1] ": " sep[2] " " myarray2[sep[1], sep[2]]
}}}' FILE
Upvotes: 1
Reputation: 47582
If your data is in R, you can use table()
to get what you need. First some example data:
dat <- data.frame(A=c("tiger","parrot","lion","tiger"),B=c("animal","bird","animal","animal"))
Then we can get counts of B
with:
table(dat$B)
and counts of co-occurance with:
table(dat)
To get the table you specified we can use the plyr
package:
library("plyr")
tab <- ddply(dat,.(A,B),nrow)
tab[tab$V1>1,]
A B V1
3 tiger animal 2
Upvotes: 2