Reputation: 21
My data looks like this:
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 513 C A 0/1:23,12:35:99:262,0,691 19,10:-40.6,-28.8,-78.7:-11.9:6.0
2 3
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1095 G A 0/1:35,12:47:99:328,0,1157 30,11:-61.1,-63.4,-134.7:2.2:12.0
3 4
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1217 G A 0/1:22,12:34:99:314,0,730 20,10:-68.4,-54.2,-109.0:-14.2:6.0
4 5
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1219 A C 0/1:22,12:34:99:308,0,715 20,10:-69.9,-54.2,-107.7:-15.7:6.0
5 6
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1721 G C 0/1:15,6:21:99:141,0,464 7,5:-21.8,-18.5,-30.1:-3.3:4.0
6 8
BLOCK2 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 2171 G C 0/1:14,13:27:99:388,0,369 9,5:-35.3,-26.5,-46.7:-8.7:3.0
7 10
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3661 G A 0/1:148,55:203:99:1070,0,4008 107,39:-163.0,-160.9,-438.4:-2.1:33.0
8 11
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3700 C T 0/1:124,124:249:99:3271,0,3667 117,107:-510.2,-163.3,-565.9:-346.9:4.0
9 12
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3754 T C 0/1:140,107:248:99:2786,0,3946 133,101:-436.9,-85.9,-558.8:-351.0:2.0
10
all I want is an R command that will allow me to calculate various properties of, say, columns 2 and 3 (which are columns of 0's and 1's) and do this FOR EACH block (column1). So, for example, BLOCK1 below has 4 lines in it, BLOCK2 has one line, etc. A basic question I want to know is: for each block, how many zeros are there in column 2, and how many in column 3?
Can anyone help? I tried using various forms of aggregate(), but the problem is that the FUN argument doesn't allow me to do the above. Or maybe it does, but I can't figure it out...
Upvotes: 0
Views: 83
Reputation: 21
So the above solutions don't really give what I wanted, but I figured out how to do it in bash. I wanted to parse the file to keep the blocks in which, for a given block, all col2=0 and all col3=1 OR all col2=1 and all col3=0. And I also wanted a count of the number of such blocks. The following command worked:
sed ':a;N;$!ba;s/\nBLOCK/\n\nBLOCK/g;s/$/\n/' input.file |awk 'BEGIN { fg=0; num=0; block=""; type="";} { if (/^$/ && fg == 1) { print block; num++; fg=0; block=""; } else if (/^$/) {block=""; type=""; fg=0;} else if (/^BLOCK/) { block = block "\n" $; total++; } else if (fg == 0) { type=$2 " " $3; fg=1 ; block = block "\n" $; } else if (fg == 1 && type == $2 " " $3) { block = block "\n" $_ ;} else if (fg == 1 && type != $2 " " $3) { fg=2; block=""} } END {print "perfect:\t",num,"\tTotal:\t",total}' >output.file
Upvotes: 0
Reputation: 887058
You could use aggregate
from base R
aggregate(!df[,c("Col2", "Col3")], list(Col1=df[,"Col1"]), FUN=sum)
# Col1 Col2 Col3
#1 BLOCK1 5 0
#2 BLOCK2 1 0
#3 BLOCK3 0 3
Or using data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(!x)), by=Col1]
# Col1 Col2 Col3
#1: BLOCK1 5 0
#2: BLOCK2 1 0
#3: BLOCK3 0 3
For the combinations, perhaps you need
as.data.frame.matrix(table(df[,1],as.character(interaction(df[,-1]))))
# 0.1 1.0
#BLOCK1 5 0
#BLOCK2 1 0
#BLOCK3 0 3
If you want to keep only cases in which the blocks have col2=0 AND col3=1 OR col2=1 AND col3=0, for ALL entries of a given block
:
Changing the example dataset to show some variation (in the present dataset, the condition will select all the rows)
df$Col3[4] <- 0
df$Col2[8]<-0
df$Col3[8]<-1
df[with(df, ave(Col2==0 & Col3==1|Col2==1 & Col3==0, Col1, FUN=all)),]
# Col1 Col2 Col3
#6 BLOCK2 0 1
#7 BLOCK3 1 0
#8 BLOCK3 0 1
#9 BLOCK3 1 0
df <- structure(list(Col1 = c("BLOCK1", "BLOCK1", "BLOCK1", "BLOCK1",
"BLOCK1", "BLOCK2", "BLOCK3", "BLOCK3", "BLOCK3"), Col2 = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), Col3 = c(1L, 1L, 1L, 1L, 1L,
1L, 0L, 0L, 0L)), .Names = c("Col1", "Col2", "Col3"), class = "data.frame", row.names = c(NA,
-9L))
Upvotes: 2
Reputation: 56149
Using dplyr
:
require(dplyr)
#dummy data
d <- read.table(text="
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 513 C A 0/1:23,12:35:99:262,0,691 19,10:-40.6,-28.8,-78.7:-11.9:6.0
2 3
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1095 G A 0/1:35,12:47:99:328,0,1157 30,11:-61.1,-63.4,-134.7:2.2:12.0
3 4
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1217 G A 0/1:22,12:34:99:314,0,730 20,10:-68.4,-54.2,-109.0:-14.2:6.0
4 5
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1219 A C 0/1:22,12:34:99:308,0,715 20,10:-69.9,-54.2,-107.7:-15.7:6.0
5 6
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1721 G C 0/1:15,6:21:99:141,0,464 7,5:-21.8,-18.5,-30.1:-3.3:4.0
6 8
BLOCK2 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 2171 G C 0/1:14,13:27:99:388,0,369 9,5:-35.3,-26.5,-46.7:-8.7:3.0
7 10
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3661 G A 0/1:148,55:203:99:1070,0,4008 107,39:-163.0,-160.9,-438.4:-2.1:33.0
8 11
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3700 C T 0/1:124,124:249:99:3271,0,3667 117,107:-510.2,-163.3,-565.9:-346.9:4.0
9 12
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3754 T C 0/1:140,107:248:99:2786,0,3946 133,101:-436.9,-85.9,-558.8:-351.0:2.0
10 ",fill = TRUE)
#keep only rows with BLOCK names and count zeros in column 2
d %>% filter(grepl("BLOCK",V1)) %>%
group_by(BLOCK=V1) %>%
summarise(ZeroCountInCol2=sum(V2==0))
# BLOCK ZeroCountInCol2
# 1 BLOCK1 5
# 2 BLOCK2 1
# 3 BLOCK3 0
Upvotes: 1
Reputation: 24074
If the name of your data.frame is dataframe :
sapply(unique(dataframe[,1]),function(block){list(nb0_col1=sum(dataframe[dataframe[,1]==block,2]==0,na.rm=T),nb0_col2=sum(dataframe[dataframe[,1]==block,3]==0,na.rm=T))})
Upvotes: 0