Reputation: 2080
I have a data frame with 44 columns and 60,000 rows. I want to remove a row if it has zeros in 75% or more of its columns. In my case 75% of 44 columns is 33 columns. So I tried the following in R:
filter <- apply(df, 1,function(x) any(x[1:33]!=0) && any(x[34:44]!=0) )
df = df[filter,]
It correctly checks the column ranges I asked for. The problem is that many rows have zeros scattered in an alternating pattern, i.e. one column has a numeric value, the next is zero, and so on. Such a row can have zeros in more than 33 columns in total, yet the function above still fails to remove it.
So far I have tried this in R; any solution I can try in pandas would also be great. I know how to keep rows where all values are non-zero in pandas:
df[(df != 0).all(1)]
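For example, on a hypothetical toy frame (not my real data) this keeps only the rows with no zeros at all:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 0, 2], 'b': [3, 4, 5]})
# (toy != 0).all(1) is True only for rows where every value is non-zero
all_nonzero = toy[(toy != 0).all(1)]
```

But I need the 75% threshold version, not the all-or-nothing one.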
Here is what my data frame looks like:
dim(df)
[1] 57905 44
head(df)
ID Pe_1 Pe_2 Pe_3 Pe_4 Pe_5 Pe_6 Pe_7 Pe_8 Pe_9 Pe_10 Pe_11 Pe_12 Pe_13 Pe_14 Pe_15 Pe_16 Pe_17 Pe_18 Pe_19 Pe_20 Pe_21 Pe_22 Pe_23 Pe_24 Pe_25 Pe_26 Pe_27 Pe_28 Pe_29 Pe_30 Pe_31 Pe_32 Pe_33 Pe_34 Pe_35 Pe_36 Pe_37 Pe_38 Pe_39 Pe_40 Pe_41 Pe_42 Pe_43 Pe_44
ENSG1 0 0 1 0 0 2 2 1 0 0 0 1 0 3 3 0 1 0 2 0 2 3 1 2 0 2 0 0 0 0 0 2 0 0 0 0 2 0 0 2 0 3 1 3
ENSG2 274 293 300 273 229 124 427 291 274 561 128 506 342 540 376 422 411 190 723 224 303 316 766 697 251 167 271 361 325 133 215 274 217 366 227 579 337 254 570 188 143 363 250 359
ENSG3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG8 0 1 0 1 1 1 0 2 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 2 1 0 3 0 1 1 2 0 0 0 0 0 0 1 1 0 0 1 1
ENSG9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG10 3 2 4 6 21 6 6 13 3 1 1 6 10 4 2 0 1 0 0 0 4 2 5 3 25 9 7 10 7 5 3 0 0 5 1 8 4 5 0 4 1 3 2 4
ENSG11 277 43 79 216 1170 174 213 1303 564 14 53 76 170 1016 32 19 69 69 50 21 75 31 560 86 2668 604 513 303 1378 109 219 172 10 1031 276 242 1587 217 76 43 450 81 502 99
Any suggestions/help would be great.
Upvotes: 0
Views: 4096
Reputation: 446
R solution (hopefully)
I think I got this; skip down a bit for the piece you want. The code between here and there is just so it's all in context and can be seen working in R.
MakeDF.R <- function(CustomVector,n){
#just a function to make a sample df
NewDF <- data.frame(matrix(nrow=n,ncol=length(CustomVector)))
colnames(NewDF) <- CustomVector
return(NewDF)
}
Make the data frame
DF<-MakeDF.R(c(1:44),10)
#sample df with 44 columns and 10 rows
Add some rows with zeros
DF[c(5,6,7),] <- c(1:44)*0
#add sample zero rows
Now for the bit you want.
RemoveRows <- lapply(1:nrow(DF),function(i){
RemoveRow_i = 0
if(
length(which(DF[i,] == 0)) > (ncol(DF) * 0.75) ){
#above is the "more than 75% zero columns" step
RemoveRow_i = i #select this row as one to be removed
}
RemoveRow_i
#return either the row number or zero
})
That gives us our list of rows to remove. Now we need to clean that list up (remove the zeroes)
RemoveRows = RemoveRows[RemoveRows > 0]
#Leaves no zeroes in the list
With the list cleaned up we can just remove the data
CleanedDF <- DF[-do.call(rbind,RemoveRows)[,1],]
#do.call(rbind, ...) returns a one-column matrix, so [,1] is the
#vector of row numbers we want to remove (unlist(RemoveRows) works too).
#Negative indices drop rows: df[-1,] would delete the first row,
#df[-c(1,2),] would delete the first two rows.
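For anyone who wants the same collect-then-drop idea in pandas, here is a minimal sketch (the toy frame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0], 'c': [2, 0], 'd': [1, 0]})
# collect the indices of rows where more than 75% of the columns are zero
drop_idx = [i for i, row in df.iterrows()
            if (row == 0).sum() > 0.75 * df.shape[1]]
cleaned = df.drop(index=drop_idx)
```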
Upvotes: -3
Reputation: 31161
It seems you want to remove rows which have more than 75% zeros, i.e. keep rows which have more than 25% non-zero values. In R:
df = data.frame(a=c(1,8,0), b=c(0,2,0), c=c(0,0,1), d=c(4,4,0))
df[rowMeans(df!=0)>0.25, ] # or df[rowMeans(df==0)<0.75, ]
# a b c d
#1 1 0 0 4
#2 8 2 0 4
And in pandas:
df = pd.DataFrame({'a':[1,8,0],'b':[0,2,0],'c':[0,0,1], 'd':[4,4,0]})
# In [198]: df
# Out[198]:
# a b c d
#0 1 0 0 4
#1 8 2 0 4
#2 0 0 1 0
df[df.astype('bool').mean(axis=1)>0.25] # or df[(~df.astype('bool')).mean(axis=1)<0.75]
#Out[199]:
# a b c d
#0 1 0 0 4
#1 8 2 0 4
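The same filter also works without the bool cast, comparing against 0 directly; a sketch on the toy frame above, mirroring R's rowMeans(df != 0):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 8, 0], 'b': [0, 2, 0],
                   'c': [0, 0, 1], 'd': [4, 4, 0]})
# fraction of non-zero entries per row; keep rows strictly above 25%
kept = df[(df != 0).mean(axis=1) > 0.25]
```

Note the strict comparison: the last row is exactly 25% non-zero and is removed.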
Upvotes: 10
Reputation: 38500
Or, analogously to @colonel-beauvel's use of rowMeans, you can use rowSums.
df[rowSums(df[, -1] == 0) / (ncol(df) - 1) < 0.75, ]
rowSums(df[, -1] == 0) counts the zero entries in each row (rowSums and rowMeans are super fast). ncol returns the number of columns of df, from which 1 is subtracted because the ID column is excluded. The resulting fraction of zeros per row is compared to 0.75, which returns a logical, and that logical is used to subset the rows.
Upvotes: 2
Reputation: 380
Simple code, counting the zeros in each row and keeping the rows whose zero rate is below 75%:
keep_rows = []
for idx, row in df.iterrows():
    count_zeros = (row == 0).sum()
    zero_rate = count_zeros / len(row)
    if zero_rate < 0.75:
        keep_rows.append(idx)
df = df.loc[keep_rows]
Upvotes: 0
Reputation: 393943
Pandas method: here we compare the df against 0 and sum row-wise using axis=1. This produces a Series with a count of the 0 values in each row; we compare that count against 75% of the row length and filter the df:
In [14]:
df[(df == 0).sum(axis=1) < df.shape[1] * 0.75]
Out[14]:
ID Pe_1 Pe_2 Pe_3 Pe_4 Pe_5 Pe_6 Pe_7 Pe_8 Pe_9 ... \
0 ENSG1 0 0 1 0 0 2 2 1 0 ...
1 ENSG2 274 293 300 273 229 124 427 291 274 ...
7 ENSG8 0 1 0 1 1 1 0 2 0 ...
9 ENSG10 3 2 4 6 21 6 6 13 3 ...
10 ENSG11 277 43 79 216 1170 174 213 1303 564 ...
Pe_35 Pe_36 Pe_37 Pe_38 Pe_39 Pe_40 Pe_41 Pe_42 Pe_43 Pe_44
0 0 0 2 0 0 2 0 3 1 3
1 227 579 337 254 570 188 143 363 250 359
7 0 0 0 0 1 1 0 0 1 1
9 1 8 4 5 0 4 1 3 2 4
10 276 242 1587 217 76 43 450 81 502 99
[5 rows x 45 columns]
Upvotes: 4
Reputation: 210832
Try this (pandas):
df[(df==0).sum(axis=1)/len(df.columns) <= 0.75]
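Note that <= 0.75 keeps a row that is exactly 75% zeros; a quick boundary check on a hypothetical 4-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0], 'c': [0, 0], 'd': [0, 0]})
# row 0 is exactly 75% zeros (kept), row 1 is all zeros (dropped)
kept = df[(df == 0).sum(axis=1) / len(df.columns) <= 0.75]
```

Use a strict < if rows at exactly 75% should be removed as well.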
Upvotes: 3