ARJ
ARJ

Reputation: 2080

How to remove rows from a dataframe if 75 % of its column values is equal to 0

I have a data frame of 44 column and 60,000 rows. I wanted to remove those rows if it has 0 up to 75 % of columns. This 75% :For example in my case out of 44 columns its 33 columns. And so I tried the following function in R as,

filter <- apply(df, 1,function(x) any(x[1:33]!=0) && any(x[34:44]!=0) )
df = df[filter,]

It's perfectly looking for those columns I asked for. But the problem is my data frame has many rows with values like this,for certain rows there is zeros in alternate model ie, one column its numeric value then its zero and so on. This sometimes is more than 33 columns and the above function avoids those rows.

So far I tried in R , any solution which i can try in pandas are also would be also great ..I know when all values are not equal to zero in pandas

 df[(df != 0).all(1)]

Here is how my data frame looks like,

dim(df)
[1] 57905    44
head(df)

     ID Pe_1    Pe_2    Pe_3    Pe_4    Pe_5    Pe_6    Pe_7    Pe_8    Pe_9    Pe_10   Pe_11   Pe_12   Pe_13   Pe_14   Pe_15   Pe_16   Pe_17   Pe_18   Pe_19   Pe_20   Pe_21   Pe_22   Pe_23   Pe_24   Pe_25   Pe_26   Pe_27   Pe_28   Pe_29   Pe_30   Pe_31   Pe_32   Pe_33   Pe_34   Pe_35   Pe_36   Pe_37   Pe_38   Pe_39   Pe_40   Pe_41   Pe_42   Pe_43   Pe_44
ENSG1   0   0   1   0   0   2   2   1   0   0   0   1   0   3   3   0   1   0   2   0   2   3   1   2   0   2   0   0   0   0   0   2   0   0   0   0   2   0   0   2   0   3   1   3
ENSG2   274 293 300 273 229 124 427 291 274 561 128 506 342 540 376 422 411 190 723 224 303 316 766 697 251 167 271 361 325 133 215 274 217 366 227 579 337 254 570 188 143 363 250 359
ENSG3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG8   0   1   0   1   1   1   0   2   0   0   0   1   1   1   0   1   0   0   0   0   0   1   1   1   2   1   0   3   0   1   1   2   0   0   0   0   0   0   1   1   0   0   1   1
ENSG9   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
ENSG10  3   2   4   6   21  6   6   13  3   1   1   6   10  4   2   0   1   0   0   0   4   2   5   3   25  9   7   10  7   5   3   0   0   5   1   8   4   5   0   4   1   3   2   4
ENSG11  277 43  79  216 1170    174 213 1303    564 14  53  76  170 1016    32  19  69  69  50  21  75  31  560 86  2668    604 513 303 1378    109 219 172 10  1031    276 242 1587    217 76  43  450 81  502 99

Any suggetsions/help would be great

Upvotes: 0

Views: 4096

Answers (6)

ChristyCasey
ChristyCasey

Reputation: 446

R solution (hopefully)

I think I got this, skip down a bit for the piece you want. The code between here and there is so its all in context, and can be seen working in R.

MakeDF.R <- function(CustomVector,n){
  #just a function to make a sample df
  NewDF <- data.frame(matrix(nrow=n,ncol=length(CustomVector)))
  colnames(NewDF) <- CustomVector
  return(NewDF)
}

Make the data frame

DF<-MakeDF.R(c(1:44),10)
#sample df with 44 rows

Add some rows with zeros

DF[c(5,6,7),] <- c(1:44)*0
#add sample zero rows

Now for the bit you want.

RemoveRows <- lapply(1:nrow(DF),function(i){
  RemoveRow_i = 0
  if(
    length(which(DF[i,] == 0)) > (ncol(DF) * 0.75)  ){
    #above is the "more than 75% zero columns step
    RemoveRow_i = i #select this row as one to be removed
  }
  RemoveRow_i
  #return either the row number or zero
})

That gives us our list of rows to remove. Now we need to clean that list up (remove the zeroes)

RemoveRows = RemoveRows[RemoveRows > 0]
#Leaves no zeroes in the list

With the list cleaned up we can just remove the data

CleanedDF <- DF[-do.call(rbind,RemoveRows)[,1],]
#the do.call(rbind....) is returning a 1 column dataframe. 
#So the first column is the vector of rows we want to remove.
#the -c(row numbers) removes rows. df[-1,] would delete the first row
#df[-c(1,2),] would delete the first two rows

Upvotes: -3

Colonel Beauvel
Colonel Beauvel

Reputation: 31161

Seems you want to remove lines which have more than 75% of 0. Eg keep lines which have at least 25% of non zero-values.

In R:

df = data.frame(a=c(1,8,0), b=c(0,2,0), c=c(0,0,1), d=c(4,4,0))

df[rowMeans(df!=0)>0.25, ]  # or df[rowMeans(df==0)<0.75, ]
#  a b c d
#1 1 0 0 4
#2 8 2 0 4

And in Pandas:

df = pd.DataFrame({'a':[1,8,0],'b':[0,2,0],'c':[0,0,1], 'd':[4,4,0]})

# In [198]: df
# Out[198]:
#   a  b  c  d
#0  1  0  0  4
#1  8  2  0  4
#2  0  0  1  0

df[df.astype('bool').mean(axis=1)>=0.25] # or df[(~df.astype('bool')).mean(axis=1)<0.75]

#Out[199]:
#   a  b  c  d
#0  1  0  0  4
#1  8  2  0  4

Upvotes: 10

lmo
lmo

Reputation: 38500

Or analgously to @colonel-beauvel's use of rowMeans you can use rowSums.

df[rowSums(df[, -1] > 0) / (ncol(df)-1) >= 0.75,]
  • df[, -1] > 0 takes the data.frame, df except the ID column and returns a logical matrix whether each element.
  • rowSums adds all the rows together (it and rowMeans are super fast).
  • ncol returns the number of columns of df (from which I subtract 1).
  • The ratio of rowSums and ncol is compared to 0.75 which returns a logical.

This logical is used to subset the rows.

Upvotes: 2

Elliott Addi
Elliott Addi

Reputation: 380

Simple code, it should work:

for i in df:
    iLength = 0
    countZeros = 0

    for j in df:
        iLength += 1
        if i[j] == 0:
            countZeros += 1

    zeroRate = countZeros / iLength
    if zeroRate >= 0.75:
        #DeleteRow

Upvotes: 0

EdChum
EdChum

Reputation: 393943

Pandas method, here we compare the df against 0 and sum row-wise using axis=1, this will produce a Series with a count of the 0 values and we compare this against 75% of the row length and filter the df:

In [14]:
df[(df == 0).sum(axis=1) < df.shape[1] * 0.75]

Out[14]:
        ID  Pe_1  Pe_2  Pe_3  Pe_4  Pe_5  Pe_6  Pe_7  Pe_8  Pe_9  ...    \
0    ENSG1     0     0     1     0     0     2     2     1     0  ...     
1    ENSG2   274   293   300   273   229   124   427   291   274  ...     
7    ENSG8     0     1     0     1     1     1     0     2     0  ...     
9   ENSG10     3     2     4     6    21     6     6    13     3  ...     
10  ENSG11   277    43    79   216  1170   174   213  1303   564  ...     

    Pe_35  Pe_36  Pe_37  Pe_38  Pe_39  Pe_40  Pe_41  Pe_42  Pe_43  Pe_44  
0       0      0      2      0      0      2      0      3      1      3  
1     227    579    337    254    570    188    143    363    250    359  
7       0      0      0      0      1      1      0      0      1      1  
9       1      8      4      5      0      4      1      3      2      4  
10    276    242   1587    217     76     43    450     81    502     99  

[5 rows x 45 columns]

Upvotes: 4

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

try this (Pandas):

df[(df==0).sum(axis=1)/len(df.columns) <= 0.75]

Upvotes: 3

Related Questions