Reputation: 133

How to remove column(s) if a row contains a value?

I have seen lots of posts on how to remove rows if user specified columns contain a certain string.

I want to do the reverse and generalise it. I want to remove every column if any row in that column contains a certain string. (To compare with Excel, I would find all cells containing a given string and then delete every column.)

How can I do this? I was thinking with dplyr and filter, but I have to specify columns I think, or at least the way I would know how to approach it. But I have 300 odd columns and almost 4000 rows.

EDIT: Here is a sample of my dataframe.

# A tibble: 6 x 310
  ISIN      AU000KFWHAC9 AU3CB0243657  AU3CB0256162 AU3CB0260321 AU3CB0265239 AU3CB0283190 AU3SG0001928 AU3SG0002371
  <chr>     <chr>        <chr>         <chr>        <chr>        <chr>        <chr>        <chr>        <chr>       
1 Timestamp MID_PRICE    Mid Price Cl~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~
2 41275     Invalid RIC. NA            NA           Invalid RIC. NA           Invalid RIC. NA           NA          
3 41276     NA           NA            NA           NA           NA           NA           NA           NA          
4 41277     NA           NA            NA           NA           3            NA           NA           NA          
5 41278     NA           NA            NA           NA           NA           NA           NA           NA          
6 41279     5            NA            4            NA           NA           NA           NA           NA

So as you can see, the dataframe is full of lots of NA's. I am unsure if this will affect some functions' ability.

Upvotes: 3

Answers (3)

www

Reputation: 39154

A solution using dplyr. We can use select and where to apply a function to check if a column contains a certain string or not. dat is from Andre Wildberg's answer.

library(dplyr)

dat2 <- dat %>%
  select(where(function(x) all(!grepl("Invalid", x))))
dat2
  #        ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
  # 1 Timestamp   MidPriceC~   MidPriceC~   MidPriceC~   MidPriceC~   MidPriceC~
  # 2     41275         <NA>         <NA>         <NA>         <NA>         <NA>
  # 3     41276         <NA>         <NA>         <NA>         <NA>         <NA>
  # 4     41277         <NA>         <NA>            3         <NA>         <NA>
  # 5     41278         <NA>         <NA>         <NA>         <NA>         <NA>
  # 6     41279         <NA>            4         <NA>         <NA>         <NA>

Upvotes: 1

Andre Wildberg

Reputation: 19088

You can grep your search:

dat[,-grep("Invalid", dat)]

       ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
1 Timestamp   MidPriceC~   MidPriceC~   MidPriceC~   MidPriceC~   MidPriceC~
2     41275         <NA>         <NA>         <NA>         <NA>         <NA>
3     41276         <NA>         <NA>         <NA>         <NA>         <NA>
4     41277         <NA>         <NA>            3         <NA>         <NA>
5     41278         <NA>         <NA>         <NA>         <NA>         <NA>
6     41279         <NA>            4         <NA>         <NA>         <NA>

Data:

dat <- structure(list(ISIN = c("Timestamp", "41275", "41276", "41277", 
"41278", "41279"), AU000KFWHAC9 = c("MID_PRICE", "Invalid_RIC.", 
NA, NA, NA, "5"), AU3CB0243657 = c("MidPriceC~", NA, NA, NA, 
NA, NA), AU3CB0256162 = c("MidPriceC~", NA, NA, NA, NA, "4"), 
    AU3CB0260321 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA, 
    NA), AU3CB0265239 = c("MidPriceC~", NA, NA, "3", NA, NA), 
    AU3CB0283190 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA, 
    NA), AU3SG0001928 = c("MidPriceC~", NA, NA, NA, NA, NA), 
    AU3SG0002371 = c("MidPriceC~", NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 1

U13-Forward

Reputation: 71560

With a dataframe of:

> df <- data.frame(a=c("a", "b", "c"), b=c("bad string", "d", "e"), c=c("f", "g", "h"))
> df
  a          b c
1 a bad string f
2 b          d g
3 c          e h
>

Use colSums:

> df[, !colSums(df == "bad string")]
  a c
1 a f
2 b g
3 c h
>

Only keep columns where colSums is 0.

Upvotes: 1

How to remove column(s) if a row contains a value?

Answers (3)

Related Questions