Reputation: 133
I have seen lots of posts on how to remove rows if user specified columns contain a certain string.
I want to do the reverse and generalise it. I want to remove every column if any row in that column contains a certain string. (To compare with Excel, I would find all cells containing a given string and then delete every column.)
How can I do this? I was thinking with dplyr
and filter
, but I have to specify columns I think, or at least the way I would know how to approach it. But I have 300 odd columns and almost 4000 rows.
EDIT: Here is a sample of my dataframe.
# A tibble: 6 x 310
ISIN AU000KFWHAC9 AU3CB0243657 AU3CB0256162 AU3CB0260321 AU3CB0265239 AU3CB0283190 AU3SG0001928 AU3SG0002371
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Timestamp MID_PRICE Mid Price Cl~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~
2 41275 Invalid RIC. NA NA Invalid RIC. NA Invalid RIC. NA NA
3 41276 NA NA NA NA NA NA NA NA
4 41277 NA NA NA NA 3 NA NA NA
5 41278 NA NA NA NA NA NA NA NA
6 41279 5 NA 4 NA NA NA NA NA
So as you can see, the dataframe is full of lots of NA's. I am unsure if this will affect some functions' ability.
Upvotes: 3
Views: 2053
Reputation: 39154
A solution using dplyr
. We can use select
and where
to apply a function to check if a column contains a certain string or not. dat
is from Andre Wildberg's answer.
library(dplyr)
dat2 <- dat %>%
select(where(function(x) all(!grepl("Invalid", x))))
dat2
# ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
# 1 Timestamp MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~
# 2 41275 <NA> <NA> <NA> <NA> <NA>
# 3 41276 <NA> <NA> <NA> <NA> <NA>
# 4 41277 <NA> <NA> 3 <NA> <NA>
# 5 41278 <NA> <NA> <NA> <NA> <NA>
# 6 41279 <NA> 4 <NA> <NA> <NA>
Upvotes: 1
Reputation: 19088
You can grep your search:
dat[,-grep("Invalid", dat)]
ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
1 Timestamp MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~
2 41275 <NA> <NA> <NA> <NA> <NA>
3 41276 <NA> <NA> <NA> <NA> <NA>
4 41277 <NA> <NA> 3 <NA> <NA>
5 41278 <NA> <NA> <NA> <NA> <NA>
6 41279 <NA> 4 <NA> <NA> <NA>
Data:
dat <- structure(list(ISIN = c("Timestamp", "41275", "41276", "41277",
"41278", "41279"), AU000KFWHAC9 = c("MID_PRICE", "Invalid_RIC.",
NA, NA, NA, "5"), AU3CB0243657 = c("MidPriceC~", NA, NA, NA,
NA, NA), AU3CB0256162 = c("MidPriceC~", NA, NA, NA, NA, "4"),
AU3CB0260321 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA,
NA), AU3CB0265239 = c("MidPriceC~", NA, NA, "3", NA, NA),
AU3CB0283190 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA,
NA), AU3SG0001928 = c("MidPriceC~", NA, NA, NA, NA, NA),
AU3SG0002371 = c("MidPriceC~", NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 1
Reputation: 71560
With a dataframe of:
> df <- data.frame(a=c("a", "b", "c"), b=c("bad string", "d", "e"), c=c("f", "g", "h"))
> df
a b c
1 a bad string f
2 b d g
3 c e h
>
Use colSums
:
> df[, !colSums(df == "bad string")]
a c
1 a f
2 b g
3 c h
>
Only keep columns where colSums
is 0
.
Upvotes: 1