Reputation: 439
I have a main data frame that contains many websites I'm working with, and another data frame that contains a list of bad websites. I want to check whether any of the bad websites appear in my main data frame. Since I'm very new to this, I'm not sure how to match the bad websites and replace them with "www.badwebsite.com". Thanks.
Here is an example of the data frames:
site_list <- data.frame("host" = c("www.companya.com", "www.companyb.com", "www.malwaresite.com",
"www.companyc.com", "www.companyd.com", "www.virussite.com",
"www.companye.com", "www.companyf.com", "www.phishingsite.com"),
"URL" = c("www.companya.com/home", "www.companyb.com/home", "www.malwaresite.com/home",
"www.companyc.com/home", "www.companyd.com/home", "www.virussite.com/home",
"www.companye.com/home", "www.companyf.com/home", "www.phishingsite.com/home"))
bad_site_list <- data.frame("host" = c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com"))
I hope to achieve this result:
host URL
www.companya.com www.companya.com/home
www.companyb.com www.companyb.com/home
www.badwebsite.com www.badwebsite.com/home
www.companyc.com www.companyc.com/home
www.companyd.com www.companyd.com/home
www.badwebsite.com www.badwebsite.com/home
www.companye.com www.companye.com/home
www.companyf.com www.companyf.com/home
www.badwebsite.com www.badwebsite.com/home
Upvotes: 1
Views: 1262
Reputation: 11
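Load the stringr package with library(stringr). Its functions work on character vectors, so pass a column rather than the whole data frame:
str_detect(dataframe_name$column_name, "string_you_are_searching_for")
str_replace(dataframe_name$column_name, "old_string", "new_string")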
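Applied to the question's data, this could look roughly like the sketch below. It assumes stringr is installed and uses a shortened copy of the question's data frame; the names bad_hosts, bad_pattern, and is_bad are illustrative only.

```r
library(stringr)

# Small stand-in for the question's data (character columns)
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home", "www.virussite.com/home"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com")

# One regex alternation; note the unescaped dots match any character
bad_pattern <- str_c(bad_hosts, collapse = "|")

# Logical vector marking rows whose host matches a bad site
is_bad <- str_detect(site_list$host, bad_pattern)

# Replace the matched host in both columns, one vector at a time
site_list$host <- str_replace(site_list$host, bad_pattern, "www.badwebsite.com")
site_list$URL  <- str_replace(site_list$URL,  bad_pattern, "www.badwebsite.com")
```

str_detect() is handy when you only need to know which rows are affected; str_replace() does the actual substitution.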
Upvotes: 0
Reputation: 5138
Without regex you could do something like this:
# Converting factor columns to character
site_list[] <- lapply(site_list, as.character)
bad_site_list[] <- lapply(bad_site_list, as.character)
# If you want to replace all the bad sites with "www.badwebsite.com" you can:
site_list$URL[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com/home"
site_list$host[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com"
site_list
host URL
1 www.companya.com www.companya.com/home
2 www.companyb.com www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4 www.companyc.com www.companyc.com/home
5 www.companyd.com www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7 www.companye.com www.companye.com/home
8 www.companyf.com www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
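The replacement above hard-codes "/home" for every bad URL. If the real URLs have different paths, a variation along these lines keeps whatever follows the host. This is only a sketch: it assumes each URL starts with its host, and the sample rows and the names is_bad and rest are made up for illustration.

```r
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/login?id=1", "www.virussite.com"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Rows with a bad host whose URL really begins with that host
is_bad <- site_list$host %in% bad_hosts & startsWith(site_list$URL, site_list$host)

# Keep whatever follows the host in the URL, then swap the host itself
rest <- substring(site_list$URL[is_bad], nchar(site_list$host[is_bad]) + 1)
site_list$URL[is_bad]  <- paste0("www.badwebsite.com", rest)
site_list$host[is_bad] <- "www.badwebsite.com"
```

As in the original code, the URL column is updated before the host column, because the host values are still needed to locate the bad rows.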
Using regex you could do something like this:
# Using regex you could create a pattern
bad_site_pattern <- paste(bad_site_list$host, collapse = "|")
# Then replace all instances in the dataframe using lapply
site_list[] <- lapply(site_list, gsub, pattern = bad_site_pattern, replacement = "www.badwebsite.com")
site_list
host URL
1 www.companya.com www.companya.com/home
2 www.companyb.com www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4 www.companyc.com www.companyc.com/home
5 www.companyd.com www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7 www.companye.com www.companye.com/home
8 www.companyf.com www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
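One caveat with the regex version: a dot in a pattern matches any character, so the pattern is slightly looser than the literal host names. If that matters, the dots can be escaped before building the pattern. A minimal sketch, where lookalike is a hypothetical host invented to show the difference:

```r
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Escape each literal dot so it no longer acts as a regex wildcard
escaped_hosts <- gsub(".", "\\.", bad_hosts, fixed = TRUE)
strict_pattern <- paste(escaped_hosts, collapse = "|")
loose_pattern  <- paste(bad_hosts, collapse = "|")

# Hypothetical look-alike host with no dot after "www"
lookalike <- "wwwXvirussite.com"

grepl(loose_pattern, lookalike)   # TRUE: "." matched the "X"
grepl(strict_pattern, lookalike)  # FALSE: a literal dot is required
```

Only dots are escaped here, which is enough for plain host names; other regex metacharacters would need the same treatment.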
Upvotes: 1
Reputation: 333
I would do it the following way for your simple example, though it might not be optimal for more complex tables:
apply(site_list, 2, function(x)gsub(paste(bad_site_list$host, collapse="|"), "www.badwebsite.com", x))
In apply, the second argument "2" means the function is applied to each column ("1" would apply it to each row).
The function builds one pattern from all the hosts in bad_site_list and uses gsub to replace every match with www.badwebsite.com.
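Keep in mind that apply() coerces the data frame to a matrix and returns a character matrix, so the result has to be converted back if you want to keep working with a data frame. A short sketch on a cut-down version of the data; cleaned_matrix and cleaned are illustrative names:

```r
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(host = "www.malwaresite.com", stringsAsFactors = FALSE)

bad_pattern <- paste(bad_site_list$host, collapse = "|")

# apply() coerces the data frame to a matrix and returns a matrix
cleaned_matrix <- apply(site_list, 2, function(x) gsub(bad_pattern, "www.badwebsite.com", x))

# Convert back so the result can be used like the original data frame
cleaned <- as.data.frame(cleaned_matrix, stringsAsFactors = FALSE)
```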
Upvotes: 1