cheklapkok

Reputation: 439

How to match and replace strings in a data frame in R?

I have a main data frame that contains lots of websites I'm working with, and another data frame that contains a list of bad websites. I want to check whether any of the bad websites appear in my main data frame. Since I'm very new to this, I'm not sure how to match the bad websites and replace them with "www.badwebsite.com". Thanks.

Here is an example of the data frames:

site_list <- data.frame("host" = c("www.companya.com", "www.companyb.com", "www.malwaresite.com",
                                   "www.companyc.com", "www.companyd.com", "www.virussite.com",
                                   "www.companye.com", "www.companyf.com", "www.phishingsite.com"),
                        "URL" = c("www.companya.com/home", "www.companyb.com/home", "www.malwaresite.com/home",
                                  "www.companyc.com/home", "www.companyd.com/home", "www.virussite.com/home",
                                  "www.companye.com/home", "www.companyf.com/home", "www.phishingsite.com/home"))

bad_site_list <- data.frame("host" = c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com"))

I hope to achieve this result:

host                                  URL
www.companya.com               www.companya.com/home
www.companyb.com               www.companyb.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companyc.com               www.companyc.com/home
www.companyd.com               www.companyd.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companye.com               www.companye.com/home
www.companyf.com               www.companyf.com/home
www.badwebsite.com             www.badwebsite.com/home

Upvotes: 1

Views: 1262

Answers (3)

Buddy

Reputation: 11

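Load the stringr package with library(stringr).

Search for a string in a character vector (for example, one column of the data frame):

str_detect(dataframe_name$column_name, "string_you_are_searching_for")

Replace the first match in each element of the vector:

str_replace(dataframe_name$column_name, "old_string", "new_string")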
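Applied to the question's data, these calls work on one column at a time rather than on the whole data frame. The following is a minimal sketch, assuming the stringr package is installed and using a shortened copy of site_list. Wrapping the host in fixed() is my addition so that it is matched literally instead of as a regular expression:

```r
library(stringr)

# Shortened copy of the question's data (two rows only, for illustration)
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home"),
  stringsAsFactors = FALSE
)

bad_host <- "www.malwaresite.com"

# TRUE for every row whose host contains the bad host as a literal string
is_bad <- str_detect(site_list$host, fixed(bad_host))

# Replace the bad host in each column separately
site_list$host <- str_replace(site_list$host, fixed(bad_host), "www.badwebsite.com")
site_list$URL  <- str_replace(site_list$URL,  fixed(bad_host), "www.badwebsite.com")
```

With several bad hosts, the same two replacement lines could be run once per host in a loop.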

Upvotes: 0

Andrew

Reputation: 5138

Without regex you could do something like this:

# Converting factor columns to character
site_list[] <- lapply(site_list, as.character)
bad_site_list[] <- lapply(bad_site_list, as.character)

# If you want to replace all the bad sites with "www.badwebsite.com" you can:
site_list$URL[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com/home"
site_list$host[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com"

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
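The assignment above hard-codes "/home", which fits the example because every URL ends that way. If the paths differ, one possible variation keeps whatever follows the host. This is only a sketch: it assumes each URL starts with its own host value, and the "/login" path is a made-up value for illustration:

```r
# Shortened, hypothetical data: one bad URL ends in "/login" instead of "/home"
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/login", "www.virussite.com/home"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(
  host = c("www.malwaresite.com", "www.virussite.com"),
  stringsAsFactors = FALSE
)

# Compute the match once so both columns use the same rows
is_bad <- site_list$host %in% bad_site_list$host

# Part of each bad URL that follows the host (assumes the URL starts with the host)
rest <- substring(site_list$URL[is_bad], nchar(site_list$host[is_bad]) + 1)

site_list$URL[is_bad]  <- paste0("www.badwebsite.com", rest)
site_list$host[is_bad] <- "www.badwebsite.com"
```

Storing the logical index in is_bad also removes the dependency on updating URL before host.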

Using regex you could do something like this:

# Build one alternation pattern from all the bad hosts
bad_site_pattern <- paste(bad_site_list$host, collapse = "|")

# Then replace all instances in the dataframe using lapply
site_list[] <- lapply(site_list, gsub, pattern = bad_site_pattern, replacement = "www.badwebsite.com")

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
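One caveat with the regex version: the dots in the host names are regex metacharacters, so the pattern can match more than the literal hosts. A possible alternative, sketched here on a shortened copy of the data, is to loop over the bad hosts and pass fixed = TRUE to gsub so that each host is matched literally:

```r
# Shortened copy of the question's data
site_list <- data.frame(
  host = c("www.companya.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.virussite.com/home"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(
  host = c("www.malwaresite.com", "www.virussite.com"),
  stringsAsFactors = FALSE
)

# An unescaped "." matches any character, so the regex also hits this made-up string
regex_hit   <- grepl("www.virussite.com", "wwwXvirussite-com")
literal_hit <- grepl("www.virussite.com", "wwwXvirussite-com", fixed = TRUE)

# Replace one bad host at a time, treating each host as a literal string
for (bad in bad_site_list$host) {
  site_list[] <- lapply(site_list, gsub, pattern = bad,
                        replacement = "www.badwebsite.com", fixed = TRUE)
}
```

Here regex_hit is TRUE while literal_hit is FALSE, which is the difference fixed = TRUE makes.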

Upvotes: 1

sargg

Reputation: 333

For your simple example I would do it the following way, though it might not be optimal for more complex tables:

apply(site_list, 2, function(x) gsub(paste(bad_site_list$host, collapse = "|"), "www.badwebsite.com", x))

In apply, "2" means the function is applied to each column ("1" would apply it to each row).
The function looks for all the hosts in bad_site_list and replaces them with www.badwebsite.com (using gsub). Note that apply returns a character matrix rather than a data frame, and the result has to be assigned to keep it.
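Because apply works on a matrix, its result needs to be converted back if a data frame is wanted. A minimal sketch on a shortened copy of the data, where result and cleaned are names chosen here for illustration:

```r
# Shortened copy of the question's data
site_list <- data.frame(
  host = c("www.companya.com", "www.phishingsite.com"),
  URL  = c("www.companya.com/home", "www.phishingsite.com/home"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(
  host = c("www.malwaresite.com", "www.phishingsite.com"),
  stringsAsFactors = FALSE
)

bad_site_pattern <- paste(bad_site_list$host, collapse = "|")

# apply() coerces the data frame to a character matrix and returns a matrix
result <- apply(site_list, 2, function(x) gsub(bad_site_pattern, "www.badwebsite.com", x))

# Convert back to a data frame; the column names host and URL are carried over
cleaned <- as.data.frame(result, stringsAsFactors = FALSE)
```

The lapply form in the other answer avoids the matrix round trip, which matters more once the table has non-character columns.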

Upvotes: 1
