Rene Chan

Reputation: 985

Compare strings in R and create data frames

I have a data frame with email addresses and website domains. I would like to separate the email addresses whose domain matches the website from those whose domain does not.

Say I have a df:

email <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]' , '[email protected]')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- as.data.frame(cbind(email,website))

which results in:

> df
        email            website
1 [email protected] http://www.kjf.com
2 [email protected] http://www.kjf.com
3 [email protected] http://www.kjf.com
4 [email protected] http://www.kjf.com
5 [email protected] http://www.asdf.com

I would like to dynamically create two data frames: one where the email's domain matches the website domain, like:

> df2
        email            website
1 [email protected] http://www.kjf.com
2 [email protected] http://www.kjf.com
3 [email protected] http://www.kjf.com
4 [email protected] http://www.asdf.com

and one that keeps the non-matching rows, like:

> df3
        email            website
1 [email protected] http://www.kjf.com

I think I should use a regex, but I am not sure. Does anybody see how this can be done? Thank you.

Upvotes: 1

Views: 590

Answers (2)

r2evans

Reputation: 160792

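This comparison returns a logical vector that is TRUE for rows where the email domain differs from the website domain; you can use it to filter the rows: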

gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)
# [1] FALSE  TRUE FALSE FALSE FALSE

Breakdown:

gsub('.*@', '', df$email)
###   .*   zero or more characters, followed by
###     @  a literal at sign
# [1] "kjf.com"  "def.com"  "kjf.com"  "kjf.com"  "asdf.com"

and for the url:

gsub('https?://(www\\.)?', '', df$website)
###   http                literal string 'http'
###       s?              with exactly zero or one instance of 's'
###         ://           literal string '://'
###            (www\\.)?  with exactly zero or one instance of 'www.'
# [1] "kjf.com"  "kjf.com"  "kjf.com"  "kjf.com"  "asdf.com"
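To get the two data frames the question asks for, the logical vector can be used to index the rows. The following is a minimal base R sketch, not part of the original answer. The email local parts (`a`, `b`, ...) are invented for illustration, while the domains follow the outputs shown above. It also assumes every URL starts with `http://` or `https://` and has no trailing path.

```r
# Hypothetical sample data: local parts are made up, domains match the outputs above.
df <- data.frame(
  email = c("a@kjf.com", "b@def.com", "c@kjf.com", "d@kjf.com", "e@asdf.com"),
  website = c("http://www.kjf.com", "http://www.kjf.com", "http://www.kjf.com",
              "http://www.kjf.com", "http://www.asdf.com"),
  stringsAsFactors = FALSE
)

# Strip everything up to and including the "@" from each email.
email_domain <- sub(".*@", "", df$email)

# Strip the scheme and an optional "www." prefix from each URL.
site_domain <- sub("^https?://(www\\.)?", "", df$website)

# TRUE where the two domains differ.
mismatch <- email_domain != site_domain

df2 <- df[!mismatch, ]  # rows where the domains match
df3 <- df[mismatch, ]   # rows where the domains do not match
```

The subsets keep their original row names; `rownames(df2) <- NULL` renumbers them from 1 if that matters.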

Upvotes: 3

Werner

Reputation: 15105

You can create a column that identifies whether the email and website domains are the same:

library(tidyverse)

email <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]' , '[email protected]')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(
  email = email,
  website = website
)

df <- df %>% mutate(
  same = (email %>% str_sub(
    start = str_locate(., '@')[,'end'] + 1,
    end = -1L)) ==
    (website %>% str_sub(
      # fixed() matches 'www.' literally; as a regex, '.' would match any character
      start = str_locate(., fixed('www.'))[,'end'] + 1,
      end = -1L))
)

df2 <- df %>% filter(
  same
) %>% select(
  -same
)

df3 <- df %>% filter(
  !same
) %>% select(
  -same
)
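One caveat with locating `www.`: `str_locate()` returns `NA` when the pattern is absent, so a URL without a `www.` prefix would yield `NA` in `same`. A possible variant, sketched here under the assumption that URLs always start with `http://` or `https://` and have no trailing path, removes the prefixes with `str_remove()` instead. The sample emails and the `https://asdf.com` URL are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the last URL deliberately has no "www." prefix.
df <- data.frame(
  email = c("a@kjf.com", "b@def.com", "c@kjf.com", "d@kjf.com", "e@asdf.com"),
  website = c("http://www.kjf.com", "http://www.kjf.com", "http://www.kjf.com",
              "http://www.kjf.com", "https://asdf.com"),
  stringsAsFactors = FALSE
)

df <- df %>% mutate(
  # Drop everything up to "@" in the email, and the scheme plus an
  # optional "www." in the URL, then compare what remains.
  same = str_remove(email, ".*@") ==
    str_remove(website, "^https?://(www\\.)?")
)

df2 <- df %>% filter(same) %>% select(-same)   # matching domains
df3 <- df %>% filter(!same) %>% select(-same)  # non-matching domains
```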

Upvotes: 1
