Reputation: 985
I have a data frame with email addresses and website URLs. I would like to separate the rows where the email's domain matches the website's domain from the rows where it does not.
Say I have a df:
email <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]' , '[email protected]')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- as.data.frame(cbind(email,website))
which results in :
> df
email website
1 [email protected] http://www.kjf.com
2 [email protected] http://www.kjf.com
3 [email protected] http://www.kjf.com
4 [email protected] http://www.kjf.com
5 [email protected] http://www.asdf.com
I would like to dynamically create two data frames: one where the email's domain matches the website domain, like:
> df2
email website
1 [email protected] http://www.kjf.com
2 [email protected] http://www.kjf.com
3 [email protected] http://www.kjf.com
4 [email protected] http://www.asdf.com
and one that keeps the non-matching rows, like:
> df3
email website
1 [email protected] http://www.kjf.com
I think I should use a regex, but I am not sure. Does anybody see how this could be done? Thank you.
Upvotes: 1
Views: 590
Reputation: 160792
You can filter the rows with this comparison, which is TRUE where the email's domain does not match the website's domain:
gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)
# [1] FALSE TRUE FALSE FALSE FALSE
Breakdown:
gsub('.*@', '', df$email)
### .* zero or more of any character, followed by
### @ a literal at sign
# [1] "kjf.com" "def.com" "kjf.com" "kjf.com" "asdf.com"
and for the url:
gsub('https?://(www\\.)?', '', df$website)
### http literal string 'http'
### s? zero or one 's'
### :// literal string '://'
### (www\\.)? zero or one instance of the literal 'www.'
# [1] "kjf.com" "kjf.com" "kjf.com" "kjf.com" "asdf.com"
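To get the two data frames asked for, the logical vector can be used to subset the rows directly. This is a minimal base R sketch: the email local parts are hypothetical placeholders, the domains follow the output in the breakdown above, and the names `mismatch`, `df2` and `df3` are chosen for illustration.

```r
# Hypothetical local parts; domains follow the breakdown above.
email   <- c('a@kjf.com', 'b@def.com', 'c@kjf.com', 'd@kjf.com', 'e@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com',
             'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(email, website, stringsAsFactors = FALSE)

# TRUE where the email domain differs from the website domain
mismatch <- gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)

df2 <- df[!mismatch, ]   # rows whose domains match
df3 <- df[mismatch, ]    # rows whose domains do not match

# Subsetting keeps the original row names; reset them if 1..n is wanted
rownames(df2) <- NULL
rownames(df3) <- NULL
```

Note that this comparison treats anything left after the host (a trailing slash or a path) as part of the website domain, so URLs of that form would need extra cleaning first.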
Upvotes: 3
Reputation: 15105
You can create a column that identifies whether the email and website domains are the same:
library(tidyverse)
email <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]' , '[email protected]')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(
email = email,
website = website
)
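df <- df %>% mutate(
  # Compare everything after '@' in the email with everything after 'www.' in the URL.
  # fixed() makes the '.' in 'www.' a literal dot rather than a regex wildcard.
  same = str_sub(email, start = str_locate(email, fixed('@'))[, 'end'] + 1) ==
    str_sub(website, start = str_locate(website, fixed('www.'))[, 'end'] + 1)
)
# Rows where the domains match
df2 <- df %>% filter(same) %>% select(-same)
# Rows where the domains do not match
df3 <- df %>% filter(!same) %>% select(-same)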
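Locating 'www.' assumes every URL contains it. As a hedged alternative, the sketch below strips the scheme and an optional 'www.' with `str_remove()` instead, so a URL such as 'https://kjf.com' is handled too. The email local parts, the 'https://kjf.com' entry, and the helper columns `email_domain` and `site_domain` are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical data: local parts are placeholders, and row 3 uses a URL
# without 'www.' to show that it is still matched.
df <- data.frame(
  email = c('a@kjf.com', 'b@def.com', 'c@kjf.com', 'd@kjf.com', 'e@asdf.com'),
  website = c('http://www.kjf.com', 'http://www.kjf.com', 'https://kjf.com',
              'http://www.kjf.com', 'http://www.asdf.com'),
  stringsAsFactors = FALSE
)

df <- df %>% mutate(
  email_domain = str_remove(email, '^.*@'),                  # drop everything up to '@'
  site_domain  = str_remove(website, '^https?://(www\\.)?'), # drop scheme and optional 'www.'
  same         = email_domain == site_domain
)

df2 <- df %>% filter(same) %>% select(email, website)    # matching rows
df3 <- df %>% filter(!same) %>% select(email, website)   # non-matching rows
```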
Upvotes: 1