Reputation: 21
My data has rows containing institute names, usually with an email address at the end. I want to remove only the email addresses and keep the institutes (e.g. remove hello@canada).
df <- data.frame(institute = c(
"Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada",
"Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada",
"Aix-Marseille Universit.., Inserm, TAGC UMR S1090, 13288 Marseille, France. name@inserm",
"Applied Biological Sciences Program, Chulabhorn Graduate Institute, Bangkok, Thailand Laboratory of Biochemistry, Chulabhorn Research Institute, Bangkok, Thailand",
"Applied Biological Sciences Program, Chulabhorn Graduate Institute, Bangkok, Thailand Laboratory of Biochemistry, Chulabhorn Research Institute, Bangkok, Thailand [email protected]"))
My goal is to be able to count the same institutes as one, since in the format above, the email addresses make the rows distinct.
I tried the code below for the first institute, but it didn't remove the complete email address.
a <- "Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada"
gsub("[^.*?]@.*", "\\1", a)
# [1] "Air Quality Processes Research Section, Environment and Climate Change Canada, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hell"
Upvotes: 0
Views: 39
Reputation: 388907
You could use something like this:
df$clean_institute <- trimws(gsub('\\w+@.*$|Electronic address:|email address:',
'', df$institute))
This removes the word immediately before '@', the '@' itself, and everything after it. It also removes the labels 'Electronic address:' and 'email address:'.
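To see why the attempt in the question left "hell" behind, here is a small sketch comparing the two patterns on a shortened, hypothetical string (the names `a`, `attempt` and `fixed` are only for illustration). Inside square brackets, '.', '*' and '?' lose their special meaning, so "[^.*?]" matches exactly one character that is none of those three. The pattern in the question also has no capture group, so the "\\1" replacement contributes nothing.

```r
# Shortened, hypothetical version of the string from the question.
a <- "Canada. Electronic address: hello@canada"

# "[^.*?]" consumes only ONE character before '@' (here the "o"),
# so the rest of the local part ("hell") is left in place.
attempt <- sub("[^.*?]@.*", "", a)

# "\\w+" consumes the whole run of word characters before '@'.
fixed <- sub("\\w+@.*$", "", a)

print(attempt)  # "Canada. Electronic address: hell"
print(fixed)    # "Canada. Electronic address: "
```

The remaining label and trailing space are what the extra alternatives in the pattern and trimws take care of.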
Then use table to count:
table(df$clean_institute)
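One caveat, as far as I can tell: a row like "... Canada. Electronic address: hello@canada" still ends in a period after cleaning, so it would not be counted together with the row that has no email at all. Below is a minimal sketch of one way to handle that. It assumes it is acceptable to strip all trailing punctuation and to treat any run of non-space characters before '@' as part of the address. Both are my assumptions, and the sample rows are shortened placeholders rather than the original data.

```r
# Shortened, hypothetical rows modelled on the question's data.
institute <- c(
  "Research Section, Toronto, Ontario, M3H 5T4, Canada",
  "Research Section, Toronto, Ontario, M3H 5T4, Canada. Electronic address: hello@canada",
  "Graduate Institute, Bangkok, Thailand",
  "Graduate Institute, Bangkok, Thailand name@example"
)

# Step 1: remove the label and the address. "\\S+" (instead of "\\w+") also
# covers local parts containing dots or hyphens, e.g. "first.last@...".
clean <- gsub("\\S+@.*$|Electronic address:|email address:", "", institute,
              perl = TRUE)

# Step 2: drop trailing whitespace and punctuation so "Canada." and "Canada"
# end up identical. This would also strip a legitimate trailing ")" or similar.
clean <- sub("[[:space:][:punct:]]+$", "", clean)

# Step 3: count identical institutes.
counts <- table(clean)
print(counts)
```

If some institute names legitimately end in punctuation, narrowing step 2 to just periods and spaces may be the safer choice.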
Upvotes: 2