Reputation: 2690
I have a large dataset where some of the information needed is stored in the first column as a string separated by semicolons. For example:
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
Giving:
Information Data
1 Forrest;Trees;Unknown 5
2 Forrest;Trees;Leaves 1
3 Forrest;Trees;Trunks 3
4 Forrest;Shrubs;Unknown 4
5 Forrest;Shrubs;Branches 2
6 Forrest;Shrubs;Leaves 1
7 Forrest;Shrubs;NA 3
I need to simplify the names so that I only have the last unique name that isn't "Unknown" or "NA" such that my dataframe becomes:
Information Data
1 Trees;Unknown 5
2 Trees;Leaves 1
3 Trunks 3
4 Shrubs;Unknown 4
5 Branches 2
6 Shrubs;Leaves 1
7 Shrubs;NA 3
Upvotes: 0
Views: 142
Reputation: 606
Generally it is not recommended to put multiple variables in the same column but using dplyr should give you what you want:
TestData_filtered<-TestData%>%separate(Information,into=c("common","TS","BL"),remove=FALSE)%>%filter(!grepl("Unknown|NA",BL))%>%mutate(wanted=paste(TS,BL,sep=";"))
Upvotes: 0
Reputation: 46
Maybe it's not the most efficient or elegant solution, but it works on the sample data. Hope it's also adequate for your needs:
library(stringr)
library(dplyr)
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)
# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
filter(!V5 %in% c("Unknown", "NA")) %>%
group_by(V5) %>%
summarise(count = n())
# join back to original data
TestData <- TestData %>%
left_join(a)
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count), paste0(TestData$V4, ";", TestData$V5), TestData$V5)
Upvotes: 1