user2165857
user2165857

Reputation: 2690

Summarize Data String in R

I have a large dataset where some of the information needed is stored in the first column as a string separated by semicolons. For example:

TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))

Giving:

    Information Data
1   Forrest;Trees;Unknown   5
2   Forrest;Trees;Leaves    1
3   Forrest;Trees;Trunks    3
4   Forrest;Shrubs;Unknown  4
5   Forrest;Shrubs;Branches 2
6   Forrest;Shrubs;Leaves   1
7   Forrest;Shrubs;NA   3

I need to simplify the names so that I only have the last unique name that isn't "Unknown" or "NA" such that my dataframe becomes:

    Information Data
1   Trees;Unknown   5
2   Trees;Leaves    1
3   Trunks  3
4   Shrubs;Unknown  4
5   Branches    2
6   Shrubs;Leaves   1
7   Shrubs;NA   3

Upvotes: 0

Views: 142

Answers (2)

thisisrg
thisisrg

Reputation: 606

Generally it is not recommended to put multiple variables in the same column but using dplyr should give you what you want:

TestData_filtered<-TestData%>%separate(Information,into=c("common","TS","BL"),remove=FALSE)%>%filter(!grepl("Unknown|NA",BL))%>%mutate(wanted=paste(TS,BL,sep=";"))

Upvotes: 0

SRus
SRus

Reputation: 46

Maybe it's not the most efficient or elegant solution, but it works on the sample data. Hope it's also adequate for your needs:

library(stringr)
library(dplyr)


TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))

# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)

# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())

# join back to original data
TestData <- TestData %>%
  left_join(a)


TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count), paste0(TestData$V4, ";", TestData$V5), TestData$V5)

Upvotes: 1

Related Questions