Reputation: 593
I have the following dataframe with definitions on names and cities which I would like to process, I'm not sure how to explain it so I included tables below with input and output.
Input:
+---------+-----------+------------+---------------+
| Varname | Component | names | cities |
+---------+-----------+------------+---------------+
| A | B | Jack,Bruce | New york |
| B | | Cathy | Boston,Miami |
| C | | Bob | New york |
| D | C | Dick,Nancy | Austin,Dallas |
| E | A,C | | |
| F | | Mandy | Manchester |
+---------+-----------+------------+---------------+
output:
+---------+-----------+----------------------+------------------------+
| Varname | Component | names | cities |
+---------+-----------+----------------------+------------------------+
| A | | Jack,Bruce,Cathy | New york,Boston,Miami |
| B | | Cathy | Boston,Miami |
| C | | Bob | New york |
| D | | Dick,Nancy,Bob | Austin,Dallas,New york |
| E | | Jack,Bruce,Cathy,Bob | New york,Boston,Miami |
| F | | Mandy | Manchester |
+---------+-----------+----------------------+------------------------+
As you hopefully see, I would like to take the component column, and for each of those Varnames in that column, look up the names and cities (and in reality I have slight more columns) and combine them so that I have a complete table. Is this possible? I have no idea where to start. My tables are not huge, so an for(){} statement could be applied.
->edit, I may have not have given a correct example, I have replace the input with something that is more consistent with my data.
dput() of input
structure(list(Varname = structure(1:6, .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"), Component = structure(c(3L, 1L, 1L, 4L, 2L, 1L), .Label = c("", "A,C", "B", "C"), class = "factor"), names = structure(c(5L, 3L, 2L, 4L, 1L, 6L), .Label = c("", "Bob", "Cathy", "Dick,Nancy", "Jack,Bruce", "Mandy"), class = "factor"), cities = structure(c(5L, 3L, 5L, 2L, 1L, 4L), .Label = c("", "Austin,Dallas", "Boston,Miami", "Manchester", "New york" ), class = "factor")), .Names = c("Varname", "Component", "names", "cities"), class = "data.frame", row.names = c(NA, -6L ))
Upvotes: 3
Views: 173
Reputation: 13284
Not the most appealing piece of R code (and definitely not the most efficient), but it gets the job done. Hopefully, someone else can improve it.
starting_df <- read.table(text="Varname|Component|names|cities
A||Jack,Bruce|New york
B||Cathy|Boston,Miami
C|A|Bob|New york
D|C|Dick,Nancy|Austin,Dallas",header=T, sep="|", stringsAsFactors=F)
##Grab all the rows whose Component values are in the Varname column and vice-versa
intermediate_df <- starting_df[(starting_df$Varname %in% starting_df$Component | starting_df$Component %in% starting_df$Varname ),]
##Change the rows in the names and cities columns to match your desired output (sorry about the for loop)
for (x in 1:nrow(intermediate_df)) {
if (x == 1) {
intermediate_df[x,'names'] <- intermediate_df$names[x]
intermediate_df[x,'cities'] <- intermediate_df$cities[x]
} else {
intermediate_df[x,'names'] <- paste0(unique(unlist(strsplit(paste(intermediate_df$names[x-1],intermediate_df$names[x],sep = ","),split=","))),collapse=",")
intermediate_df[x,'cities'] <- paste0(unique(unlist(strsplit(paste(intermediate_df$cities[x-1],intermediate_df$cities[x],sep = ","),split=","))),collapse=",")
}
}
##Binding the new dataset with the starting dataset (but only Varnames that are in the new dataset)
final_df <- rbind(intermediate_df,starting_df[!(starting_df$Varname %in% intermediate_df$Varname),])
##Order by the Varname column to get the desired output
final_df <- final_df[order(final_df$Varname),]
Your desired output:
Varname Component names cities
A Jack,Bruce New york
B Cathy Boston,Miami
C A Jack,Bruce,Bob New york
D C Jack,Bruce,Bob,Dick,Nancy New york,Austin,Dallas
This one uses quite the round of for loops
(something I don't like to do at all in R), but it seems to yield something:
##Setting up the new dataset
starting_df1 <- structure(list(Varname = structure(1:6, .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"),
Component = structure(c(3L, 1L, 1L, 4L, 2L, 1L), .Label = c("", "A,C", "B", "C"), class = "factor"),
names = structure(c(5L, 3L, 2L, 4L, 1L, 6L), .Label = c("", "Bob", "Cathy", "Dick,Nancy", "Jack,Bruce", "Mandy"), class = "factor"),
cities = structure(c(5L, 3L, 5L, 2L, 1L, 4L), .Label = c("", "Austin,Dallas", "Boston,Miami", "Manchester", "New york" ), class = "factor")),
.Names = c("Varname", "Component", "names", "cities"), class = "data.frame", row.names = c(NA, -6L ))
##Change the fields from factor variables to characters (so that you can use them for concatenating)
starting_df1 <- data.frame(apply(starting_df1, 2, FUN = function(x) {
as.character(x)
}), stringsAsFactors = F)
##Nested for loops: For every row that has a value for the Component column, find its matches (and their indices) in the Varname column
##Then for the combination of indices to change the values you wish to change through concatenation operations for both the names and cities columns
for (i in which(!nchar(starting_df1$Component)==0)) {
holder <- which(grepl(paste0(unlist(strsplit(starting_df1$Component[i],split=",")),collapse="|"),starting_df1$Varname))
for (j in holder) {
if (nchar(starting_df1$names[i])!=0) {
starting_df1[i, "names"] <- paste0(unique(unlist(strsplit(paste(starting_df1$names[i],starting_df1$names[j],sep = ","),split=","))),collapse=",")
starting_df1[i, "cities"] <- paste0(unique(unlist(strsplit(paste(starting_df1$cities[i],starting_df1$cities[j],sep = ","),split=","))),collapse=",")
} else {
starting_df1[i, "names"] <- starting_df1$names[j]
starting_df1[i, "cities"] <- starting_df1$cities[j]
}
}
}
print(starting_df1, row.names = F, right = F)
Varname Component names cities
A B Jack,Bruce,Cathy New york,Boston,Miami
B Cathy Boston,Miami
C Bob New york
D C Dick,Nancy,Bob Austin,Dallas,New york
E A,C Jack,Bruce,Cathy,Bob New york,Boston,Miami
F Mandy Manchester
Upvotes: 1