Reputation: 21
This seems like it should be a fairly simple problem, but I can't seem to find a straightforward solution.
I have a character list that looks like this:
my_info <- c("Fruits",
"North America",
"Apples",
"Michigan",
"Europe",
"Pomegranates",
"Greece",
"Oranges",
"Italy",
"Vegetables",
"North America",
"Potatoes",
"Idaho",
"Avocados",
"California",
"Europe",
"Artichokes",
"Italy",
"Meats",
"North America",
"Beef",
"Illinois")
I want to parse this character vector into a data frame that looks like this:
The food types and the region lists will always remain the same, but the foods and their locations are subject to change.
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
I was thinking I needed to use something like str_split, but use the food_types and regions as some sort of a delimiter? But I am not sure how to proceed. The character vector does have an order to it.
Thank you.
Upvotes: 2
Views: 4371
Reputation: 270075
Here are three alternatives. All of them use na.locf0
from zoo and the cn
vector only shown in the first.
1) Let cn
be a vector the same length as my_info
which identifies which column number of the output that element of my_info
belongs to. Let cdef
be an output column definition vector of 1:4 with the output column names as its names. Then for each output column create a vector the same length as my_info
whose rows corresponding to that column and NAs for the other elements. Then use na.locf0
to fill in the NA values and take the elements corresponding to column 4.
library(zoo)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])
giving:
> m
food_type region food location
[1,] "Fruits" "North America" "Apples" "Michigan"
[2,] "Fruits" "Europe" "Pomegranates" "Greece"
[3,] "Fruits" "Europe" "Oranges" "Italy"
[4,] "Vegetables" "North America" "Potatoes" "Idaho"
[5,] "Vegetables" "North America" "Avocados" "California"
[6,] "Vegetables" "Europe" "Artichokes" "Italy"
[7,] "Meats" "North America" "Beef" "Illinois"
We have created character matrix output since the output is entirely character but if you want a data frame anyways then use:
as.data.frame(mm, stringsAsFactors = FALSE)
2) Alternately, we can create m
from cn
by putting my_info[i]
into position (i, cn[i]) of an n x 4 matrix mm of NAs, using na.locf
to fill in the NAs and taking those rows corresponding to column 4.
n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")
identical(m2, m) # test
## [1] TRUE
3) A third alternative for creating m
from cn
is to construct the matrix column by column like this:
m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3],
region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3],
food = my_info[cn == 3],
location = my_info[cn == 4])
identical(m, m3) # test
## [1] TRUE
Upvotes: 0
Reputation: 23608
I have a long solution, but should work as long as food and location are always in the same order.
First creating a few data.frames with dplyr
.
library(dplyr)
info <- data_frame(my_info = my_info)
region <- data_frame(region_id = region, region = region)
food_type <- data_frame(food_type_id = food_type, food_type)
Next creating a data.frame that joins all of these together and fill missing values with tidyr
and removing the rows we do not need. Then The most important trick is the last one, creating a cols column based on the assumption that the order is always the same!
library(tidyr)
df <- info %>%
left_join(food_type, by = c("my_info" = "food_type_id")) %>%
left_join(region, by = c("my_info" = "region_id")) %>%
fill(food_type) %>%
group_by(food_type) %>%
fill(region) %>%
filter(!is.na(region) & !(my_info == region)) %>%
ungroup %>%
mutate(cols = rep(c("food", "location"), group_size(.)/2 ))
This returns:
# A tibble: 14 x 4
my_info food_type region cols
<chr> <chr> <chr> <chr>
1 Apples Fruits North America food
2 Michigan Fruits North America location
3 Pomegranates Fruits Europe food
4 Greece Fruits Europe location
5 Oranges Fruits Europe food
6 Italy Fruits Europe location
7 Beef Meats North America food
8 Illinois Meats North America location
9 Potatoes Vegetables North America food
10 Idaho Vegetables North America location
11 Avocados Vegetables North America food
12 California Vegetables North America location
13 Artichokes Vegetables Europe food
14 Italy Vegetables Europe location
Next use tidyr
to spread the cols into food and location columns.
df <- df %>%
group_by(food_type, region, cols) %>%
mutate(ind = row_number()) %>%
spread(cols, my_info) %>%
select(-ind)
# A tibble: 7 x 4
# Groups: food_type, region [5]
food_type region food location
<chr> <chr> <chr> <chr>
1 Fruits Europe Pomegranates Greece
2 Fruits Europe Oranges Italy
3 Fruits North America Apples Michigan
4 Meats North America Beef Illinois
5 Vegetables Europe Artichokes Italy
6 Vegetables North America Potatoes Idaho
7 Vegetables North America Avocados California
This can all be done in one go, just remove the intermediate step of creating a data.frame.
Upvotes: 0
Reputation: 20095
One solution can be to first convert your my_info
vector in a matrix using ncol = 4
. This will split your vector in a matrix/data frame.
Now, you can apply the rule of for food_type
and region
and swap any food_type
or region
which is present in other columns.
Note: I request OP to check data once, it seems every 4 elements are not able to make a complete row with description provided by OP.
df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))
names(df) <- c("Foodtype", "Region", "Food", "Location")
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
t(apply(df,1,function(x){
for(i in seq_along(x)){
#One can think of writing a swap function here.
if(x[i] %in% region ){
temp = x[i]
x[i] = x[2]
x[2] = temp
}
#Swap any food_type wrongly placed in other column
if(x[i] %in% food_type ){
temp = x[i]
x[i] = x[1]
x[1] = temp
}
}
x
}))
# Foodtype Region Food Location
# [1,] "Fruits" "North America" "Apples" "Michigan"
# [2,] "Pomegranates" "Europe" "Greece" "Oranges"
# [3,] "Vegetables" "North America" "Italy" "Potatoes"
# [4,] "Idaho" "Europe" "California" "Avocados"
# [5,] "Meats" "North America" "Artichokes" "Italy"
# [6,] "Fruits" "North America" "Beef" "Illinois"
#
Upvotes: 2