Mac
Mac

Reputation: 21

R - Convert a Character Vector into a Data Frame

This seems like it should be a fairly simple problem, but I can't seem to find a straightforward solution.

I have a character list that looks like this:

my_info <- c("Fruits",
             "North America",
             "Apples",
             "Michigan",
             "Europe",
             "Pomegranates",
             "Greece",
             "Oranges",
             "Italy",
             "Vegetables",
             "North America",
             "Potatoes",
             "Idaho",
             "Avocados",
             "California",
             "Europe",
             "Artichokes",
             "Italy",
             "Meats",
             "North America",
             "Beef",
             "Illinois")

I want to parse this character vector into a data frame that looks like this:

screenshot of R console

The food types and the region lists will always remain the same, but the foods and their locations are subject to change.

food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")

I was thinking I needed to use something like str_split, but use the food_types and regions as some sort of a delimiter? But I am not sure how to proceed. The character vector does have an order to it.

Thank you.

Upvotes: 2

Views: 4371

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 270075

Here are three alternatives. All of them use na.locf0 from zoo and the cn vector only shown in the first.

1) Let cn be a vector the same length as my_info which identifies which column number of the output that element of my_info belongs to. Let cdef be an output column definition vector of 1:4 with the output column names as its names. Then for each output column create a vector the same length as my_info whose rows corresponding to that column and NAs for the other elements. Then use na.locf0 to fill in the NA values and take the elements corresponding to column 4.

library(zoo)

cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4

cdef <- c(food_type = 1, region = 2, food = 3, location = 4)

m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])

giving:

> m
     food_type    region          food           location    
[1,] "Fruits"     "North America" "Apples"       "Michigan"  
[2,] "Fruits"     "Europe"        "Pomegranates" "Greece"    
[3,] "Fruits"     "Europe"        "Oranges"      "Italy"     
[4,] "Vegetables" "North America" "Potatoes"     "Idaho"     
[5,] "Vegetables" "North America" "Avocados"     "California"
[6,] "Vegetables" "Europe"        "Artichokes"   "Italy"     
[7,] "Meats"      "North America" "Beef"         "Illinois"  

We have created character matrix output since the output is entirely character but if you want a data frame anyways then use:

as.data.frame(mm, stringsAsFactors = FALSE)

2) Alternately, we can create m from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix mm of NAs, using na.locf to fill in the NAs and taking those rows corresponding to column 4.

n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")

identical(m2, m) # test
## [1] TRUE

3) A third alternative for creating m from cn is to construct the matrix column by column like this:

m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3], 
        region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3], 
        food = my_info[cn == 3], 
        location = my_info[cn == 4])

identical(m, m3) # test
## [1] TRUE

Upvotes: 0

phiver
phiver

Reputation: 23608

I have a long solution, but should work as long as food and location are always in the same order.

First creating a few data.frames with dplyr.

library(dplyr)

info <- data_frame(my_info = my_info) 
region <- data_frame(region_id = region, region = region)
food_type <- data_frame(food_type_id = food_type, food_type)

Next creating a data.frame that joins all of these together and fill missing values with tidyr and removing the rows we do not need. Then The most important trick is the last one, creating a cols column based on the assumption that the order is always the same!

library(tidyr)

df <- info %>% 
  left_join(food_type, by = c("my_info" = "food_type_id")) %>% 
  left_join(region, by = c("my_info" = "region_id")) %>% 
  fill(food_type) %>% 
  group_by(food_type) %>% 
  fill(region) %>% 
  filter(!is.na(region) & !(my_info == region)) %>% 
  ungroup %>% 
  mutate(cols = rep(c("food", "location"), group_size(.)/2 ))

This returns:

# A tibble: 14 x 4
   my_info      food_type  region        cols    
   <chr>        <chr>      <chr>         <chr>   
 1 Apples       Fruits     North America food    
 2 Michigan     Fruits     North America location
 3 Pomegranates Fruits     Europe        food    
 4 Greece       Fruits     Europe        location
 5 Oranges      Fruits     Europe        food    
 6 Italy        Fruits     Europe        location
 7 Beef         Meats      North America food    
 8 Illinois     Meats      North America location
 9 Potatoes     Vegetables North America food    
10 Idaho        Vegetables North America location
11 Avocados     Vegetables North America food    
12 California   Vegetables North America location
13 Artichokes   Vegetables Europe        food    
14 Italy        Vegetables Europe        location

Next use tidyr to spread the cols into food and location columns.

df <- df %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>% 
  spread(cols, my_info) %>% 
  select(-ind)

# A tibble: 7 x 4
# Groups:   food_type, region [5]
  food_type  region        food         location  
  <chr>      <chr>         <chr>        <chr>     
1 Fruits     Europe        Pomegranates Greece    
2 Fruits     Europe        Oranges      Italy     
3 Fruits     North America Apples       Michigan  
4 Meats      North America Beef         Illinois  
5 Vegetables Europe        Artichokes   Italy     
6 Vegetables North America Potatoes     Idaho     
7 Vegetables North America Avocados     California

This can all be done in one go, just remove the intermediate step of creating a data.frame.

Upvotes: 0

MKR
MKR

Reputation: 20095

One solution can be to first convert your my_info vector in a matrix using ncol = 4. This will split your vector in a matrix/data frame.

Now, you can apply the rule of for food_type and region and swap any food_type or region which is present in other columns.

Note: I request OP to check data once, it seems every 4 elements are not able to make a complete row with description provided by OP.

df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))

names(df) <- c("Foodtype", "Region", "Food", "Location")

food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")

t(apply(df,1,function(x){
  for(i in seq_along(x)){
    #One can think of writing a swap function here. 
    if(x[i] %in% region ){
      temp = x[i]
      x[i] = x[2]
      x[2] = temp
    }
    #Swap any food_type wrongly placed in other column
    if(x[i] %in% food_type ){
      temp = x[i]
      x[i] = x[1]
      x[1] = temp
    }

  }
  x
}))


#       Foodtype       Region          Food         Location  
# [1,] "Fruits"       "North America" "Apples"     "Michigan"
# [2,] "Pomegranates" "Europe"        "Greece"     "Oranges" 
# [3,] "Vegetables"   "North America" "Italy"      "Potatoes"
# [4,] "Idaho"        "Europe"        "California" "Avocados"
# [5,] "Meats"        "North America" "Artichokes" "Italy"   
# [6,] "Fruits"       "North America" "Beef"       "Illinois"
# 

Upvotes: 2

Related Questions