Fishing101
Fishing101

Reputation: 79

Extracting last word from many data frame columns (R)

I have a dataframe that contains a 3 columns. The data looks like this

V1                V2               V3
Auto = Chevy      Engine = V6      Trans = Auto
Auto = Chevy      Engine = V8      Trans = Manual
Auto = Chevy      Engine = V10     Trans = Manual

I want the dataframe to look like this:

Auto       Engine  Trans
Chevy      V6      Auto
Chevy      V8      Manual
Chevy      V10     Manual

In other words, retrieve the last string after the "=" and take the 1st value in the column and make it the column header. Or a way to just retrieve the last word of after the "=" and replace it the column without adding new columns.

Can this be done in R? Many thanks!

Upvotes: 7

Views: 2254

Answers (3)

akrun
akrun

Reputation: 887148

We can do this with base R only options

1) Using scan and sub - Remove the substring = followed by white space with sub after converting data.frame to matrix, then use scan to return a vector of words. Based on the recycling of logical vector (c(FALSE, TRUE)), we get the alternating words in 'v1' and assign the output to 'df2' while we change the column names with the unique elements of alternate values extracted from 'v1' using c(TRUE, FALSE) as logical recycling vector.

df2 <- df1
v1 <- scan(text=sub("=\\s+", "", as.matrix(df1)), what="", sep=" ", quiet=TRUE)
df2[] <- v1[c(FALSE, TRUE)]
colnames(df2) <- unique(v1[c(TRUE, FALSE)])
df2
#   Auto Engine  Trans
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual

2) Using sub - Extract the last word by capturing it as a group and replacing it with the backreference (\\1) after looping through the columns (lapply(df1, ..)

df2[] <- lapply(df1, function(x) sub(".*\\b(\\w+)$", "\\1", x))

3) Using strsplit - Split the string by delimiter ("=\\s+) and get the last element (tail, 1) while looping through the columns as in 2)

df2[] <- lapply(df1, function(x) sapply(strsplit(x, "=\\s+"), tail, 1))

We change the columns in 2nd and 3rd solutions by extracting the first word with sub on the unlistted first row

colnames(df2) <- sub("\\s+=.*", "", unlist(df1[1,], use.names = FALSE))

Or other options are based on package solutions

1) Using str_extract - Extract the word (\\w+) before the end $ of the string by looping over the columns with lapply and assign the list output to a copy of the original dataset ('df2'). Then, we change the column name, by extracting the first word using sub on the unlisted first row of original dataset.

library(stringr)
df2[] <- lapply(df1, function(x) str_extract(x, "\\w+$"))
colnames(df2) <- word(unlist(df1[1,]), 1)
df2
#   Auto Engine  Trans
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual

2) Using tidyverse

library(dplyr)
library(tidyr)
gather(df1) %>% 
      separate(value, into = c("header", "value")) %>%
      group_by(key) %>%
      mutate(i1 = row_number()) %>% 
      ungroup() %>% 
      select(-key) %>% 
      spread(header, value) %>%
      select(-i1)
# A tibble: 3 × 3
#   Auto Engine  Trans
#* <chr>  <chr>  <chr>
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual

data

df1 <- structure(list(V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"
), V2 = c("Engine = V6", "Engine = V8", "Engine = V10"), V3 = c("Trans = Auto", 
"Trans = Manual", "Trans = Manual")), .Names = c("V1", "V2", 
"V3"), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 3

hrbrmstr
hrbrmstr

Reputation: 78792

Or, we could avoid the stringr crutch and use a highly optimized function for just such this use case in stringi (most of stringr functions wrap stringi functions):

library(stringi)
library(dplyr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, funs(stri_extract_last_words))
##      V1  V2     V3
## 1 Chevy  V6   Auto
## 2 Chevy  V8 Manual
## 3 Chevy V10 Manual

More representative tidyverse with the "column name" req that could actually break your R script if the columns aren't as you imagine:

library(stringi)
library(dplyr)
library(purrr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, funs(stri_extract_last_words)) %>%
  setNames(mutate_all(df, stri_extract_first_words) %>%
             distinct() %>%
             flatten_chr())

More tidyverse and stringi with the very much assumed requirements that could actually break your R script if the columns aren't as you imagine:

library(stringi)
library(tidyverse)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

by_row(df, function(x) {
  map(x, stri_match_all_regex, "(.*) = (.*)") %>%
    map(1) %>%
    map(~setNames(.[,3], .[,2])) %>%
    flatten_df()
}) %>%
  select(.out) %>%
  unnest()
## # A tibble: 3 × 3
##    Auto Engine  Trans
##   <chr>  <chr>  <chr>
## 1 Chevy     V6   Auto
## 2 Chevy     V8 Manual
## 3 Chevy    V10 Manual

Upvotes: 4

JWLM
JWLM

Reputation: 442

Well, if you don't mind just using old-style (pre-Hadley) R, here's a solution:

> x <- as.data.frame(list(c('Auto = Chevy', 'Auto = Chevy', 'Auto = Chevy'),
+ c('Engine = V6', 'Engine = V8', 'Engine = V10'),
+ c('Trans = Auto', 'Trans = Manual', 'Trans = Manual')),
+ stringsAsFactors=FALSE)
> values <- lapply(x, gsub, pattern='.*= ', replacement='')
> new.names <- lapply(x, gsub, pattern=' =.*', replacement='')
> new.names <- lapply(new.names, unique)
> names(values) <- new.names
> new.frame <- as.data.frame(values, stringsAsFactors = FALSE)
> new.frame
   Auto Engine  Trans
1 Chevy     V6   Auto
2 Chevy     V8 Manual
3 Chevy    V10 Manual

It won't work for a data frame with many columns, but it will work for a narrow data frame with many rows.

Upvotes: 5

Related Questions