Reputation: 79
I have a dataframe that contains a 3 columns. The data looks like this
V1 V2 V3
Auto = Chevy Engine = V6 Trans = Auto
Auto = Chevy Engine = V8 Trans = Manual
Auto = Chevy Engine = V10 Trans = Manual
I want the dataframe to look like this:
Auto Engine Trans
Chevy V6 Auto
Chevy V8 Manual
Chevy V10 Manual
In other words, retrieve the last string after the "=" and take the 1st value in the column and make it the column header. Or a way to just retrieve the last word of after the "=" and replace it the column without adding new columns.
Can this be done in R? Many thanks!
Upvotes: 7
Views: 2254
Reputation: 887148
We can do this with base R
only options
1) Using scan
and sub
- Remove the substring =
followed by white space with sub
after converting data.frame
to matrix
, then use scan
to return a vector
of words. Based on the recycling of logical vector (c(FALSE, TRUE)
), we get the alternating words in 'v1' and assign the output to 'df2' while we change the column names with the unique
elements of alternate values extracted from 'v1' using c(TRUE, FALSE)
as logical recycling vector
.
df2 <- df1
v1 <- scan(text=sub("=\\s+", "", as.matrix(df1)), what="", sep=" ", quiet=TRUE)
df2[] <- v1[c(FALSE, TRUE)]
colnames(df2) <- unique(v1[c(TRUE, FALSE)])
df2
# Auto Engine Trans
#1 Chevy V6 Auto
#2 Chevy V8 Manual
#3 Chevy V10 Manual
2) Using sub
- Extract the last word by capturing it as a group and replacing it with the backreference (\\1
) after looping through the columns (lapply(df1, ..
)
df2[] <- lapply(df1, function(x) sub(".*\\b(\\w+)$", "\\1", x))
3) Using strsplit
- Split the string by delimiter ("=\\s+
) and get the last element (tail, 1
) while looping through the columns as in 2)
df2[] <- lapply(df1, function(x) sapply(strsplit(x, "=\\s+"), tail, 1))
We change the columns in 2nd and 3rd solutions by extracting the first word with sub
on the unlist
ted first row
colnames(df2) <- sub("\\s+=.*", "", unlist(df1[1,], use.names = FALSE))
Or other options are based on package solutions
1) Using str_extract
- Extract the word (\\w+
) before the end $
of the string by looping over the columns with lapply
and assign the list
output to a copy of the original dataset ('df2'). Then, we change the column name, by extracting the first word using sub
on the unlist
ed first row of original dataset.
library(stringr)
df2[] <- lapply(df1, function(x) str_extract(x, "\\w+$"))
colnames(df2) <- word(unlist(df1[1,]), 1)
df2
# Auto Engine Trans
#1 Chevy V6 Auto
#2 Chevy V8 Manual
#3 Chevy V10 Manual
2) Using tidyverse
library(dplyr)
library(tidyr)
gather(df1) %>%
separate(value, into = c("header", "value")) %>%
group_by(key) %>%
mutate(i1 = row_number()) %>%
ungroup() %>%
select(-key) %>%
spread(header, value) %>%
select(-i1)
# A tibble: 3 × 3
# Auto Engine Trans
#* <chr> <chr> <chr>
#1 Chevy V6 Auto
#2 Chevy V8 Manual
#3 Chevy V10 Manual
df1 <- structure(list(V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"
), V2 = c("Engine = V6", "Engine = V8", "Engine = V10"), V3 = c("Trans = Auto",
"Trans = Manual", "Trans = Manual")), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 3
Reputation: 78792
Or, we could avoid the stringr
crutch and use a highly optimized function for just such this use case in stringi
(most of stringr
functions wrap stringi
functions):
library(stringi)
library(dplyr)
read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df
mutate_all(df, funs(stri_extract_last_words))
## V1 V2 V3
## 1 Chevy V6 Auto
## 2 Chevy V8 Manual
## 3 Chevy V10 Manual
More representative tidyverse with the "column name" req that could actually break your R script if the columns aren't as you imagine:
library(stringi)
library(dplyr)
library(purrr)
read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df
mutate_all(df, funs(stri_extract_last_words)) %>%
setNames(mutate_all(df, stri_extract_first_words) %>%
distinct() %>%
flatten_chr())
More tidyverse and stringi
with the very much assumed requirements that could actually break your R script if the columns aren't as you imagine:
library(stringi)
library(tidyverse)
read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df
by_row(df, function(x) {
map(x, stri_match_all_regex, "(.*) = (.*)") %>%
map(1) %>%
map(~setNames(.[,3], .[,2])) %>%
flatten_df()
}) %>%
select(.out) %>%
unnest()
## # A tibble: 3 × 3
## Auto Engine Trans
## <chr> <chr> <chr>
## 1 Chevy V6 Auto
## 2 Chevy V8 Manual
## 3 Chevy V10 Manual
Upvotes: 4
Reputation: 442
Well, if you don't mind just using old-style (pre-Hadley) R, here's a solution:
> x <- as.data.frame(list(c('Auto = Chevy', 'Auto = Chevy', 'Auto = Chevy'),
+ c('Engine = V6', 'Engine = V8', 'Engine = V10'),
+ c('Trans = Auto', 'Trans = Manual', 'Trans = Manual')),
+ stringsAsFactors=FALSE)
> values <- lapply(x, gsub, pattern='.*= ', replacement='')
> new.names <- lapply(x, gsub, pattern=' =.*', replacement='')
> new.names <- lapply(new.names, unique)
> names(values) <- new.names
> new.frame <- as.data.frame(values, stringsAsFactors = FALSE)
> new.frame
Auto Engine Trans
1 Chevy V6 Auto
2 Chevy V8 Manual
3 Chevy V10 Manual
It won't work for a data frame with many columns, but it will work for a narrow data frame with many rows.
Upvotes: 5