Reputation: 185
So this is what one of my column looks like :
Infos |
---|
NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA |
NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC |
I'd like to create four columns :
NAME | SURNAME | AGE | CITY |
---|---|---|---|
ANGELA | SMITH | 22 | LA |
ANDREW | D'ONOFRIO | 47 | NYC |
I read that we can use "separate" from tidyverse, and this what i tried.
library(tidyr)
library(tidyverse)
df <- infos %>% separate(Infos, c("NAME", "SURNAME","AGE","CITY"))
But this is the output :
NAME | SURNAME | AGE | CITY |
---|---|---|---|
NAME | ANGELA | SURNAME | SMITH |
NAME | ANDREW | SURNAME | D'ONOFRIO |
Then i'd like to understand how to make R knows what it has to separate. Maybe that this exact topic have been treated here before (but i didn't find it) so feel free to redirect me if necessary !
Upvotes: 3
Views: 910
Reputation: 388982
Base R option using strcapture
-
strcapture('NAME:\\s*(.*)\\s*SURNAME:\\s*(.*)\\s*AGE:\\s*(.*)\\s*CITY:\\s*(.*)',
infos$Infos, proto = list(NAME = character(),
SURNAME = character(), AGE = numeric(), CITY = character()))
# NAME SURNAME AGE CITY
#1 ANGELA SMITH 22 LA
#2 ANDREW D'ONOFRIO 47 NYC
Upvotes: 2
Reputation: 269596
1) extract Use extract with the pattern shown. The test data did not have any spaces within the contents of the fields but even if it did this should work.
library(dplyr)
library(tidyr)
pat <- "NAME: *(.*) SURNAME: *(.*) AGE: *(.*) CITY: *(.*)"
dat %>%
extract(Infos, c("NAME", "SURNAME", "AGE", "CITY"), pat, convert = TRUE)
## NAME SURNAME AGE CITY
## 1 ANGELA SMITH 22 LA
## 2 ANDREW D'ONOFRIO 47 NYC
2) Base R Alternately using only base R we get this general solution which will continue to work even if the number of columns or their names change. This should also work if there are spaces in the contents of the fields. It works by converting Infos into dcf format followed by read.dcf.
dat |>
with(gsub("(\\w+:)", "\n\\1", Infos)) |>
textConnection() |>
read.dcf() |>
as.data.frame() |>
type.convert(as.is = TRUE)
## NAME SURNAME AGE CITY
## 1 ANGELA SMITH 22 LA
## 2 ANDREW D'ONOFRIO 47 NYC
dat in reproducible form:
dat <-
structure(list(Infos = c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA",
"NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC")), class = "data.frame", row.names = c(NA,
-2L))
Upvotes: 6
Reputation: 26218
One more strategy
df <- structure(list(Infos = c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA",
"NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC")), class = "data.frame", row.names = c(NA,
-2L))
library(tidyverse)
df %>%
mutate(Infos = gsub('\\:\\s*', ':', Infos)) %>%
separate_rows(Infos, sep = '\\s') %>%
separate(Infos, into = c('N', 'V'), sep = ':') %>%
pivot_wider(names_from = N, values_from = V, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 4
#> NAME SURNAME AGE CITY
#> <chr> <chr> <chr> <chr>
#> 1 ANGELA SMITH 22 LA
#> 2 ANDREW D'ONOFRIO 47 NYC
Created on 2021-07-15 by the reprex package (v2.0.0)
Upvotes: 2
Reputation: 78927
Here is another solution using str_squish
, str_replace_all
and separate
library(dplyr)
library(stringr)
df %>%
mutate(Infos = str_squish(str_replace_all(Infos, ":", " "))) %>%
separate(Infos, c("helper1", "Name", "helper2", "Surname", "helper3", "Age", "helper4","City"), sep = " ") %>%
select(-starts_with("helper"))
Output:
Name Surname Age City
<chr> <chr> <chr> <chr>
1 ANGELA SMITH 22 LA
2 ANDREW D'ONOFRIO 47 NYC
Upvotes: 1
Reputation: 5398
You can insert dummy columns, then delete them.
tibble(dat=c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA",
"NAME: ANDREW SURNAME: DONOFRIO AGE:47 CITY: NYC")) %>%
separate(dat, c("DEL1", "NAME", "DEL2", "SURNAME", "DEL3", "AGE", "DEL4", "CITY")) %>%
select(-DEL1, -DEL2, -DEL3, -DEL4)
NAME SURNAME AGE CITY ANGELA SMITH 22 LA ANDREW DONOFRIO 47 NYC
Upvotes: 0