katdataecon
katdataecon

Reputation: 185

Splitting one cell into multiple columns in R

So this is what one of my column looks like :

Infos
NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA
NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC

I'd like to create four columns :

NAME SURNAME AGE CITY
ANGELA SMITH 22 LA
ANDREW D'ONOFRIO 47 NYC

I read that we can use "separate" from tidyverse, and this what i tried.

library(tidyr)
library(tidyverse)

df <- infos %>% separate(Infos, c("NAME", "SURNAME","AGE","CITY"))

But this is the output :

NAME SURNAME AGE CITY
NAME ANGELA SURNAME SMITH
NAME ANDREW SURNAME D'ONOFRIO

Then i'd like to understand how to make R knows what it has to separate. Maybe that this exact topic have been treated here before (but i didn't find it) so feel free to redirect me if necessary !

Upvotes: 3

Views: 910

Answers (5)

Ronak Shah
Ronak Shah

Reputation: 388982

Base R option using strcapture -

strcapture('NAME:\\s*(.*)\\s*SURNAME:\\s*(.*)\\s*AGE:\\s*(.*)\\s*CITY:\\s*(.*)', 
           infos$Infos, proto = list(NAME = character(), 
           SURNAME = character(), AGE = numeric(), CITY = character()))

#    NAME    SURNAME  AGE CITY
#1 ANGELA      SMITH   22   LA
#2 ANDREW  D'ONOFRIO   47  NYC

Upvotes: 2

G. Grothendieck
G. Grothendieck

Reputation: 269596

1) extract Use extract with the pattern shown. The test data did not have any spaces within the contents of the fields but even if it did this should work.

library(dplyr)
library(tidyr)

pat <- "NAME: *(.*) SURNAME: *(.*) AGE: *(.*) CITY: *(.*)"
dat %>% 
  extract(Infos, c("NAME", "SURNAME", "AGE", "CITY"), pat, convert = TRUE)
##     NAME   SURNAME AGE CITY
## 1 ANGELA     SMITH  22   LA
## 2 ANDREW D'ONOFRIO  47  NYC

2) Base R Alternately using only base R we get this general solution which will continue to work even if the number of columns or their names change. This should also work if there are spaces in the contents of the fields. It works by converting Infos into dcf format followed by read.dcf.

dat |>
  with(gsub("(\\w+:)", "\n\\1", Infos)) |>
  textConnection() |>
  read.dcf() |>
  as.data.frame() |>
  type.convert(as.is = TRUE)
##     NAME   SURNAME AGE CITY
## 1 ANGELA     SMITH  22   LA
## 2 ANDREW D'ONOFRIO  47  NYC

Note

dat in reproducible form:

dat <-
structure(list(Infos = c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA", 
"NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC")), class = "data.frame", row.names = c(NA, 
-2L))

Upvotes: 6

AnilGoyal
AnilGoyal

Reputation: 26218

One more strategy

df <- structure(list(Infos = c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA", 
                           "NAME: ANDREW SURNAME: D'ONOFRIO AGE:47 CITY: NYC")), class = "data.frame", row.names = c(NA, 
                                                                                                                     -2L))
library(tidyverse)

df %>%
  mutate(Infos = gsub('\\:\\s*', ':', Infos)) %>%
  separate_rows(Infos, sep = '\\s') %>%
  separate(Infos, into = c('N', 'V'), sep = ':') %>%
  pivot_wider(names_from = N, values_from = V, values_fn = list) %>%
  unnest(everything())
#> # A tibble: 2 x 4
#>   NAME   SURNAME   AGE   CITY 
#>   <chr>  <chr>     <chr> <chr>
#> 1 ANGELA SMITH     22    LA   
#> 2 ANDREW D'ONOFRIO 47    NYC

Created on 2021-07-15 by the reprex package (v2.0.0)

Upvotes: 2

TarJae
TarJae

Reputation: 78927

Here is another solution using str_squish, str_replace_all and separate

library(dplyr)
library(stringr)
df %>% 
  mutate(Infos = str_squish(str_replace_all(Infos, ":", " "))) %>% 
  separate(Infos, c("helper1", "Name", "helper2", "Surname", "helper3", "Age", "helper4","City"), sep = " ") %>%
  select(-starts_with("helper"))

Output:

  Name   Surname   Age   City 
  <chr>  <chr>     <chr> <chr>
1 ANGELA SMITH     22    LA   
2 ANDREW D'ONOFRIO 47    NYC  

Upvotes: 1

M.Viking
M.Viking

Reputation: 5398

You can insert dummy columns, then delete them.

tibble(dat=c("NAME: ANGELA SURNAME:SMITH AGE:22 CITY: LA", 
             "NAME: ANDREW SURNAME: DONOFRIO AGE:47 CITY: NYC")) %>% 
    separate(dat, c("DEL1", "NAME", "DEL2", "SURNAME", "DEL3", "AGE", "DEL4", "CITY")) %>% 
    select(-DEL1, -DEL2, -DEL3, -DEL4)
 NAME   SURNAME  AGE   CITY 
 ANGELA SMITH    22    LA   
 ANDREW DONOFRIO 47    NYC  

Upvotes: 0

Related Questions