Reputation: 160

Using R, how can I disaggregate data separated by commas into separate rows?

I have a data frame where certain observations are separated by commas and I would like to separate them into different rows. I know there is a way to do this using the separate_rows function from tidyr, but I have an additional constraint.

Here is code to get my data frame:

dat <- structure(list(cit.num = c("29496, 37063", "29496, 37063", "36706, 36707", 
"36706, 36707"), civ.race = c("Black", "White", "Hispanic", "Hispanic"
), civ.sex = c("Male", "Female", "Female", "Male"), count = c(2L, 
2L, 2L, 2L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))

Here is what the data look like:

  cit.num      civ.race civ.sex count
1 29496, 37063 Black    Male        2
2 29496, 37063 White    Female      2
3 36706, 36707 Hispanic Female      2
4 36706, 36707 Hispanic Male        2

cit.num identifies an individual citizen. I know that 29496 refers to the black male, and 37063 refers to the white female. Is there a way to separate the rows such that the first value is matched with the correct civ.race and civ.sex? Here is my desired output:

  cit.num civ.race civ.sex count
1 29496   Black    Male        2
2 37063   White    Female      2
3 36706   Hispanic Female      2
4 36707   Hispanic Male        2

Upvotes: 0

Answers (4)

Ronak Shah

Reputation: 388982

If we have only two numbers in cit.num we could use separate_rows to get data in different rows and select 1st and 4th row in each cit.num.

library(dplyr)

dat %>%
  mutate(temp = cit.num) %>%
  tidyr::separate_rows(cit.num) %>%
  group_by(temp) %>%
  slice(c(1, 4)) %>%
  ungroup() %>%
  select(-temp)


# cit.num civ.race civ.sex count
#  <chr>   <chr>    <chr>   <int>
#1 29496   Black    Male        2
#2 37063   White    Female      2
#3 36706   Hispanic Female      2
#4 36707   Hispanic Male        2

Upvotes: 0

Ben

Reputation: 30474

Here is a tidyverse alternative. You can separate your cit.num column into 2 columns, first and second.

Then, grouping by this combination, you set cit.num to be either the first or second number (first if the first of the two rows, and second otherwise).

library(tidyverse)

dat %>%
  separate(cit.num, into = c("first", "second")) %>%
  group_by(first, second) %>%
  mutate(cit.num = ifelse(row_number() == 1, first, second)) %>%
  ungroup() %>%
  select(c(-first, -second))

Output

# A tibble: 4 x 4
  civ.race civ.sex count cit.num
  <chr>    <chr>   <int> <chr>  
1 Black    Male        2 29496  
2 White    Female      2 37063  
3 Hispanic Female      2 36706  
4 Hispanic Male        2 36707

Upvotes: 0

Chris Ruehlemann

Reputation: 21400

You could use a for loop:

Key to it is that you define a sequence of uneven numbers:

seq(1, nrow(dat), by = 2)

That sequence you instruct for to loop over:

for(i in seq(1, nrow(dat), by = 2)){
  dat$cit.num[i] <- gsub(", \\d+", "", dat$cit.num[i])
  dat$cit.num[i+1] <- gsub("\\d+, ", "", dat$cit.num[i+1])
}

Output:

dat
  cit.num civ.race civ.sex count
1   29496    Black    Male     2
2   37063    White  Female     2
3   36706 Hispanic  Female     2
4   36707 Hispanic    Male     2

Upvotes: 1

Jonathan V. Solórzano

Reputation: 5620

If you already know the cit.num that should correspond to each combination of civ.race and civ.sex, I think it would be easier to do a join with the corresponding keys. Here is the code to do that using left_join.

library(tidyverse)

keys <- data.frame(civ.race = c("Black","Black","White","White","Hispanic","Hispanic"),
                   civ.sex = c("Male","Female","Male","Female","Male","Female"),
                   cit.num = c(29496,29495,37064,37063,36707,36706),
                   stringsAsFactors = F)

dat %>%
  #Drop you original cit.num column
  select(-cit.num) %>%
  #Do the join using civ.race and civ sex to match the entries in dat and keys
  left_join(keys, 
            by = c("civ.race","civ.sex"))

# A tibble: 4 x 4
# civ.race civ.sex count cit.num
# <chr>    <chr>   <int>   <dbl>
# 1 Black    Male        2   29496
# 2 White    Female      2   37063
# 3 Hispanic Female      2   36706
# 4 Hispanic Male        2   36707

Upvotes: 1

Using R, how can I disaggregate data separated by commas into separate rows?

Answers (4)

Related Questions