Reputation: 160
I have a data frame where certain observations are separated by commas and I would like to separate them into different rows. I know there is a way to do this using the separate_rows function from tidyr, but I have an additional constraint.
Here is code to get my data frame:
dat <- structure(list(cit.num = c("29496, 37063", "29496, 37063", "36706, 36707",
"36706, 36707"), civ.race = c("Black", "White", "Hispanic", "Hispanic"
), civ.sex = c("Male", "Female", "Female", "Male"), count = c(2L,
2L, 2L, 2L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
Here is what the data look like:
cit.num civ.race civ.sex count
1 29496, 37063 Black Male 2
2 29496, 37063 White Female 2
3 36706, 36707 Hispanic Female 2
4 36706, 36707 Hispanic Male 2
cit.num identifies an individual citizen. I know that 29496 refers to the black male, and 37063 refers to the white female. Is there a way to separate the rows such that the first value is matched with the correct civ.race and civ.sex? Here is my desired output:
cit.num civ.race civ.sex count
1 29496 Black Male 2
2 37063 White Female 2
3 36706 Hispanic Female 2
4 36707 Hispanic Male 2
Upvotes: 0
Views: 180
Reputation: 388982
If we have only two numbers in cit.num
we could use separate_rows
to get data in different rows and select 1st and 4th row in each cit.num
.
library(dplyr)
dat %>%
mutate(temp = cit.num) %>%
tidyr::separate_rows(cit.num) %>%
group_by(temp) %>%
slice(c(1, 4)) %>%
ungroup() %>%
select(-temp)
# cit.num civ.race civ.sex count
# <chr> <chr> <chr> <int>
#1 29496 Black Male 2
#2 37063 White Female 2
#3 36706 Hispanic Female 2
#4 36707 Hispanic Male 2
Upvotes: 0
Reputation: 30474
Here is a tidyverse
alternative. You can separate your cit.num
column into 2 columns, first
and second
.
Then, grouping by this combination, you set cit.num
to be either the first
or second
number (first
if the first of the two rows, and second
otherwise).
library(tidyverse)
dat %>%
separate(cit.num, into = c("first", "second")) %>%
group_by(first, second) %>%
mutate(cit.num = ifelse(row_number() == 1, first, second)) %>%
ungroup() %>%
select(c(-first, -second))
Output
# A tibble: 4 x 4
civ.race civ.sex count cit.num
<chr> <chr> <int> <chr>
1 Black Male 2 29496
2 White Female 2 37063
3 Hispanic Female 2 36706
4 Hispanic Male 2 36707
Upvotes: 0
Reputation: 21400
You could use a for
loop:
Key to it is that you define a sequence of uneven numbers:
seq(1, nrow(dat), by = 2)
That sequence you instruct for
to loop over:
for(i in seq(1, nrow(dat), by = 2)){
dat$cit.num[i] <- gsub(", \\d+", "", dat$cit.num[i])
dat$cit.num[i+1] <- gsub("\\d+, ", "", dat$cit.num[i+1])
}
Output:
dat
cit.num civ.race civ.sex count
1 29496 Black Male 2
2 37063 White Female 2
3 36706 Hispanic Female 2
4 36707 Hispanic Male 2
Upvotes: 1
Reputation: 5620
If you already know the cit.num that should correspond to each combination of civ.race and civ.sex, I think it would be easier to do a join with the corresponding keys. Here is the code to do that using left_join
.
library(tidyverse)
keys <- data.frame(civ.race = c("Black","Black","White","White","Hispanic","Hispanic"),
civ.sex = c("Male","Female","Male","Female","Male","Female"),
cit.num = c(29496,29495,37064,37063,36707,36706),
stringsAsFactors = F)
dat %>%
#Drop you original cit.num column
select(-cit.num) %>%
#Do the join using civ.race and civ sex to match the entries in dat and keys
left_join(keys,
by = c("civ.race","civ.sex"))
# A tibble: 4 x 4
# civ.race civ.sex count cit.num
# <chr> <chr> <int> <dbl>
# 1 Black Male 2 29496
# 2 White Female 2 37063
# 3 Hispanic Female 2 36706
# 4 Hispanic Male 2 36707
Upvotes: 1