Reputation: 3432
I have a dataframe such as
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 B 20 51
G1 A 51 49
G2 A 200 400
G2 B 1 1600
G2 A 2000 3000
G2 B 4000 5000
The idea is, within each `Groups`, to look at `NAMES` and find the B rows where the `start` and `end` coordinates of an A row fall within the coordinates of that B row. Those B rows should be removed.
For instance, in this example:
Groups NAMES start end
G1 A 1 50 <- A is outside any B coordinate
G1 A 25 45 <- A is **inside** the B coordinate `20-51`, so I remove this B row.
G1 B 20 51
G1 A 51 49 <- A is outside any B coordinate
G2 A 200 400 <- A is **inside** the B coordinate `1-1600`, so I remove this B row.
G2 B 1 1600
G2 A 2000 3000 <- A is outside any B coordinate
G2 B 4000 5000 <- this one does not have any A inside it, so it is kept in the output.
The output should then be:
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 A 51 49
G2 A 200 400
G2 A 2000 3000
G2 B 4000 5000
Does anyone have an idea, please?
Here is the dataframe in dput format, if that helps:
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L,
1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L,
45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA,
-8L))
Upvotes: 3
Views: 75
Reputation: 26218
This can also be done using purrr::map_dfr:
library(tidyverse)
df %>%
  group_split(Groups) %>%
  map_dfr(~ .x %>%
            mutate(r = row_number()) %>%
            # pair every row of the group with every B row of the same group
            full_join(.x %>%
                        filter(NAMES == 'B'),
                      by = 'Groups') %>%
            group_by(r) %>%
            # keep a row if it is a B row, or if it lies strictly inside some B row
            filter(any(NAMES.x == 'B' | start.x > start.y & end.x < end.y)) %>%
            ungroup %>%
            select(Groups, ends_with('.x')) %>%
            distinct %>%
            rename_with(~ gsub('\\.x', '', .), everything())
  )
#> # A tibble: 6 x 4
#> Groups NAMES start end
#> <fct> <fct> <int> <int>
#> 1 G1 A 25 45
#> 2 G1 B 20 51
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 B 1 1600
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v2.0.0)
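Note that the output above keeps every B row plus the A rows lying inside a B, which is not the set of rows listed in the question. As a minimal sketch only, here is one way the same per-group idea might be adapted to return the requested rows. It assumes inclusive bounds (`>=` / `<=`), rebuilds the example data with plain character columns instead of factors, and uses `result_tidy` as a purely illustrative name.

```r
library(dplyr)

# Example data from the question (character columns instead of factors).
df <- data.frame(
  Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"),
  NAMES  = c("A", "A", "B", "A", "A", "B", "A", "B"),
  start  = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L),
  end    = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)
)

result_tidy <- df %>%
  group_by(Groups) %>%
  group_modify(~ {
    # A rows of the current group, used as the reference intervals
    a <- filter(.x, NAMES == "A")
    .x %>%
      rowwise() %>%
      # keep every A row; keep a B row only if no A row lies inside it
      filter(NAMES == "A" || !any(a$start >= start & a$end <= end)) %>%
      ungroup()
  }) %>%
  ungroup()

result_tidy
```

On the example data this should leave the five A rows and the single B row `4000-5000`, in their original order within each group.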
Upvotes: 2
Reputation: 15072
Here's a possible approach. We'll split the df by `NAMES` and join the two parts to each other by `Groups` to do within-group comparisons. Only B rows can get dropped, so those are the only ones whose row numbers we want to keep track of. We can then just group by `rowid` to tag the B rows by whether or not they have any A inside them. Finally, filter to the B rows to keep and concatenate back to the A rows.
library(tidyverse)
df <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA, -8L))
A <- filter(df, NAMES == "A")

B <- df %>%
  filter(NAMES == "B") %>%
  rowid_to_column()

comparison <- inner_join(A, B, by = "Groups") %>%
  mutate(A_in_B = start.x >= start.y & end.x <= end.y) %>%
  group_by(rowid) %>%
  summarise(keep_B = !any(A_in_B))

B %>%
  inner_join(comparison, by = "rowid") %>%
  filter(keep_B) %>%
  select(-rowid, -keep_B) %>%
  bind_rows(A) %>%
  arrange(Groups, NAMES)
#> Groups NAMES start end
#> 1 G1 A 1 50
#> 2 G1 A 25 45
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 A 2000 3000
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v1.0.0)
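For comparison, the same rule can be sketched in base R without any joins. This is only an illustrative variant, not part of the approach above: `b_contains_a` is a hypothetical helper name, the example data is rebuilt with character columns instead of factors, and the bounds are inclusive as in the `A_in_B` test. One difference in behaviour is that a B row in a group with no A rows at all is kept, because `any()` of an empty vector is `FALSE`.

```r
# Example data from the question (character columns instead of factors).
df <- data.frame(
  Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"),
  NAMES  = c("A", "A", "B", "A", "A", "B", "A", "B"),
  start  = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L),
  end    = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)
)

# Hypothetical helper: for each row, TRUE if it is a B row that contains
# at least one A row of the same group (inclusive bounds).
b_contains_a <- function(d) {
  vapply(seq_len(nrow(d)), function(i) {
    if (d$NAMES[i] != "B") return(FALSE)
    a <- d$NAMES == "A" & d$Groups == d$Groups[i]
    any(d$start[a] >= d$start[i] & d$end[a] <= d$end[i])
  }, logical(1))
}

# Drop the flagged B rows; everything else keeps its original order.
result <- df[!b_contains_a(df), ]
rownames(result) <- NULL
result
```

On the example data this keeps the same six rows as the output above. The loop compares each B row with every A row of its group, which is fine for small inputs but would likely be slow on large ones.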
Upvotes: 2