Reputation: 1
I'm working with a dataset of author collaborations. Here's a subset of five articles with article ID, corresponding author (RP), corresponding author's country (Country), author full names (AF_1:AF_3), and author short names (AU_1:AU_3):
articles <- structure(list(
ArtID = 1:5,
RP = c("DE GARRIDO, L","CURSEU, PL","HENIKE, T","DI VINCENZO, F","OMIGIE, D"),
Country = c("spain", "romania", "germany", "italy", NA),
AF_1 = c("DE GARRIDO, LUIS","CURSEU, PETRU L.","STIELER, MAXIMILIAN","DI VINCENZO, FAUSTO","OMIGIE, DIANA"),
AF_2 = c(NA,"SCHRUIJER, SANDRA G. L.","HENIKE, TASSILO","IACOPINO, VALENTINA","RICCI, JESSICA"),
AF_3 = c(NA, "FODOR, OANA C.", NA, NA, NA),
AU_1 = c("DE GARRIDO L", "CURSEU PL", "STIELER M","DI VINCENZO F", "OMIGIE D"),
AU_2 = c(NA, "SCHRUIJER SGL", "HENIKE T","IACOPINO V", "RICCI J"),
AU_3 = c(NA, "FODOR OC", NA, NA, NA)),
row.names = c(NA,-5L),class = c("data.frame"))
> articles
ArtID RP Country AF_1 AF_2 AF_3 AU_1 AU_2 AU_3
1 1 DE GARRIDO, L spain DE GARRIDO, LUIS <NA> <NA> DE GARRIDO L <NA> <NA>
2 2 CURSEU, PL romania CURSEU, PETRU L. SCHRUIJER, SANDRA G. L. FODOR, OANA C. CURSEU PL SCHRUIJER SGL FODOR OC
3 3 HENIKE, T germany STIELER, MAXIMILIAN HENIKE, TASSILO <NA> STIELER M HENIKE T <NA>
4 4 DI VINCENZO, F italy DI VINCENZO, FAUSTO IACOPINO, VALENTINA <NA> DI VINCENZO F IACOPINO V <NA>
5 5 OMIGIE, D <NA> OMIGIE, DIANA RICCI, JESSICA <NA> OMIGIE D RICCI J <NA>
When pivoting the data from wide (articles) to long (authors), it automatically copies other column values for all authors (i.e., country). pivot_longer code and data:
authors <- articles %>% pivot_longer(cols=starts_with(c("AF","AU")),names_to=c(".value","ArtAthID"),names_sep="_",values_drop_na=T)
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL romania 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL romania 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F italy 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
However, when pivoting, I want some columns (Country, Uni, Email) to stay with the corresponding author and be NA for the other authors. For example, Country should look like this:
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
I've tried a few things (trying to use matching between RP and AU), but am hitting a wall. Any suggestions for how to do this?
Thanks for any help!
Upvotes: 0
Views: 264
Reputation: 12586
I don't think you can do what you want within pivot_longer
, but I think this gives you what you need for one "additional" variable (Which is all you've given us in your test data...)
articles %>%
group_by(Country) %>%
pivot_longer(
cols=starts_with(c("AF","AU")),
names_to=c(".value","ArtAthID"),
names_sep="_",
values_drop_na=T
) %>%
mutate(Country=ifelse(row_number() == 1, Country, NA)) %>%
ungroup()
# A tibble: 10 × 6
# Groups: Country [5]
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
if you need more than one additional variable, add each to the group_by()
call and then use across()
in the mutate()
. Something like (untested code):
...
%>% mutate(
across(
c(Country, Uni, Email),
function(x) ifelse(row_number() == 1, x, NA)
)
)
Upvotes: 0
Reputation: 160687
I think the pivot is working as designed, as it is keeping the data associated with all authors. What you're asking to do is de-duplicate Country
(and perhaps others) within a particular ArtID
.
Try this:
authors %>%
group_by(ArtID) %>%
mutate(across(c(Country), ~ replace(., duplicated(.), .[NA][1]))) %>%
ungroup()
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
Notes:
Uni
and Email
, but those columns were not in the original data (which is fine). For this reason, I chose to use mutate(across(c(Country), ..))
instead of the more conventional mutate(Country = replace(..))
: include your other columns in that c(.)
vector, perhaps just mutate(across(c(Country, Uni, Email), ~ ...))
.replace
should be clear enough, but the .[NA][1]
is to make sure that the replacement (of NA
) is the same class of NA
as the original column. There are at least six different classes of NA
, and some R tools -- notably many within dplyr
and tidyr
-- complain when trying to combine an integer NA (NA_integer_
) with a logical (NA
) or string (NA_character_
) or real/floating-point (NA_real_
), to name a few. The use of .[NA]
will always give the correct class. The addition of [1]
is to get around the fact that replace(.)
requires the third argument to be the same length as the number of values to replace, not necessarily the same length as the input argument x
; this same-length is relaxed when recycling, so I truncate this to be always length-1.Your desired output for ArtID=3
showed the NA
before "germany"
, which seemed like just a typo in the sample data. If you want it matched (grepl
) between RP
and the other fields, then perhaps this works better (though some articles with this example clear out all countries):
authors %>%
mutate(tmp = mapply(function(x, ...) any(grepl(x, unlist(list(...)))), RP, AF, AU), across(c(Country), ~ if_else(tmp, ., .[NA]))) %>%
select(-tmp)
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL NA 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
This second method is flawed because of the mismatch in author name formats ("HENIKE, T" != "HENIKE T"
, for instance). If you can come up with a better way to match RP
with the other columns, then this may be more to your liking.
Upvotes: 1