Reputation: 23
Dear all, I have a very large file (14,566,680 records) with 2 variables (ID and A). The first variable (ID) identifies the individual (n = 258), and each individual has 56,460 records (A).
I would like to write out a "transposed" file (i.e. 258 lines and 56,460 columns).
When I execute the following code:
system.time(snp1 %>%
  #filter(`Sample ID`=='8362974') %>%
  select(`Sample ID`, A) %>%
  mutate(id = row_number()) %>%
  #head(n=nsnp) %>%
  pivot_wider(names_from = id,
              values_from = A) -> T)
I got the following error:
Error in rep_len(NA_integer_, n) : invalid 'length.out' value
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow
Timing stopped at: 28.73 0.62 29.36
If I use only one ID, it works correctly.
Best, Stefano
Upvotes: 0
Views: 1604
Reputation: 3090
Does it work if you group the records by individual before calculating the row_number (record ID)?
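Here is a rough sketch of why grouping should matter, using only the sizes given in the question. Without `group_by()`, `row_number()` yields a distinct `id` for each of the 14,566,680 rows, so the wide table would need 258 rows by 14,566,680 columns. That cell count does not fit in a 32-bit integer, which would be consistent with the `nrow * ncol` overflow warning. The variable names below are only for illustration.

```r
# Sizes taken from the question
n_ids     <- 258                 # individuals
n_per_id  <- 56460               # records per individual
n_records <- n_ids * n_per_id    # rows in the long file: 14,566,680

# Ungrouped row_number(): one distinct id per row, so the wide result
# would need n_ids rows and n_records value columns.
cells_ungrouped <- n_ids * n_records
cells_ungrouped > .Machine$integer.max   # TRUE: larger than 2^31 - 1

# Grouped row_number(): ids restart at 1 for every individual, so only
# n_per_id value columns are needed.
cells_grouped <- n_ids * n_per_id
cells_grouped <= .Machine$integer.max    # TRUE
```

The numbers are stored as doubles here, so the product itself does not overflow in this sketch; the point is only the comparison against the integer limit.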
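library(tidyverse)  # provides tibble(), %>%, mutate(), group_by(), pivot_wider()

# made up sample
df <- tibble(`Sample ID` = rep(1:258, each = 56460)) %>%
  mutate(A = rnorm(nrow(.)))

# number the records within each individual, then spread them into columns
df %>%
  group_by(`Sample ID`) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = id,
              values_from = A)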
# A tibble: 258 x 56,461
# Groups: Sample ID [258]
`Sample ID` `1` `2` `3` `4` `5` `6` `7`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.49 0.546 0.0517 -0.480 -0.500 0.266 -1.52
2 2 -0.391 -0.855 -1.28 -0.0277 -0.999 0.617 -0.415
3 3 0.200 0.484 1.08 -0.568 1.16 1.75 -0.143
4 4 0.212 0.371 0.674 0.0481 -1.09 -1.07 0.160
5 5 0.409 1.54 0.931 -0.280 1.27 0.0447 0.426
6 6 -0.936 0.903 -0.0408 0.590 -1.52 -1.14 -0.600
7 7 -1.97 0.336 -0.233 0.488 0.995 -0.933 -1.90
8 8 -0.396 2.12 1.10 0.304 0.290 0.595 -1.32
9 9 -1.31 -0.124 -0.804 -0.447 1.12 -0.721 0.378
10 10 0.977 0.818 1.51 -0.258 -0.00794 0.0386 2.03
# ... with 248 more rows, and 56,453 more variables: ...
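If the pivot is still slow or memory-hungry at full size, one possible alternative (not what the code above does) is to reshape the values into a plain base R matrix. This is only a sketch on toy data, and it assumes that every individual has exactly the same number of records and that the records are already in the desired order within each individual. The names `long` and `wide` are made up for the example.

```r
# Toy long-format data: 3 individuals, 4 records each (made-up values)
long <- data.frame(
  id = rep(c("a", "b", "c"), each = 4),
  A  = 1:12
)

# Assumption: every individual has the same number of records.
counts <- as.vector(table(long$id))
stopifnot(length(unique(counts)) == 1)

# Sort by individual, keeping the original record order within each one.
long <- long[order(long$id, seq_len(nrow(long))), ]

# Fill a matrix row by row: one row per individual, one column per record.
wide <- matrix(long$A, nrow = length(counts), byrow = TRUE,
               dimnames = list(unique(long$id), NULL))
dim(wide)
```

With the real data this would give a 258 x 56,460 matrix, which could then be written out with something like `write.table()`.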
Upvotes: 2