Reputation: 937
I am relatively new to R. I have a dataframe that has more than 10 million rows that contain 500,000 PMIDs (a type of ID). However, the code I use to run on it can only handle 4000-5000 PMIDs at most. Here is a sample of what the raw dataframe (it's all in one column) looks like:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
As you can see in the example dataframe, there are only 6 PMIDs. So for the sake of the example, let's say I need to make multiple dataframes and each dataframe should only have 2 PMIDs (in my actual code I will probably do around 4000 PMIDs). Thus, I would like to split up my dataframe into 3 different dataframes that look like this (start at one PMID and end before the third PMID comes)
df1:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
df2:
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
df3:
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
Note that the row differences between each PMID is different, so it must be done by string matching PMID. I don't know how to do this on such a large dataset (how do I not manually create the dataframes? for loop?)
Any suggestions would be appreciated.
Upvotes: 0
Views: 129
Reputation: 937
so although the solution of @thelatemail
seemed very promising, it did not work on my dataset. even after I tried the code on a smaller subset of only 1 million rows, it would constantly freeze my computer and I would have to continuously re-start my computer and re-load all the code and large file. perhaps it works better on numerical data or maybe on fewer data or maybe using data.table
or dplyr
or maybe I was just coding it wrong...not sure exactly why I wasn't able to implement it correctly (I would've experimented more, but I want to go home soon hah), but I was able to come up with my own solution:
# shows indices of each PMID
a <- which(grepl("^PMID", df$V1))
a <- as.data.frame(a)
# creates dataframes based on indices from `a` at every 4000 PMID
df1 <- original[c(a[1, 1]:a[4000, 1]), ]
df1 <- as.data.frame(df1)
df2 <- original[c(a[4001, 1]:a[8000, 1]), ]
df2 <- as.data.frame(df2)
etc...until df100, ha. very tedious, but I couldn't come up of a way to not do this manually...perhaps creating a function? regardless, my code ran within seconds, so I'm not complaining. plus the tedious work was just mindless work anyway that actually only took 10-15 minutes.
Upvotes: 0
Reputation: 93813
Make a little counter whenever you hit the start of a new group, then split. Here's a simplified example:
x <- rep(1:3,5)
grpsize <- 2
split(x, (cumsum(x==1)+grpsize-1) %/% grpsize)
#$`1`
#[1] 1 2 3 1 2 3
#
#$`2`
#[1] 1 2 3 1 2 3
#
#$`3`
#[1] 1 2 3
On your full data then you could use grepl
to identify the start of each group:
split(df, (cumsum(grepl("^PMID",df$var)) + grpsize - 1) %/% grpsize)
Arguably you could add the counter as a new column on your dataset and use it as an identifier to go from a long to a wide dataset.
Upvotes: 3