sweetmusicality
sweetmusicality

Reputation: 937

Creating multiple dataframes out of one based on string search in R

I am relatively new to R. I have a dataframe that has more than 10 million rows that contain 500,000 PMIDs (a type of ID). However, the code I use to run on it can only handle 4000-5000 PMIDs at most. Here is a sample of what the raw dataframe (it's all in one column) looks like:

PMID- 28524368 OT - cardiomyopathy OT - encephalitis LID - 10.1111/jmp.12273 [doi] PL - Denmark PMID- 28523858 OT - Pan troglodytes PST - aheadofprint LID - 10.1111/echo.13561 [doi] STAT- Publisher FAU - Ruivo, Catarina PMID- 52528302 CI - (c) 2017, Wiley Periodicals, Inc. DA - 20170518 OWN - NLM PMID- 18325287 STAT- Publisher OWN - NLM DA - 20170519 LA - eng PMID- 95625132 FAU - Oumerzouk, Jawad JID - 0135232 PL - Australia PMID- 47628853 LA - eng STAT- Publisher AID - 10.1111/jmp.12273 [doi]

As you can see in the example dataframe, there are only 6 PMIDs. So for the sake of the example, let's say I need to make multiple dataframes and each dataframe should only have 2 PMIDs (in my actual code I will probably do around 4000 PMIDs). Thus, I would like to split up my dataframe into 3 different dataframes that look like this (start at one PMID and end before the third PMID comes)

df1: PMID- 28524368 OT - cardiomyopathy OT - encephalitis LID - 10.1111/jmp.12273 [doi] PL - Denmark PMID- 28523858 OT - Pan troglodytes PST - aheadofprint LID - 10.1111/echo.13561 [doi] STAT- Publisher FAU - Ruivo, Catarina

df2: PMID- 52528302 CI - (c) 2017, Wiley Periodicals, Inc. DA - 20170518 OWN - NLM PMID- 18325287 STAT- Publisher OWN - NLM DA - 20170519 LA - eng

df3: PMID- 95625132 FAU - Oumerzouk, Jawad JID - 0135232 PL - Australia PMID- 47628853 LA - eng STAT- Publisher AID - 10.1111/jmp.12273 [doi]

Note that the row differences between each PMID is different, so it must be done by string matching PMID. I don't know how to do this on such a large dataset (how do I not manually create the dataframes? for loop?)

Any suggestions would be appreciated.

Upvotes: 0

Views: 129

Answers (2)

sweetmusicality
sweetmusicality

Reputation: 937

so although the solution of @thelatemail seemed very promising, it did not work on my dataset. even after I tried the code on a smaller subset of only 1 million rows, it would constantly freeze my computer and I would have to continuously re-start my computer and re-load all the code and large file. perhaps it works better on numerical data or maybe on fewer data or maybe using data.table or dplyr or maybe I was just coding it wrong...not sure exactly why I wasn't able to implement it correctly (I would've experimented more, but I want to go home soon hah), but I was able to come up with my own solution:

# shows indices of each PMID
a <- which(grepl("^PMID", df$V1))
a <- as.data.frame(a)

# creates dataframes based on indices from `a` at every 4000 PMID
df1 <- original[c(a[1, 1]:a[4000, 1]), ]
df1 <- as.data.frame(df1)

df2 <- original[c(a[4001, 1]:a[8000, 1]), ]
df2 <- as.data.frame(df2)

etc...until df100, ha. very tedious, but I couldn't come up of a way to not do this manually...perhaps creating a function? regardless, my code ran within seconds, so I'm not complaining. plus the tedious work was just mindless work anyway that actually only took 10-15 minutes.

Upvotes: 0

thelatemail
thelatemail

Reputation: 93813

Make a little counter whenever you hit the start of a new group, then split. Here's a simplified example:

x <- rep(1:3,5)
grpsize <- 2
split(x, (cumsum(x==1)+grpsize-1) %/% grpsize)
#$`1`
#[1] 1 2 3 1 2 3
#
#$`2`
#[1] 1 2 3 1 2 3
#
#$`3`
#[1] 1 2 3

On your full data then you could use grepl to identify the start of each group:

split(df, (cumsum(grepl("^PMID",df$var)) + grpsize - 1) %/% grpsize)

Arguably you could add the counter as a new column on your dataset and use it as an identifier to go from a long to a wide dataset.

Upvotes: 3

Related Questions