Reputation: 1041
I have a text file that has several entries, where each entry starts with a number on its first row, and each part of the entry is separated by an empty row, like this:
1. Title of journal, with DOI and url.

This is the title of entry 1,
which may be over
several lines.

This is the authors of entry 1,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.

Some additional information
may be found here,
such as DOI
and URLs

Lastly, a conflict of interest statement
is the last part of the entry.

2. Title of journal for entry 2, with DOI and url.

This is the title of entry 2,
which may be over
several lines.

This is the authors of entry 2,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info
Author 3 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.

Some additional information
may be found here,
such as DOI
and URLs

Lastly, a conflict of interest statement
is the last part of the entry.

3. Title of journal for entry 3, with DOI and url.

This is the title of entry 3,
which may be over
several lines.

This is the authors of entry 3,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info
Author 3 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.
What I want to do is to read this file into R and parse it into a data frame, where each separate part of each entry is a column, like this:
Journal title   Title     Authors                     Abstract   Info   Statement
1. Title        Title 1   Author1, author2            Text       Text   Text
2. Title        Title 2   Author1, author2, author3   Text       Text   Text
3. Title        Title 3   Author1, author2, author3   Text
However, I have not managed to find a good solution for this, as each row in the file is read separately by R (I use read_delim from readr), making it difficult to work with. I saw a similar question here, but it doesn't involve converting the data into a data frame.
UPDATE: I added an additional entry to the example above to highlight that not all information is available for all entries (I did not know this at the time of writing). The third entry above highlights this by not having the conflict of interest section and the additional information section.
UPDATE2: I've uploaded the text file in question and shared it here: https://easyupload.io/55k9e5
Upvotes: 1
Views: 131
Reputation: 405
A suggestion:
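# read the file line by line (adjust the path to your own file)
text_base <- readLines("d:/temp/file.txt")
# group lines: every empty line starts a new block
group <- cumsum(text_base == "")
# collapse each block into one string and squeeze out extra whitespace
text_base <- aggregate(text_base, list(group), function(x) paste(x, collapse = " "))[, 2]
text_base <- stringr::str_squish(text_base)
# conversion to dataframe (assumes every entry has exactly 7 blocks)
text_base <- as.data.frame(text_base)
text_base$no <- ceiling(seq_len(nrow(text_base)) / 7)
text_base$field <- rep(c("Journal_title", "Title", "Authors", "Authors_info", "Abstract", "Info", "Statement"), length.out = nrow(text_base))
# reshape from long to wide: one row per entry, one column per field
text_base2 <- reshape2::dcast(text_base, no ~ field, value.var = "text_base")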
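The code above relies on every entry having exactly seven blocks, which the third entry in the question does not. A minimal base R sketch of a variant that tolerates this is shown here. It assumes that an entry starts at a line such as "1. ..." that is either the first line or follows an empty line, and that any missing blocks are the trailing ones (as in entry 3); a block missing from the middle would shift the later columns. The function name parse_entries is hypothetical, and the field names are the ones used above.

```r
# Hypothetical helper: parse numbered, blank-line-separated entries into a data frame.
parse_entries <- function(lines,
                          fields = c("Journal_title", "Title", "Authors", "Authors_info",
                                     "Abstract", "Info", "Statement")) {
  lines <- trimws(lines)
  # Assumption: an entry starts at a "<number>. " line that is first or follows an empty line
  prev_blank <- c(TRUE, head(lines == "", -1))
  is_start <- grepl("^[0-9]+\\. ", lines) & prev_blank
  entry_id <- cumsum(is_start)
  keep <- entry_id > 0  # drop anything before the first entry
  rows <- lapply(split(lines[keep], entry_id[keep]), function(x) {
    # Within an entry, every empty line starts a new block
    block_id <- cumsum(x == "")
    nonempty <- x != ""
    blocks <- tapply(x[nonempty], block_id[nonempty], paste, collapse = " ")
    blocks <- unname(as.vector(blocks))
    # Pad missing trailing blocks with NA (extra blocks, if any, are cut off)
    length(blocks) <- length(fields)
    blocks
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- fields
  rownames(out) <- NULL
  out
}

# Small made-up sample: entry 2 lacks the last two blocks
sample_lines <- c(
  "1. Journal one.", "", "Title one,", "over two lines.", "", "Author A, Author B", "",
  "Author information:", "Author A info", "", "Abstract text.", "",
  "Some additional information", "", "No conflicts.", "", "",
  "2. Journal two.", "", "Title two.", "", "Author C", "",
  "Author information:", "", "Abstract only."
)
df <- parse_entries(sample_lines)
```

For the real file, something like parse_entries(readLines("d:/temp/file.txt")) should give one row per entry, with NA in the columns an entry does not have.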
UPDATE
Your references are PubMed references. I think you should have a look at the easyPubMed package: https://www.data-pulse.com/dev_site/easypubmed/.
library(easyPubMed)
library(dplyr)
library(kableExtra)
# Query pubmed and fetch many results
my_query <- "30400810 32539900"
my_query <- get_pubmed_ids(my_query)
# Fetch data
my_abstracts_xml <- fetch_pubmed_data(my_query)
# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)
# Perform operation (use lapply here, no further parameters)
pm_df <- do.call(rbind, lapply(all_xml, article_to_df,
max_chars = -1, getAuthors = FALSE))
View(pm_df)
Regards,
Upvotes: 1
Reputation: 19088
I'd suggest a pre-processing tidy-up step with awk:
Data:
$ cat pub.dat
1. Title of journal, with DOI and url.
This is the title of entry 1,
which may be over
several lines.
This is the authors of entry 1,
... <snip>
Pre-Process:
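$ awk 'BEGIN{ ncol=7; print "Journal title\tTitle\tAuthors\tAuthor Info\tAbstract\tInfo\tStatement" }
# a line starting with a number opens a new record
/^[[:digit:]]/{ i=$1*1; j=1; entry[i,j] = $0; ind_i=i; next }
# an empty line moves on to the next column
NF<1{ j++; next }
# any other line is appended to the current column
{ entry[i,j] = (entry[i,j] == "" ? $0 : entry[i,j] " " $0) }
# print one tab-separated row per record, padding missing columns with empty fields
END{ for(k=1;k<=ind_i;k++){
for(m=1;m<ncol;m++){
printf "%s\t", entry[k,m]
}
print entry[k,ncol]
}
}' pub.dat > pub.rdat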
Here, it's important that the overall format of the text stays as it is, e.g. a number at the beginning of each record, an empty line between the sections that become columns, and no other line that starts with a digit.
R import:
read.table("pub.rdat", header = TRUE, sep = "\t", quote = "", comment.char = "")
# Journal.title
#1 1. Title of journal, with DOI and url.
#2 2. Title of journal for entry 2, with DOI and url.
# Title
#1 This is the title of entry 1, which may be over several lines.
#2 This is the title of entry 2, which may be over several lines.
# Authors
#1 This is the authors of entry 1, and may be over several lines.
#2 This is the authors of entry 2, and may be over several lines.
# Author.Info
#1 This is the information on each author: Author 1 info Author 2 info
#2 This is the information on each author: Author 1 info Author 2 info Author 3 info
# Abstract
#1 This is the abstract of the entry, which may or may not be long and over several lines.
#2 This is the abstract of the entry, which may or may not be long and over several lines.
# Info
#1 Some additional information may be found here, such as DOI and URLs
#2 Some additional information may be found here, such as DOI and URLs
# Statement
#1 Lastly, a conflict of interest statement is the last part of the entry.
#2 Lastly, a conflict of interest statement is the last part of the entry.
Upvotes: 0