Haakonkas

Reputation: 1041

Reading a one-column text file and converting it to several columns based on an empty-row separator

I have a text file that has several entries, where each entry starts with a number on its first row, and the parts of each entry are separated by empty rows, like this:

1. Title of journal, with DOI and url.

This is the title of entry 1,
which may be over
several lines.

This is the authors of entry 1,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.

Some additional information
may be found here,
such as DOI
and URLs

Lastly, a conflict of interest statement
is the last part of the entry.

2. Title of journal for entry 2, with DOI and url.

This is the title of entry 2,
which may be over
several lines.

This is the authors of entry 2,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info
Author 3 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.

Some additional information
may be found here,
such as DOI
and URLs

Lastly, a conflict of interest statement
is the last part of the entry.

3. Title of journal for entry 3, with DOI and url.

This is the title of entry 3,
which may be over
several lines.

This is the authors of entry 3,
and may be over
several lines.

This is the information on each author:
Author 1 info
Author 2 info
Author 3 info

This is the
abstract of the entry, which
may or may not be long and
over several lines.

What I want to do is to read this file into R and parse it into a data frame, where each separate part of each entry is a column, like this:

Journal title    Title    Authors                    Abstract    Info     Statement
1. Title         Title 1  Author1, author2           Text        Text     Text
2. Title         Title 2  Author1, author2, author3  Text        Text     Text
3. Title         Title 3  Author1, author2, author3  Text

However, I have not found a good solution for this, because R reads each row of the file separately (I use read_delim from readr), which makes the data difficult to work with. I saw a similar question here, but it doesn't involve converting the data into a data frame.

UPDATE: I added a third entry to the example above to show that not all information is available for every entry (I did not know this at the time of writing). The third entry has neither the additional information section nor the conflict of interest statement.

UPDATE2: I've uploaded the text file in question and shared it here: https://easyupload.io/55k9e5

Upvotes: 1

Views: 131

Answers (2)

barboulotte

Reputation: 405

A suggestion:

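text_base <- readLines("d:/temp/file.txt")

# group lines: every empty line starts a new block
group <- cumsum(text_base == "")
text_base <- aggregate(text_base, list(group), function(x) paste(x, collapse = " "))[, 2]
text_base <- stringr::str_squish(text_base)
text_base <- text_base[text_base != ""]  # drop blocks that held only empty lines

# conversion to dataframe
text_base <- data.frame(text_base = text_base)
# a block that starts with "<number>. " opens a new entry
text_base$no <- cumsum(grepl("^[0-9]+\\. ", text_base$text_base))
# name fields by position within each entry (assumes only trailing fields are missing)
fields <- c("Journal_title", "Title", "Authors", "Authors_info", "Abstract", "Info", "Statement")
text_base$field <- fields[ave(text_base$no, text_base$no, FUN = seq_along)]
text_base2 <- reshape2::dcast(text_base, no ~ field, value.var = "text_base")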
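
To make the grouping step easier to follow, here is a minimal sketch on made-up input. The vector x and its contents are invented for illustration, and tapply stands in for aggregate only to keep the example short.

```r
# Toy input (made up): two blocks separated by one empty line
x <- c("1. Journal A", "", "Title of", "entry 1")

# Each empty string bumps the counter, so the lines after it share a new group id
group <- cumsum(x == "")

# Collapse every group into a single string and trim the leading space
blocks <- tapply(x, group, function(v) trimws(paste(v, collapse = " ")))
blocks <- as.vector(blocks)
print(blocks)
```

The empty line itself ends up at the start of the group it opens, which is why the collapsed strings need trimming (str_squish does that in the code above).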

UPDATE

Your references are PubMed references. I think you should have a look at the easyPubMed package: https://www.data-pulse.com/dev_site/easypubmed/.

library(easyPubMed)
library(dplyr)
library(kableExtra)

# Query pubmed and fetch many results
my_query <- "30400810 32539900"
my_query <- get_pubmed_ids(my_query)

# Fetch data
my_abstracts_xml <- fetch_pubmed_data(my_query)  

# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)

# Perform operation (use lapply here, no further parameters)
pm_df <- do.call(rbind, lapply(all_xml, article_to_df, 
                                  max_chars = -1, getAuthors = FALSE))

View(pm_df)

Regards,

Upvotes: 1

Andre Wildberg

Reputation: 19088

I'd suggest a tidy pre-processing step with awk:

Data:

$ cat pub.dat
1. Title of journal, with DOI and url.

This is the title of entry 1,
which may be over
several lines.

This is the authors of entry 1,
... <snip>

Pre-Process:

$ awk 'BEGIN{ n=7; print "Journal title\tTitle\tAuthors\tAuthor Info\tAbstract\tInfo\tStatement" }
  /^[[:digit:]]/{ i=$1*1; j=1; entry[i,j] = $0; ind_i=i; next }  # a leading number starts a new record
  NF<1{ j++; next }                                             # an empty line starts the next field
  { entry[i,j] = entry[i,j] " " $0 }                            # continuation lines are appended
  END{ for(k=1;k<=ind_i;k++){
    for(m=1;m<n;m++){
      printf "%s\t", entry[k,m]                                 # fields a record lacks print as empty
    }
    print entry[k,n]
  }
  }' pub.dat > pub.rdat

For this to work, the overall format of the text must stay as it is: a number at the beginning of each record, an empty line between the sections of a record, and so on.
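
Those format assumptions can be checked in R before any parsing. The following is only a sketch: the function name check_records and its n_fields argument are hypothetical, and it assumes a record starts on a line that begins with a number followed by ". ".

```r
# Hypothetical helper: count the blank-line-separated blocks in each record.
# Assumes a record starts on a line that begins with "<number>. ".
check_records <- function(lines, n_fields = 7) {
  lines <- trimws(lines)
  record <- cumsum(grepl("^[0-9]+\\. ", lines))
  # a block starts at a non-empty line that follows an empty line (or the file start)
  prev_empty <- c(TRUE, head(lines == "", -1))
  block_start <- lines != "" & prev_empty
  n_blocks <- tapply(block_start, record, sum)
  data.frame(record = as.integer(names(n_blocks)),
             n_blocks = as.integer(n_blocks),
             complete = as.integer(n_blocks) == n_fields)
}

# Made-up input: record 1 has three blocks, record 2 only two
toy <- c("1. Journal A", "", "Title 1", "", "Authors 1", "",
         "2. Journal B", "", "Title 2")
res <- check_records(toy, n_fields = 3)
print(res)
```

On the real file, something like check_records(readLines("pub.dat")) should list which records have fewer than seven blocks and therefore end up with empty columns.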

R import:

read.table("pub.rdat", header = TRUE, sep = "\t", quote = "", comment.char = "")
#                                       Journal.title
#1             1. Title of journal, with DOI and url.
#2 2. Title of journal for entry 2, with DOI and url.
#                                                             Title
#1   This is the title of entry 1, which may be over several lines.
#2   This is the title of entry 2, which may be over several lines.
#                                                           Authors
#1   This is the authors of entry 1, and may be over several lines.
#2   This is the authors of entry 2, and may be over several lines.
#                                                                          Author.Info
#1                 This is the information on each author: Author 1 info Author 2 info
#2   This is the information on each author: Author 1 info Author 2 info Author 3 info
#                                                                                   Abstract
#1   This is the abstract of the entry, which may or may not be long and over several lines.
#2   This is the abstract of the entry, which may or may not be long and over several lines.
#                                                                   Info
#1   Some additional information may be found here, such as DOI and URLs
#2   Some additional information may be found here, such as DOI and URLs
#                                                                  Statement
#1   Lastly, a conflict of interest statement is the last part of the entry.
#2   Lastly, a conflict of interest statement is the last part of the entry.

Upvotes: 0
