Wilcar

Reputation: 2513

R - raw text to a data.frame

I am working with raw text from a scanned catalog, and I want to convert my character vector into a data.frame. The vector is an alphabetical list of people, each followed by the one or more works they performed.
- People's names are in upper case.
- Each work is numbered.
- The numbering of the works is continuous.


AADFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.

Expected result 1:

Author     Work  
AADFDS     1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB         2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED     3 Nunc et eros eget turpis sollicitudin mollis id et mi.
BBDDED     4 Mauris condimentum velit eu consequat feugiat.
BBDDED     5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF    6 Sed cursus augue in tempus scelerisque.
CCDDFSF    7 in commodo enim in laoreet gravida.

Expected result 2, with one column for each work:

Author  |   Work1  |  Work2  |  Work3  |  Work(x)  

The data is imported into R with:

readLines("clipboard", encoding = "latin1")

I can identify the lines containing artist names in capital letters with various regular expressions, e.g.:

^[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']

I can also identify the lines containing works:

^[0-9]+[\s]
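As a rough sketch of how these two patterns might be applied in R: the snippet below uses a few lines from the sample above and a simplified, ASCII-only version of the author pattern (an assumption made for brevity; the accented letters would need to be added back). Note that inside an R string the backslash has to be doubled, so `\s` is written `"\\s"`.

```r
# A few lines from the sample data (normally the result of readLines())
lines <- c("AADFDS",
           "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
           "AB ",
           "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.")

# Work lines: one or more digits followed by whitespace
is_work <- grepl("^[0-9]+\\s", lines)

# Author lines: simplified pattern with ASCII capitals only (assumption)
is_author <- grepl("^[A-Z][A-Z |']", lines)

# Show the classification side by side
data.frame(line = substr(lines, 1, 20), is_author, is_work)
```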

Any help would be greatly appreciated.

Upvotes: 3

Views: 578

Answers (2)

Wilcar

Reputation: 2513

toydata<- readLines("clipboard")

#flag lines beginning with a digit: TRUE for works, FALSE for author names
work_id <- grepl("^[0-9]" , toydata)

#rle finds consecutive runs of identical values within a vector
RLE <- rle(work_id)

#toydata[work_id] keeps only the work lines
#toydata[!work_id] keeps only the author lines (trimws drops stray spaces)
#RLE$lengths[RLE$values] is the number of works per author, so rep()
#repeats each author name once per work
df_toydata <- data.frame(work = toydata[work_id],
                     Author = rep(trimws(toydata[!work_id]),
                                  RLE$lengths[RLE$values]),
                     stringsAsFactors=FALSE)

#we have to order the data.frame by author just in case
#some author appears again
df_toydata=df_toydata[order(df_toydata$Author),]
#we can now add a column with a numbering of each author's works
df_toydata$N=sequence(rle(df_toydata$Author)$lengths)

#reshape from long to wide format
#we pivot the data; rows correspond to authors, columns to works
df2=reshape2::dcast(df_toydata,Author~N,value.var = "work")
colnames(df2)[-1]=paste0("Work",1:(ncol(df2)-1))
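To see what `rle` contributes here, the following is a small self-contained sketch that uses abbreviated versions of the sample lines instead of the clipboard. The names `works_per_author` and `long` are illustrative only and do not appear in the code above; the sketch also assumes that every author line is followed by at least one work.

```r
# Abbreviated sample lines (stand-in for readLines("clipboard"))
toydata <- c("AADFDS", "1 Lorem ipsum",
             "AB ", "2 Nulla sollicitudin",
             "BBDDED", "3 Nunc et eros", "4 Mauris condimentum",
             "5 Suspendisse sit amet",
             "CCDDFSF", "6 Sed cursus", "7 in commodo")

# TRUE for work lines, FALSE for author lines
work_id <- grepl("^[0-9]", toydata)

# Runs alternate between author lines (FALSE) and work lines (TRUE)
RLE <- rle(work_id)

# Keep only the lengths of the TRUE runs: works per author
works_per_author <- RLE$lengths[RLE$values]

# Repeat each (trimmed) author name once per work
long <- data.frame(Author = rep(trimws(toydata[!work_id]), works_per_author),
                   Work = toydata[work_id],
                   stringsAsFactors = FALSE)
long
```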

Upvotes: 2

cryo111

Reputation: 4474

This gives the correct result with your sample data.

txt="
AADFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida."

last_author=""
author_count=0
#the first scan splits the data by line, i.e., sep="\n"
#then for each line, we split by whitespace, i.e., sep=" "
#if the first element is numeric, we increase the
#respective author's work counter "author_count" and
#return the work as a one-row data.frame
#if the first element is non-numeric, we have
#encountered a new author
#we store the new author name in "last_author"
#(and remove trailing whitespaces at the end)
result1=do.call("rbind",
                lapply(as.list(scan(text=txt,
                                    what="character",
                                    sep="\n",
                                    quiet=TRUE)),
                       function(x) {
                         tmp=scan(text=x,what="character",sep=" ",quiet=TRUE)
                         if (grepl("[0-9]",tmp[1])) {
                           author_count<<-author_count+1
                           data.frame(Author=last_author,N=author_count,Work=x)
                         } else {
                           last_author<<-gsub("\\s*$","",x)
                           author_count<<-0
                           NULL
                         }}))

#we pivot the data; rows correspond to authors, columns to works
result2=reshape2::dcast(result1,Author~N,value.var = "Work")
#just renaming the columns
colnames(result2)[-1]=paste0("Work",1:(ncol(result2)-1))
result2
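For comparison, here is a sketch of a vectorised variant that avoids the `<<-` assignments and the `reshape2` dependency. It is a different technique from the loop above: `cumsum` over the author lines assigns each work to its author, and base R `reshape` does the pivot. It assumes the input starts with an author line, and it uses abbreviated sample text; the names `lines`, `group`, `long`, and `wide` are illustrative.

```r
# Abbreviated sample input
txt <- "AADFDS
1 Lorem ipsum
AB
2 Nulla sollicitudin
BBDDED
3 Nunc et eros
4 Mauris condimentum
5 Suspendisse sit amet
CCDDFSF
6 Sed cursus
7 in commodo"

# Split into lines and drop empty ones
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]

# TRUE for work lines, FALSE for author lines
is_work <- grepl("^[0-9]", lines)

# Each author line starts a new group; keep the group index of every work
group <- cumsum(!is_work)[is_work]
authors <- trimws(lines[!is_work])

# Long format: one row per work, N numbers the works within each author
long <- data.frame(Author = authors[group],
                   N = ave(seq_along(group), group, FUN = seq_along),
                   Work = lines[is_work],
                   stringsAsFactors = FALSE)

# Wide format: one row per author, columns Work.1, Work.2, ...
wide <- reshape(long, idvar = "Author", timevar = "N", direction = "wide")
wide
```

Authors with fewer works than the maximum get `NA` in the remaining columns.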

Upvotes: 3
