Oda Ned

Reputation: 47

Reading a text file with an unusual delimiter

I am using an algorithm to lemmatize a text vector. The output is a .txt file laid out as shown under "Output from lemmatization algorithm" below.

The original word is listed in the first column, whilst the various lemmas are listed in the second column, followed by some grammatical classifications. I want to read this into R, but have no idea how to do this. I have tried various forms of separators, but none seem to work.

Ideally, I want the data frame in R to look like the wanted structure shown below, keeping only the first occurrence of each lemma.

Perhaps the best option would be to read the data, keep only the first occurrence (i.e. da da adv), then do something like text-to-columns and keep only the first two columns.

Output from lemmatization algorithm:

"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3

Wanted structure:

da      da
dette   dette
er      være
den     den

Upvotes: 0

Views: 107

Answers (2)

Ronak Shah

Reputation: 388982

This worked for me when I copy-pasted the text into a text file:

# Read the file as a character vector, one element per line
data <- readLines('temp.txt')
# A line with no leading whitespace marks the start of a new group
groups <- !startsWith(data, ' ')
# Within each group the 1st element is the group name and the 2nd element
# is the first lemma line; extract the quoted word at the start of that line
value <- tapply(data, cumsum(groups), function(x) 
                     sub('"(\\w+).*', '\\1', trimws(x[2])))
# Build a data frame with the group name and its first lemma
data.frame(groups = data[groups], value)


#    groups value
#1    "<da>"    da
#2 "<dette>" dette
#3    "<er>"  være
#4   "<den>"   den
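
To get from there to exactly the two-column structure asked for (without the quotes and angle brackets), something along these lines may work. It is only a sketch in base R: the `lines` vector stands in for what `readLines()` would return, `between_quotes` is a helper name made up here, and it assumes every unindented word line is immediately followed by at least one indented lemma line.

```r
# Sample lines as they would come back from readLines(); inlined here
# so the sketch runs without a file on disk.
lines <- c(
  '"<da>"',
  '    "da" adv',
  '    "da" sbu',
  '    "da" subst fork',
  '"<dette>"',
  '    "dette" det dem nøyt ent',
  '    "dette" pron nøyt ent pers 3',
  '    "dette" verb inf',
  '"<er>"',
  '    "være" verb pres <aux1/perf_part>',
  '"<den>"',
  '    "den" det dem fem ent',
  '    "den" det dem mask ent',
  '    "den" pron mask fem ent pers 3'
)

# Word lines (the original words) are the ones that are not indented.
is_header <- !grepl("^\\s", lines)

# Assumption: the first lemma line directly follows each word line.
first_analysis <- lines[which(is_header) + 1]

# Hypothetical helper: return whatever sits between the first pair of
# double quotes on each line.
between_quotes <- function(x) sub('^\\s*"([^"]*)".*$', "\\1", x)

lemmas <- data.frame(
  word  = gsub("[<>]", "", between_quotes(lines[is_header])),  # drop < and >
  lemma = between_quotes(first_analysis),
  stringsAsFactors = FALSE
)
lemmas
```

Matching on the quotes rather than on `\\w+` also avoids depending on whether the locale treats characters such as æ as word characters.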

Upvotes: 0

MrGumble

Reputation: 5766

Here's an interesting result: You can read the file quite nicely with read.table:

s <- '"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3
 '

x <- read.table(sep='', text=s, colClasses=c('character','character'), flush=TRUE, fill=TRUE)

> x
        V1    V2   V3
1     <da>           
2       da   adv     
3       da   sbu     
4       da subst fork
5  <dette>           
6    dette   det  dem
7    dette  pron nøyt
8    dette  verb  inf
9     <er>           
10    være  verb pres
11   <den>           
12     den   det  dem
13     den   det  dem
14     den  pron mask

Using packages dplyr and tidyr, we can unpack it into:

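library(dplyr)
library(tidyr)

# Flag the rows holding the original word (they contain '<'), number the
# groups with cumsum, and collect the unique V1 values per group as one row
(y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>% 
  group_by(b) %>% 
  summarise(verbs=list(t(unique(V1)))) %>% 
  unnest(cols=c(verbs)))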
# A tibble: 4 x 2
      b verbs[,1] [,2] 
  <int> <chr>     <chr>
1     1 <da>      da   
2     2 <dette>   dette
3     3 <er>      være 
4     4 <den>     den  

result <- y$verbs
result[,1] <- gsub('(<|>)', '', result[,1])
result


    [,1]    [,2]   
[1,] "da"    "da"   
[2,] "dette" "dette"
[3,] "er"    "være" 
[4,] "den"   "den"
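
If pulling in dplyr and tidyr feels heavy, the same `read.table` result can probably be reduced in base R as well. The following is a sketch, not a tested drop-in: it repeats the `read.table` call from above on the sample text, and the names `word_rows` and `result_df` are made up here. It takes the row directly after each `<...>` row as the first lemma, so it assumes every word has at least one lemma line, and it does not rely on all lemma rows in a group being identical.

```r
s <- '"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3
'

# Same call as in the answer above: whitespace-separated, short rows padded.
x <- read.table(sep = '', text = s, colClasses = c('character', 'character'),
                flush = TRUE, fill = TRUE)

# Rows holding the original word are the ones wrapped in angle brackets.
word_rows <- which(grepl('<', x$V1, fixed = TRUE))

# Assumption: the row right after each word row is its first lemma.
result_df <- data.frame(
  word  = gsub('[<>]', '', x$V1[word_rows]),
  lemma = x$V1[word_rows + 1],
  stringsAsFactors = FALSE
)
result_df
```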

Upvotes: 2
