Reading text file with abnormal delimiter

Question

I am using an algorithm to lemmatize a text vector. The output is a .txt file in the format shown below.

The original word is listed in the first column, whilst the various lemmas are listed in the second column, followed by some grammatical classifications. I want to read this into R, but have no idea how to do this. I have tried various forms of separators, but none seem to work.

Ideally, I want the data frame in R to look like the wanted structure below, where I only read the first occurrence of each lemma.

Perhaps the best option would be to read the data, keep only the first occurrence (i.e. da da adv), then do something like text-to-columns and keep only the first two columns.

Output from lemmatization algorithm:

"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3

Wanted structure:

da      da
dette   dette
er      være
den     den

MrGumble · Accepted Answer

Here's an interesting result: You can read the file quite nicely with read.table:

s <- '"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3
'

# sep='' splits on any run of whitespace; fill=TRUE pads short rows,
# and flush=TRUE discards fields beyond the detected columns
x <- read.table(sep='', text=s, colClasses=c('character','character'), flush=TRUE, fill=TRUE)

> x
        V1    V2   V3
1     <da>
2       da   adv
3       da   sbu
4       da subst fork
5  <dette>
6    dette   det  dem
7    dette  pron nøyt
8    dette  verb  inf
9     <er>
10    være  verb pres
11   <den>
12     den   det  dem
13     den   det  dem
14     den  pron mask

Using packages dplyr and tidyr, we can unpack it into:

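library(dplyr)
library(tidyr)

# a flags the rows containing '<'; cumsum(a) numbers each word's block
(y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>%
  group_by(b) %>%
  summarise(verbs=list(t(unique(V1)))) %>%
  unnest(cols=c(verbs)))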
# A tibble: 4 x 2
      b verbs[,1] [,2]
  <int> <chr>     <chr>
1     1 <da>      da
2     2 <dette>   dette
3     3 <er>      være
4     4 <den>     den

result <- y$verbs
# strip the angle brackets from the original-word column
result[,1] <- gsub('(<|>)', '', result[,1])
result

    [,1]    [,2]   
[1,] "da"    "da"   
[2,] "dette" "dette"
[3,] "er"    "være" 
[4,] "den"   "den"
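
If you would rather avoid the extra packages, the same idea can probably be done in base R with readLines. This is only a sketch: it assumes each original word sits on its own line wrapped in "<...>" (which is what the grepl('<', ...) call above relies on) and that the first lemma is always the very next line. The function name first_lemmas and the file name in the comment are made up for illustration.

```r
# Hypothetical helper: pair each "<word>" line with the first lemma line after it.
first_lemmas <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]               # drop blank lines
  is_token <- grepl('^\\s*"<', lines)                 # lines that start with "<
  quoted <- sub('^\\s*"([^"]*)".*$', '\\1', lines)    # first quoted field per line
  idx <- which(is_token)
  nxt <- idx + 1
  # keep only tokens that are followed by a lemma line (not end of file, not another token)
  ok <- nxt <= length(lines) & !is_token[pmin(nxt, length(lines))]
  data.frame(word  = gsub('^<|>$', '', quoted[idx[ok]]),
             lemma = quoted[nxt[ok]],
             stringsAsFactors = FALSE)
}

# Sample input; with a real file you would use something like
# readLines("lemmas.txt", encoding = "UTF-8")  (file name assumed)
sample_lines <- c('"<da>"',
                  '    "da" adv',
                  '    "da" sbu',
                  '    "da" subst fork',
                  '"<dette>"',
                  '    "dette" det dem nøyt ent',
                  '    "dette" verb inf',
                  '"<er>"',
                  '    "være" verb pres',
                  '"<den>"',
                  '    "den" det dem fem ent')

res <- first_lemmas(sample_lines)
print(res)
```

Unlike the unique(V1) approach, this takes only the line directly after each word, so it should still give one row per word when a word has several different lemmas.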
