Reputation: 2154
I would like to read a text file which has variable number of columns in R. My text file looks like follows. I would like to read into a data frame where I can just access the individual lines as df1 for the first line, df[2] for the second line and so on.
a 09
abandon* 12 16 19 20 24
abort* 60 61 62
about 10
above 10 41 42
abrupt* 37
absolut* 26
abuse* 12 16 18
abusive 12 16 18
academi* 47 48
accept 08 12 13 15 20 22 39
accepta* 08 12 13 15
accepted 08 12 13 15 20 38
I have tried the following
read.table("myfile",header=T,sep=" ")
but this inserts tab characters.
I have also tried readLines
singleString = readLines("myfile")
but this too inserts \t characters.
EDIT : (Thanks to Dominic Comtois for helping thus far, I have got it to work but still don't know what's going wrong)
Initially the words on the left and the numbers on the right were separate by tabs, not spaces. And so R wasn't able to read it correctly
df = read.table('filename',sep="|") df$V1 = as.character(df$V1) df1,"V1" "a\t09"
So I replaced tabs with spaces in vi editor using the command :1,$s/\t/ /g
I was then able to read it in R using df = read.table('filename',sep="|") but certain lines looked like this (Surprisingly, when I read the file now, a lot of the apostrophes are gone. The first word was spelt as aren't, not it's arent)
df123, "arent 07 39\nargu 12 16 18 31 32\narm 60 61\narms 60 61\narmy 31\naround 10 41\narous 12 60 61\narrange 20\narrive 39 46\narrived 46\narrives 39 46\narriving 46\narrogan 12 16 18\narse 60 61 66\narses 66\narsehole 66\narter 60 61\narthr 60 61\nas 10\nasham 12 16\nashes 57 59\nask 27 29 31 32 39\nasked 27 29 31 32 38\nasking 27 29 31 32\nasks 27 29 31 32 39\nasleep 60 64\nass 60 61 66\nassault 12 16 18\nassembl 31\nasses 66\nasshole 66\nassociatio 47 49\nassum 20 21\nassur 12 13 15 26\nasthma 60 61\nat 10\nate 27 38 60 63\natop 10 41 42\nattachment 12 13 14\nattract 12 13\nauditorium 47 48\naugust 37\naunt 31 35\nautumn 37\naversi 12 16 17\navoid 12 16 20 24\nawake 60 64\naward 12 13 15 47 50\naware 20 22\naway 10\nawesome 12 13\nawful 12 16\nbabe 31 36\nbabies 31 36\nbaby 31 36\nbad 12 16\nband 31 51 55\nbank 56\nbar 60 63\nbarrier 20 24\nbars 60 63\nbaseball 47 48 51 53\nbases 20 21\nbasis 20 21\nbasketball 47 48 51 53\nbastard 12 16 18 66\nbath 51 52 60 65\nbe 40\nbeaten 12 16 18 47 50\nbeaut 12 13\nbecame 20 22 38\nbecause 20 21\nbecome 20 39\nbecomes 20 39\nbecoming 20\nbed 51 52 64\nbeen 38\nbeer 60 63\nbefore 10 37\nbeg 31 32 39\nbegan 37 38\nbegged 31 32 38\nbegging 31 32\nbegin 37 39\nbeginn 37\nbegins 37 39\nbegs 31 32 39\nbelieve 20 22 39\nbelieved 20 22 38\nbelieves 20 22 39\nbelieving 20 22\nbelly 60 61\nbelow 10 41 43\nbeneath 10 41 43\nbenefit 12 13 39\nbenefits 12 13 47 49\nbenign 12 13\nbeside 10 41\nbesides 45\nbest 12 13 15 47 50\nbet 25 39 56\nbets 25 39 56\nbetter 12 13 47 50\nbetting 25 56\nbetween 10 41\nbewilder 12 16 17\nbi 60 62\nbible 57 58\nbicyc 51 53\nbig 41\nbillion 11\nbinge 60 61 63\nbiology 47 48\nbitch 12 16 18 66\nbitter 12 16 18 27\nbladder 60 61\nblam 12 16 18 31 32\nbleed 60 61\nbless 12 13 57 58\nblock 20 24\nblood 60 61\nboard 47 49\nboarder 41\nbodi 60 61\nbody 60 61\nbold 12 13 15\nbone 60 61\nbonus 47 49\nboobs 60 61 62 66\nbook 47 48\nbore 12 16\nboring 12 16\nborrow 56\nboss 47 49\nbother 12 16\nbottom 41 43\nbought 38 56\nbowel 60 61\nboy 31 36\nboy 31 36"
So I wrote these to a new file as
write.table(df[grep("\n",df$V1),"V1"],'newlines')
But since we are writing so many sets of lines, it puts "" after every set. So I searched and replaced the " characters with empty, essentially removing them
I then opened them using the original commamnd and it worked, everything was in it's separate line
df = read.table('newlines',sep="|") df$V1 = as.character(df$V1)
I also opened the file after replacing tabs with spaces in a hex editor and did not see anything peculiar. This is the part from one line before where the problem starts
area 41
aren't 07 39
argu 12 16 18 31 32
arm 60 61
Corresponding hex
61 72 65 61 20 20 34 31 0A 61 72 65 6E 27 74 20 20 30 37 20 20 33 39 0A 61 72 67 75 20 20 31 32 20 20 31 36 20 20 31 38 20 20 33 31 20 20 33 32 0A 61 72 6D 20 20 36 30 20 20 36 31
If anyone would like to access the file, it can be found at http://aftabubuntu.cloudapp.net/LIWC2001_English.dic
Upvotes: 1
Views: 891
Reputation: 10421
If you don't want to have the numbers considered as different "cells" or "fields", you can set sep
as a character that is nowhere in your source file.
For instance:
df1 <- read.table("myfile",sep="|")
As for header=TRUE
, this should be used only if your first line contains the names of your columns. If it's not the case, don't put it in. To skip the first line instead, just use skip=1
.
Then you'll be able to access the individual lines with
df1[1,] # for first line
df1[2,] # for second line
# and so on ...
Upvotes: 2