Reading a text file with no particular format in R

Question

I would like to read a text file that has a variable number of columns into R. My text file looks as follows. I would like to read it into a data frame where I can access the individual lines as df[1] for the first line, df[2] for the second line, and so on.

a       09
abandon*        12      16      19      20      24
abort*  60      61      62
about   10
above   10      41      42
abrupt* 37
absolut*        26
abuse*  12      16      18
abusive 12      16      18
academi*        47      48
accept  08      12      13      15      20      22      39
accepta*        08      12      13      15
accepted        08      12      13      15      20      38

I have tried the following

read.table("myfile",header=T,sep=" ")

but this inserts tab characters.

I have also tried readLines

singleString = readLines("myfile")

but this too inserts characters.

EDIT : (Thanks to Dominic Comtois for helping thus far, I have got it to work but still don't know what's going wrong)

Initially the words on the left and the numbers on the right were separated by tabs, not spaces, and so R wasn't able to read the file correctly.

df = read.table('filename',sep="|")
df$V1 = as.character(df$V1)
df[1,"V1"]
"a 09"

So I replaced the tabs with spaces in the vi editor using the command :1,$s/\t/ /g
I was then able to read it in R using df = read.table('filename',sep="|"), but certain lines looked like this. (Surprisingly, when I read the file now, a lot of the apostrophes are gone. The first word was spelt aren't; now it's arent.)

df[123,]
"arent 07 39 argu 12 16 18 31 32 arm 60 61 arms 60 61 army 31 around 10 41 arous 12 60 61 arrange 20 arrive 39 46 arrived 46 arrives 39 46 arriving 46 arrogan 12 16 18 arse 60 61 66 arses 66 arsehole 66 arter 60 61 arthr 60 61 as 10 asham 12 16 ashes 57 59 ask 27 29 31 32 39 asked 27 29 31 32 38 asking 27 29 31 32 asks 27 29 31 32 39 asleep 60 64 ass 60 61 66 assault 12 16 18 assembl 31 asses 66 asshole 66 associatio 47 49 assum 20 21 assur 12 13 15 26 asthma 60 61 at 10 ate 27 38 60 63 atop 10 41 42 attachment 12 13 14 attract 12 13 auditorium 47 48 august 37 aunt 31 35 autumn 37 aversi 12 16 17 avoid 12 16 20 24 awake 60 64 award 12 13 15 47 50 aware 20 22 away 10 awesome 12 13 awful 12 16 babe 31 36 babies 31 36 baby 31 36 bad 12 16 band 31 51 55 bank 56 bar 60 63 barrier 20 24 bars 60 63 baseball 47 48 51 53 bases 20 21 basis 20 21 basketball 47 48 51 53 bastard 12 16 18 66 bath 51 52 60 65 be 40 beaten 12 16 18 47 50 beaut 12 13 became 20 22 38 because 20 21 become 20 39 becomes 20 39 becoming 20 bed 51 52 64 been 38 beer 60 63 before 10 37 beg 31 32 39 began 37 38 begged 31 32 38 begging 31 32 begin 37 39 beginn 37 begins 37 39 begs 31 32 39 believe 20 22 39 believed 20 22 38 believes 20 22 39 believing 20 22 belly 60 61 below 10 41 43 beneath 10 41 43 benefit 12 13 39 benefits 12 13 47 49 benign 12 13 beside 10 41 besides 45 best 12 13 15 47 50 bet 25 39 56 bets 25 39 56 better 12 13 47 50 betting 25 56 between 10 41 bewilder 12 16 17 bi 60 62 bible 57 58 bicyc 51 53 big 41 billion 11 binge 60 61 63 biology 47 48 bitch 12 16 18 66 bitter 12 16 18 27 bladder 60 61 blam 12 16 18 31 32 bleed 60 61 bless 12 13 57 58 block 20 24 blood 60 61 board 47 49 boarder 41 bodi 60 61 body 60 61 bold 12 13 15 bone 60 61 bonus 47 49 boobs 60 61 62 66 book 47 48 bore 12 16 boring 12 16 borrow 56 boss 47 49 bother 12 16 bottom 41 43 bought 38 56 bowel 60 61 boy 31 36 boy 31 36"

So I wrote these to a new file as

write.table(df[grep(" ",df$V1),"V1"],'newlines')
But since this writes out so many sets of lines, it puts "" after every set. So I searched for the " characters and replaced them with nothing, essentially removing them.
I then opened the result using the original command and it worked; everything was on its own line.

df = read.table('newlines',sep="|")
df$V1 = as.character(df$V1)

I also opened the file in a hex editor after replacing the tabs with spaces and did not see anything peculiar. This is the part starting one line before where the problem begins

area  41
aren't  07  39
argu  12  16  18  31  32
arm  60  61

Corresponding hex

61 72 65 61 20 20 34 31 0A 61 72 65 6E 27 74 20 20 30 37 20 20 33 39 0A 61 72 67 75 20 20 31 32 20 20 31 36 20 20 31 38 20 20 33 31 20 20 33 32 0A 61 72 6D 20 20 36 30 20 20 36 31

If anyone would like to access the file, it can be found at http://aftabubuntu.cloudapp.net/LIWC2001_English.dic

Dominic Comtois · Accepted Answer

If you don't want to have the numbers considered as different "cells" or "fields", you can set sep as a character that is nowhere in your source file.

For instance:

df1 <- read.table("myfile",sep="|")

As for header=TRUE, this should be used only if your first line contains the names of your columns. If it's not the case, don't put it in. To skip the first line instead, just use skip=1.
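
One caveat that may explain the merged lines and vanished apostrophes described in the question's edit: by default read.table treats both " and ' as quote characters, so the apostrophe in a word such as aren't opens a quoted string that runs until the next apostrophe in the file. Passing quote="" turns that off. What follows is only a minimal sketch under assumptions: the sample lines, the temporary file and the name path are illustrative, not taken from the real file.

```r
# Write a small tab-separated sample to a temporary file (illustrative data).
path <- tempfile(fileext = ".dic")
writeLines(c("area\t41", "aren't\t07\t39", "argu\t12\t16", "boy's\t31\t36"), path)

# quote = "" disables quote handling, so apostrophes stay ordinary characters.
# comment.char = "" stops '#' from being treated as the start of a comment.
df1 <- read.table(path, sep = "|", quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

nrow(df1)   # one row per line of the file
df1[2, ]    # the line containing "aren't", apostrophe intact
```

With the default quote setting, the same sample would likely collapse the lines between the two apostrophes into a single field, which matches the symptom in the question.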

Then you'll be able to access the individual lines with

df1[1,] # for first line
df1[2,] # for second line
        # and so on ...
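
Since the goal is only to reach each line individually, a data frame is not strictly required. As an alternative sketch (not part of the answer above), readLines returns one string per line and strsplit can break each one on runs of whitespace. Note that a "\t" in printed output is just how R displays a real tab that is already in the file; nothing is being inserted. The sample data and the names path, fields and codes are assumptions for illustration.

```r
# Illustrative sample using lines from the question, tab-separated.
path <- tempfile()
writeLines(c("a\t09", "abandon*\t12\t16\t19\t20\t24", "abort*\t60\t61\t62"), path)

lines <- readLines(path)

# Split each line on runs of tabs/spaces; the first piece is the word,
# the remaining pieces are its codes.
fields <- strsplit(lines, "[ \t]+")
codes <- lapply(fields, function(x) x[-1])
names(codes) <- vapply(fields, function(x) x[1], character(1))

lines[2]              # the whole second line as one string
codes[["abort*"]]     # the codes for one word
writeLines(lines[1])  # shows the line with a real tab rather than "\t"
```

A list keeps the variable number of codes per word without padding, which a rectangular data frame would need.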
