Reputation: 2283
When I try to read the following table into a data frame (data100) with:
data100 <- read.table(header=TRUE, text='
verb_object SESSION_ID
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373
6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
')
I get the following error:
Error in read.table(header = TRUE, text = "\n verb_object SESSION_ID\n BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370\n BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371\n 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282\n 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282\n 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373\n 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368\n") :
duplicate 'row.names' are not allowed
How can I read it?
After using
lines <- readLines(textConnection(" verb_object SESSION_ID
> data100<-read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='', fill=TRUE)
The result is as follows:
> data100
V1 V2 V3 V4 V5 V6 V7
1 verb_object SESSION_ID NA NA
2 1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370 2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
3 3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282 4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
4 5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373 6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
>
Upvotes: 1
Views: 700
Reputation: 886948
We can read the lines with readLines, add the quotes with gsub, and then parse the result with read.table:
lines <- readLines(textConnection("verb_object SESSION_ID
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373
6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368"))
read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='')
# verb_object SESSION_ID
#1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
#2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
#3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
#4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
#5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373
#6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
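For context on the error in the question, here is a reduced sketch (an illustration, not the OP's full data; the names txt and msg are made up). When the header has one fewer field than the data rows, read.table uses the first column as row names, so an ID that appears twice triggers the error. The sample assumes the 1:, 2: prefixes were not part of the text actually passed to read.table, as the error message in the question suggests.

```r
# Reduced sample: the header has 2 fields and each data row has 3,
# so read.table treats the first column as row names.
txt <- "verb_object SESSION_ID
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282"

# Capture the error message instead of stopping the script.
msg <- tryCatch({
  read.table(header = TRUE, text = txt)
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

Quoting the ID and the verb together, as done above, gives each data row the same number of fields as the header plus a unique row label, which is why the error goes away.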
The OP's new dataset can be read with readLines as before:
lines <- readLines(textConnection("items newitem
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3
4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4
5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"))
Note that the pattern used for the earlier dataset (\\s+(?=\\s[0-9])) won't work here, because the SESSION_ID values start with a digit, while the newitem values start with an uppercase letter. Instead, we match one or more characters that are not : from the beginning of the string (^[^:]+), followed by :, followed by one or more spaces (\\s+). We then capture the first group in parentheses: one or more non-space characters, one or more spaces, and one or more non-space characters (([^ ]+\\s+[^ ]+)). Next we match one or more spaces (\\s+) and capture the rest of the string as a second group ((.*)$). In the replacement, we put quotes around the first capture group ('\\1'), followed by a space and the second capture group (\\2). The header line contains no :, so the pattern does not match it and it is left unchanged.
read.table(text=gsub("^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$",
"'\\1' \\2", lines), header=TRUE)
# items newitem
#1 BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1
#2 BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2
#3 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3
#4 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4
#5 6D944D54C54ED75D487288FE1505BB59 INSERT OV5
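To see what the gsub step produces before read.table parses it, the substitution can be applied to a single line. This is only a sketch: the names hdr, line, pattern, quoted and df are illustrative, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# The header and one data line from the dataset above.
hdr <- "items newitem"
line <- "1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1"
pattern <- "^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$"

# Drop the "1:" prefix and wrap the first two tokens in quotes.
quoted <- gsub(pattern, "'\\1' \\2", line)
print(quoted)

# The header has no ":", so the pattern leaves it unchanged.
print(gsub(pattern, "'\\1' \\2", hdr))

# read.table now reads the quoted text as a single field.
df <- read.table(text = c(hdr, quoted), header = TRUE,
                 stringsAsFactors = FALSE)
str(df)
```

Because the quotes keep the ID and the verb together, each data line has exactly two fields, matching the two names in the header.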
Upvotes: 1