Avi

Reputation: 2283

Error in read.table duplicate row.names

When I try to read the following table into a data frame (data100) with:

data100 <- read.table(header=TRUE, text='
                                 verb_object SESSION_ID
1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371
3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373
6:   6D944D54C54ED75D487288FE1505BB59 INSERT   41595368
')

I get the following error:
Error in read.table(header = TRUE, text = "\n                               verb_object SESSION_ID\n   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370\n                      BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371\n                         26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282\n                         26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282\n                     2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373\n                         6D944D54C54ED75D487288FE1505BB59 INSERT   41595368\n") : 
  duplicate 'row.names' are not allowed

How can I read it?

After using

lines <- readLines(textConnection("       verb_object SESSION_ID



> data100<-read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='', fill=TRUE)

The result is as follows:

> data100
           V1                               V2       V3       V4 V5                                         V6       V7
1 verb_object                       SESSION_ID                NA                                                     NA
2         1:  BA31C1CC63E5043483FAE25F085E25E5   INSERT 41595370 2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE  41595371
3         3:  26D695C8CA82CAFFDF985201F3AA44D7   UPDATE 41595282 4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE  41595282
4         5:  2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373 6:   6D944D54C54ED75D487288FE1505BB59 INSERT  41595368
> 

Upvotes: 1

Views: 700

Answers (1)

akrun

Reputation: 886948

We can read the text with readLines, use gsub to wrap the two-word field (hash plus verb) in quotes so that it is parsed as a single column, and then pass the result to read.table:

lines <- readLines(textConnection("verb_object SESSION_ID
1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371
3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373
6:   6D944D54C54ED75D487288FE1505BB59 INSERT   41595368"))



read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='')
#                                  verb_object SESSION_ID
#1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT    41595370
#2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE    41595371
#3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE    41595282
#4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE    41595282
#5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE    41595373
#6:   6D944D54C54ED75D487288FE1505BB59 INSERT    41595368
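
As a side note on where the original error seems to come from: judging by the text echoed in the error message, the header line has two fields while each data line has three. When the header has one field fewer than the data, read.table uses the first column as row names, and the value 26D695C8CA82CAFFDF985201F3AA44D7 occurs twice. A minimal sketch of this on a shortened copy of the data (the names txt, con, n_fields and msg are only for illustration):

```r
# Shortened copy of the data as echoed in the error message (no "1:" prefixes)
txt <- "verb_object SESSION_ID
BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282"

# Count whitespace-separated fields per line: header has 2, data lines have 3
con <- textConnection(txt)
n_fields <- count.fields(con)
close(con)
print(n_fields)

# With one header field fewer than data fields, the first column becomes
# the row names; the repeated hash then triggers the error
msg <- tryCatch(
  {
    read.table(text = txt, header = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Quoting the hash and verb together, as done above, gives the data lines the expected number of fields, so the hash is no longer taken as a row name.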

Update

The OP's new dataset can be read with readLines as before:

lines <- readLines(textConnection("items newitem
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3
4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4
5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"))   

Note that the pattern used for the earlier dataset (\\s+(?=\\s[0-9])) won't work here: it relies on the last column starting with a digit, as the 'SESSION_ID' values do, whereas the 'newitem' values start with an uppercase letter. So instead we match one or more characters that are not : from the beginning of the string (^[^:]+), followed by : and one or more spaces (:\\s+). We then capture, as the first group, one or more non-space characters followed by one or more spaces and one or more non-space characters (([^ ]+\\s+[^ ]+)). Next we match one or more spaces (\\s+) and capture the rest of the string as the second group ((.*)$). In the replacement we put quotes around the first capture group ('\\1'), followed by a space and the second capture group (\\2). The header line contains no :, so it does not match and is left unchanged.

read.table(text=gsub("^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$",
         "'\\1' \\2", lines), header=TRUE)
#                                     items newitem
#1   BA31C1CC63E5043483FAE25F085E25E5 INSERT     OV1
#2 BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE     OV2
#3   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE     OV3
#4 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE     OV4
#5   6D944D54C54ED75D487288FE1505BB59 INSERT     OV5
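
To check what the substitution produces before handing the text to read.table, it can help to apply it to single lines. A small sketch, using one data line and the header line from the dataset above (the variable names pattern, data_line, header_line, quoted and unchanged are only for illustration):

```r
# Same pattern and replacement as in the gsub call above
pattern <- "^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$"

data_line   <- "1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1"
header_line <- "items newitem"

# The data line loses its "1:" prefix and gets quotes around hash + verb
quoted <- sub(pattern, "'\\1' \\2", data_line)
print(quoted)
# [1] "'BA31C1CC63E5043483FAE25F085E25E5 INSERT' OV1"

# The header has no ':' so the pattern does not match and nothing changes
unchanged <- sub(pattern, "'\\1' \\2", header_line)
print(identical(unchanged, header_line))
# [1] TRUE
```

After the substitution every line has two fields, which is why read.table with header=TRUE assigns default row numbers instead of treating a column as row names.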

Upvotes: 1
