R: Issues reading in tab delimited files

Question

Apologies in advance for the simple question. I am having trouble reading a tab delimited file. R contends that there are missing elements on line 164 but I cannot see why. When I copy and paste into Excel, it separates just fine.

Data:

  temp <- tempfile()
  download.file("https://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM527389.zip",temp)

I have tried

df <- read.table(unz(temp, "Products.txt"), sep="\t", header=TRUE)

and

df <- read.table(unz(temp, "Products.txt"), sep="\t", fill=TRUE, header=TRUE)

Both fail on the same line.

Parfait · Accepted Answer

Consider read.delim, which, like read.csv, is one of the wrappers around the more general read.table function in the built-in utils package.

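It appears the longer fields, DrugName and ActiveIngredient, have issues with quotes and blank lines, so the fill, quote, and comment.char arguments need to be adjusted. read.delim already sets them suitably: it defaults to quote = "\"", fill = TRUE, and comment.char = "", whereas read.table defaults to quote = "\"'", fill = FALSE, and comment.char = "#".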
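One way to see which lines the parser considers short is count.fields from utils, which takes the same sep, quote, and comment.char arguments as read.table. The sketch below is only an illustration on a hypothetical four-line sample (the rows are made up, not taken from Products.txt). It assumes the trouble comes from a # inside a field; in the real file a quote character may be the culprit instead.

```r
# Hypothetical sample, not the FDA data: file line 3 has a "#" inside a field.
sample_lines <- c(
  "ApplNo\tDrugName\tStrength",
  "1\tPLAIN DRUG\t5MG",
  "2\tDRUG #2\t10MG",
  "3\tOTHER DRUG\t20MG"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Fields per line with the defaults shared with read.table:
# "#" starts a comment, so the rest of that line is dropped.
n_default <- count.fields(path, sep = "\t")

# Fields per line with read.delim-style settings: only the double quote
# acts as a quote, and no character starts a comment.
n_relaxed <- count.fields(path, sep = "\t", quote = "\"", comment.char = "")

# File lines (the header is line 1) whose count changes between the settings.
suspect <- which(n_default != n_relaxed)
print(suspect)
```

On the real data, the same calls could be pointed at unz(temp, "Products.txt"); lines whose field count differs from the header's are the ones worth inspecting.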

df <- read.delim(unz(temp, "Products.txt"), sep="\t", header=TRUE)

With structure output:

str(df)
# 'data.frame': 37850 obs. of  8 variables:
#  $ ApplNo           : int  4 159 552 552 552 552 552 552 552 552 ...
#  $ ProductNo        : num  4 1 1 2 3 4 5 7 8 9 ...
#  $ Form             : Factor w/ 348 levels "AEROSOL, FOAM;RECTAL",..: 203 331 121 121 121 121 121 121 121 121 ...
#  $ Strength         : Factor w/ 4065 levels ""," EQ 5MG BASE/ML",..: 525 2491 1453 2240 2447 538 654 670 538 2447 ...
#  $ ReferenceDrug    : int  0 0 0 0 0 0 0 0 0 0 ...
#  $ DrugName         : Factor w/ 7161 levels "8-HOUR BAYER",..: 4773 6039 3547 3547 3547 3547 3547 3546 2796 2796 ...
#  $ ActiveIngredient : Factor w/ 2735 levels "ABACAVIR SULFATE",..: 1372 2446 1305 1305 1305 1305 1305 1305 1305 1305 ...
#  $ ReferenceStandard: int  0 0 0 0 0 0 0 0 0 0 ...

Equivalently with read.table, adjusting default values in arguments:

df <- read.table(unz(temp, "Products.txt"), sep="\t", quote = "\"", fill = TRUE,
                 comment.char = "", header = TRUE)

For comparison:

df1 <- read.table(unz(temp, "Products.txt"), sep="\t", quote = "\"", fill = TRUE,
                  comment.char = "", header = TRUE)
df2 <- read.delim(unz(temp, "Products.txt"), sep="\t", header = TRUE)

all.equal(df1, df2)
# [1] TRUE

identical(df1, df2)
# [1] TRUE
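
Because the comparison above depends on the FDA download still being available, here is a self-contained sketch of the same check on a hypothetical sample file. The rows are invented for illustration, and stringsAsFactors = FALSE is added only so the result does not depend on the R version.

```r
# Hypothetical sample, not the FDA data: an apostrophe and a "#" inside fields.
sample_lines <- c(
  "ApplNo\tDrugName\tStrength",
  "1\tBAYER'S ASPIRIN\t325MG",
  "2\tDRUG #2\t10MG",
  "3\tPLAIN DRUG\t5MG"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# read.table with the three defaults overridden to match read.delim.
df1 <- read.table(path, sep = "\t", quote = "\"", fill = TRUE,
                  comment.char = "", header = TRUE, stringsAsFactors = FALSE)

# read.delim already uses those settings.
df2 <- read.delim(path, stringsAsFactors = FALSE)

print(identical(df1, df2))
print(df2$DrugName)
```

With these settings the apostrophe and the # are kept as ordinary characters, so all three data rows are read with their fields intact.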
