Kristian Omland
Kristian Omland

Reputation: 83

read.csv - unknown character and embedded quotes

I have a .csv that causes different problems with read.table() and fread().

There is an unknown character that causes read.table() to stop (reminiscent of read.csv stops reading at row 523924 even thouhg the file has 799992 rows). Excel, Notepad, and SAS System Viewer render it like a rightwards arrow (although if I use Excel's insert symbol to insert u2192 it appears different); emacs renders it ^Z.

fread() gets past the unknown character (bringing it in as \032) but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, thus embedded (even mismatched) quotes.

Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?

Thanks in advance!

Upvotes: 1

Views: 1139

Answers (2)

Uwe
Uwe

Reputation: 42544

In case of Paul's file, I was able to read the file (after some experimentation) using fread() with the cmd "unzip -cq" and quote = "" parameters without error or warnings. I suppose that this might work as well with Kristian's file.

On Windows, it might be necessary to install the Rtools beforehand.

library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
              "extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")

Caveat: This will download a file of 38 MBytes

According to the unzip man page, the -c option automatically performs ASCII-EBCDIC conversion.

The quote = "" was required because in at least one case a data field contained double quotes within the text.


I have also tried the -p option of unzip which extracts the data without conversion. Then, we can see that there is \032 embedded in the string.

txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")

txt2[47096, 1:2]
   CUST-ID                 LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOP\032OULOS

The \032 does not appear in the converted version

txt1[47096, 1:2]
   CUST-ID             LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOPOULOS

We can search for all occurrences of \032 in all character fields by

melt(txt2, id.vars = "CUST-ID", measure.vars = txt[, names(.SD), .SDcols = is.character])[
  value %flike% "\032"][order(`CUST-ID`)]
     CUST-ID   variable                                    value
 1:  1253096 LEGAL-NAME               JOHN A. GIANNAKOP\032OULOS
 2:  2050751   DBA-NAME         colbert ball tax tele\032hone rd
 3:  2082166 LEGAL-NAME             JUAN DE J. MORALES C\032TALA
 4:  2273606 LEGAL-NAME                   INTRINSIC DM\032  INC.
 5:  2300016 MAIL-ADDR1                            PO BOX \03209
 6:  2346154 LEGAL-NAME           JOEL I GONZ\032LEZ-APONTE  CPA
 7:  2384445 LEGAL-NAME                    NUMBERS CAF\032  PLLC
 8:  2518214 MAIL-ADDR1                      556 W 800 N N\03211
 9:  2518214 BUSN-ADDR1                      556 W 800 N N\03211
10: 13718109   DBA-NAME         World Harvest Financial Grou\032
11: 13775763 LEGAL-NAME Fiscally Responsible Consulting  LLC\032
12: 13775763   DBA-NAME Fiscally Responsible Consulting  LLC\032

This may help to identify the records of the file to fix manually.

Upvotes: 1

pauljohn32
pauljohn32

Reputation: 2255

I hit this problem today, so its still there in R 4.0.5

The data I'm using is public, from the Internal Revenue service. Somehow the unrecognized characters become "^Z" in the database. So far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by original program that receives. The IRS distributes a CSV file from the database.

In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one-at-a-time lets R read.table get a little further. I found no encoding setting in R that made a difference on this problem.

I found 2 solutions.

  1. Get rid of the "^Z" symbol with text tools before using read.csv.

  2. Switch to Python. The pandas package function read_csv, with encoding as "utf-8" correctly obtains all rows. However, in the pandas.DataFrame that results, the unrecognized character is in the data, it looks like an empty square.

If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you find is line 47096.

Upvotes: 0

Related Questions