Reputation: 83
I have a .csv that causes different problems with read.table() and fread().
There is an unknown character that causes read.table() to stop (reminiscent of read.csv stops reading at row 523924 even thouhg the file has 799992 rows). Excel, Notepad, and SAS System Viewer render it like a rightwards arrow (although if I use Excel's insert symbol to insert u2192 it appears different); emacs renders it ^Z.
fread() gets past the unknown character (bringing it in as \032) but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, thus embedded (even mismatched) quotes.
Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?
Thanks in advance!
Upvotes: 1
Views: 1139
Reputation: 42544
In case of Paul's file, I was able to read the file (after some experimentation) using fread()
with the cmd "unzip -cq"
and quote = ""
parameters without error or warnings. I suppose that this might work as well with Kristian's file.
On Windows, it might be necessary to install the Rtools beforehand.
library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
"extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")
Caveat: This will download a file of 38 MBytes
According to the unzip
man page, the -c
option automatically performs ASCII-EBCDIC conversion.
The quote = ""
was required because in at least one case a data field contained double quotes within the text.
I have also tried the -p
option of unzip
which extracts the data without conversion. Then, we can see that there is \032
embedded in the string.
txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")
txt2[47096, 1:2]
CUST-ID LEGAL-NAME 1: 1253096 JOHN A. GIANNAKOP\032OULOS
The \032
does not appear in the converted version
txt1[47096, 1:2]
CUST-ID LEGAL-NAME 1: 1253096 JOHN A. GIANNAKOPOULOS
We can search for all occurrences of \032
in all character fields by
melt(txt2, id.vars = "CUST-ID", measure.vars = txt[, names(.SD), .SDcols = is.character])[
value %flike% "\032"][order(`CUST-ID`)]
CUST-ID variable value 1: 1253096 LEGAL-NAME JOHN A. GIANNAKOP\032OULOS 2: 2050751 DBA-NAME colbert ball tax tele\032hone rd 3: 2082166 LEGAL-NAME JUAN DE J. MORALES C\032TALA 4: 2273606 LEGAL-NAME INTRINSIC DM\032 INC. 5: 2300016 MAIL-ADDR1 PO BOX \03209 6: 2346154 LEGAL-NAME JOEL I GONZ\032LEZ-APONTE CPA 7: 2384445 LEGAL-NAME NUMBERS CAF\032 PLLC 8: 2518214 MAIL-ADDR1 556 W 800 N N\03211 9: 2518214 BUSN-ADDR1 556 W 800 N N\03211 10: 13718109 DBA-NAME World Harvest Financial Grou\032 11: 13775763 LEGAL-NAME Fiscally Responsible Consulting LLC\032 12: 13775763 DBA-NAME Fiscally Responsible Consulting LLC\032
This may help to identify the records of the file to fix manually.
Upvotes: 1
Reputation: 2255
I hit this problem today, so its still there in R 4.0.5
The data I'm using is public, from the Internal Revenue service. Somehow the unrecognized characters become "^Z" in the database. So far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by original program that receives. The IRS distributes a CSV file from the database.
In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one-at-a-time lets R read.table
get a little further. I found no encoding setting in R that made a difference on this problem.
I found 2 solutions.
Get rid of the "^Z" symbol with text tools before using read.csv.
Switch to Python. The pandas
package function read_csv
, with encoding as "utf-8" correctly obtains all rows. However, in the pandas.DataFrame
that results, the unrecognized character is in the data, it looks like an empty square.
If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you find is line 47096.
Upvotes: 0