Niek de Klein

Reputation: 8824

When I read in a large table using fread it slightly changes the numbers in one of the columns

I have a large file that looks like this

region              type    coeff      p-value  distance    count
82365593523656436   A      -0.9494     0.050    -16479472.5 8
82365593523656436   B      0.47303     0.526    57815363.0  8
82365593523656436   C      -0.8938     0.106    42848210.5  8

When I read it in using fread, suddenly 82365593523656436 is not found anymore

correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE

I can find a slightly different number

> "82365593523656432" %in% correlations$region
[1] TRUE

but this number is not in the actual file

grep 82365593523656432 all_to_all_correlations.txt 

gives no results, while

grep 82365593523656436 all_to_all_correlations.txt 

does.

When I try to read in the small sample file I showed above instead of the full file I get

Warning message:
In fread("test.txt") :
  Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) to obtain the integer64 print method and print the data again.

and the data looks like

     region type    coeff       p.value  distance      count
1 3.758823e-303    A -0.94940   0.050    -16479472     8
2 3.758823e-303    B  0.47303   0.526     57815363     8
3 3.758823e-303    C -0.89380   0.106     42848210     8

So I think during reading 82365593523656436 was changed into 82365593523656432. How can I prevent this from happening?
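
One way to check this suspicion without data.table is to see what happens when the ID is stored as an ordinary double. This is only a sketch in base R; the variable names are made up for illustration, and it does not prove where in the pipeline the conversion happened, only that a double cannot hold this value exactly.

```r
# Doubles have a 53-bit mantissa, so not every integer above 2^53 is representable.
id_text   <- "82365593523656436"
id_double <- as.numeric(id_text)

# Print the stored value in full, without scientific notation.
stored <- sprintf("%.0f", id_double)
print(stored)                 # "82365593523656432"

# The text and the stored double no longer agree.
print(identical(id_text, stored))   # FALSE

# Same effect at the boundary: 2^53 + 1 collapses onto 2^53.
print(2^53 + 1 == 2^53)       # TRUE
```

The rounded value is exactly the one that `%in%` finds, which fits the idea that the column passed through a double at some point.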

Upvotes: 1

Views: 1784

Answers (1)

Roland

Reputation: 132706

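The value has 17 digits, which is more than a double can hold exactly (integers are only guaranteed exact up to 2^53), so the last digits can change as soon as the column is treated as an ordinary number. IDs (and that's apparently what the first column is) should usually be read as characters: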

correlations <- setDF(fread('region              type    coeff      p-value  distance    count
                                 82365593523656436   A      -0.9494     0.050    -16479472.5 8
                                 82365593523656436   B      0.47303     0.526    57815363.0  8
                                 82365593523656436   C      -0.8938     0.106    42848210.5  8',
                            colClasses = c(region = "character")))
str(correlations)
#'data.frame':  3 obs. of  6 variables:
# $ region  : chr  "82365593523656436" "82365593523656436" "82365593523656436"
# $ type    : chr  "A" "B" "C"
# $ coeff   : num  -0.949 0.473 -0.894
# $ p-value : num  0.05 0.526 0.106
# $ distance: num  -16479473 57815363 42848211
# $ count   : int  8 8 8
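
The same idea applied to a file on disk might look like the sketch below. The temporary file is a hypothetical stand-in for all_to_all_correlations.txt, and the object names are invented for the example. It shows two options that fread offers: naming the ID column in colClasses, or using the integer64 argument so that every column detected as integer64 comes back as character.

```r
library(data.table)

# Hypothetical stand-in for the real file, written to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "region type coeff p-value distance count",
  "82365593523656436 A -0.9494 0.050 -16479472.5 8",
  "82365593523656436 B 0.47303 0.526 57815363.0 8",
  "82365593523656436 C -0.8938 0.106 42848210.5 8"
), path)

# Option 1: read the named ID column as character.
by_name <- fread(path, colClasses = c(region = "character"))

# Option 2: read every column that would be integer64 as character instead.
by_type <- fread(path, integer64 = "character")

# The original ID is found again because no digits were lost.
print("82365593523656436" %in% by_name$region)
print(class(by_type$region))
```

Option 1 is the safer choice when you know which column holds the IDs. Option 2 is convenient when several columns contain large integers, but it only affects columns that fread itself detects as integer64.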

Upvotes: 1
