statsNoob
statsNoob

Reputation: 1355

keeping leading zeros with sqldf

I am a total SQL ignoramus so I apologize if this is very simple..

I have data that contains an ID column consisting of numbers, and in many cases contains leading zeros. I would like to import the data using sqldf, but in doing so I lose the leading zeros for these. Is there a way to keep the leading zeros? Maybe by somehow specifying that all columns are character classes like in R's read.table?

I can't share my data due to the nature of my work, but I am doing something like this:

a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")

f2 <- file(f1)
mydat <- sqldf::sqldf("SELECT * FROM f2", dbname=tempfile(), 
                      file.format=list(header=FALSE, sep="|", eol="\n", skip=1))
mydat

Also, I would like to add that the length is not the same for all of these IDs. If possible, I would like to avoid having to manually pad the data with zeros after the fact..

Upvotes: 0

Views: 463

Answers (2)

Parfait
Parfait

Reputation: 107652

One way to run leading zeros is using SQL string functions. Just impose an amount of zeros higher than your desired string length, concatenate with your actual ID field, and strip from the rightmost character the specified length of column you require. Below uses 8 characters as string length:

mydat <- sqldf::sqldf("select rightstr('0000000000000' || ID, 8) As LeadZeroID, 
                              * from f2;", 
                       dbname=tempfile(), 
                       file.format=list(header=FALSE, sep="|", eol="\n", skip=1))

Upvotes: 0

G. Grothendieck
G. Grothendieck

Reputation: 269714

Use colClasses like this:

library(sqldf)

read.csv.sql(f1, header = FALSE, sep = "|", colClasses = c("character", "numeric"))

giving:

        V1         V2
1  00000029  1.7150650
2  00000078  0.4609162
3  00000040 -1.2650612
4  00000085 -0.6868529
5  00000090 -0.4456620
6  00000005  1.2240818
7  00000050  0.3598138
8  00000083  0.4007715
9  00000051  0.1106827
10 00000042 -0.5558411

Note: We used the input file generated using this random seed:

set.seed(123)

a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")

Upvotes: 2

Related Questions