Reputation: 495
I have an R data frame as below
Date        @AD.CC_CC  @AD.CC_CC.1  @CL.CC_CC  @CL.CC_CC.1
2018-02-05       -380         -380      -1580        -1580
2018-02-06         20           20       -280         -280
2018-02-07       -700         -700      -1730        -1730
2018-02-08       -460         -460      -1100        -1100
2018-02-09        260          260      -1780        -1780
2018-02-12        480          480        380          380
I use the copy_to function to copy the data frame to Spark, but after copying, the Date column is converted to double:
# Source: lazy query [?? x 5]
# Database: spark_connection
 Date AD_CC_CC AD_CC_CC_1 CL_CC_CC CL_CC_CC_1
<dbl>    <dbl>      <dbl>    <dbl>      <dbl>
17567     -380       -380    -1580      -1580
17568       20         20     -280       -280
17569     -700       -700    -1730      -1730
17570     -460       -460    -1100      -1100
17571      260        260    -1780      -1780
17574      480        480      380        380
I am trying to convert it back to Date using the command below, but it throws an error:
marketdata_spark %>% mutate(Date = as.Date(Date))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(marketdata.`Date` AS DATE)' due to data type mismatch: cannot cast double to date; line 1 pos 59;
Not sure what to do.
Upvotes: 1
Views: 1439
Reputation: 330343
This looks like a sparklyr bug. The simplest workaround is to cast dates to character before calling copy_to:
df <- tibble::tibble(Date=as.Date(c("2018-02-05", "2018-02-06")))
sdf <- df %>% mutate(Date = as.character(Date)) %>% copy_to(sc, .)
sdf
# Source: table<sparklyr_11ae23aa677e> [?? x 1]
# Database: spark_connection
Date
<chr>
1 2018-02-05
2 2018-02-06
and then casting back to date on the Spark side:
sdf %>% mutate(Date = to_date(Date))
# Source: lazy query [?? x 1]
# Database: spark_connection
Date
<date>
1 2018-02-05
2 2018-02-06
You can also keep the numeric value and treat it as a day offset from the beginning of the Unix epoch:
sdf <- df %>% copy_to(sc, .)
sdf
# Source: table<sparklyr_13ab19ec6f53> [?? x 1]
# Database: spark_connection
Date
<dbl>
1 17567
2 17568
sdf %>% mutate(Date = date_add(to_date("1970-01-01"), Date))
# Source: lazy query [?? x 1]
# Database: spark_connection
Date
<date>
1 2018-02-05
2 2018-02-06
Alternatively, you can skip copy_to entirely (it has very limited applications anyway, and is seldom useful in production) and use one of the built-in input functions (spark_read_*).
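A minimal sketch of that approach, assuming a Spark connection sc and a file marketdata.csv (hypothetical path). Reading the file directly into Spark means the Date column never passes through R's numeric date representation; with schema inference enabled, Spark will typically parse an ISO date string as a date/timestamp rather than a double:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read the CSV straight into Spark, bypassing copy_to()
marketdata_spark <- spark_read_csv(
  sc,
  name = "marketdata",
  path = "marketdata.csv",   # hypothetical path
  header = TRUE,
  infer_schema = TRUE
)

# If the column still arrives as character, cast it on the Spark side
marketdata_spark <- marketdata_spark %>% mutate(Date = to_date(Date))

The same pattern applies to the other readers (spark_read_parquet, spark_read_json, etc.), with Parquet preserving the date type natively.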
Upvotes: 4