Reputation: 35
Update:
It turned out the error was caused by the classes of the variables.
Many thanks to @r2evans, who solved this issue by converting integer64 to numeric when reading the data. His method is effective, but what is worth studying further is his problem-solving logic.
I deleted the data for confidentiality reasons.
I plotted histograms of all numeric columns in my data table.
head(dt) %>%
keep(is.numeric) %>%
gather() %>% na.omit() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
I used head() because the data table is too large.
Then I got this error:
Error in if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < : missing value where TRUE/FALSE needed
Then I ran
eg <- head(dt)
write.csv2(head(dt), "eg.csv")
and saved eg on GitHub.
Then I read it back:
eg <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg.csv")
eg %>%
keep(is.numeric) %>%
gather() %>% na.omit() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
I got the correct histograms!
What happened when I saved the data and read it again? Or is there a way to fix dt?
PS: dt was also created by saving a CSV and reading it back with fread. When I use
eg <- head(dt, 10000)
save it on GitHub, and read it again, the same error happens.
Is it because my dt is too long (3 million rows) and has some bad rows?
Upvotes: 2
Views: 767
Reputation: 160492
The symptom is that two of your fields appear invariant. After downloading the full data dt:
dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv")
dt %>%
keep(is.numeric) %>%
gather() %>%
na.omit() %>%
group_by(key) %>%
summarize(v = var(value))
# Warning: attributes are not identical across measure variables;
# they will be dropped
# # A tibble: 9 x 2
# key v
# <chr> <dbl>
# 1 area_size_high 1.00e18
# 2 area_size_low 3.64e10
# 3 lot_size_high 8.76e17
# 4 lot_size_low 5.60e 5
# 5 price_huf_high 0. ### problem!
# 6 price_huf_low 0.
# 7 total_room_count_high 3.23e17
# 8 total_room_count_low 1.46e 0
# 9 V1 8.33e 6
(Many plots tend to implode when the data is invariant.)
This is confusing, though, because head(dt) definitely shows different values (right side):
V1 ds search_id property_type property_subtype price_huf_low price_huf_high
<int> <IDat> <char> <char> <char> <i64> <i64>
1: 1 2021-02-15 ad2be212-0c25-4e3a-aabf-be089053beba house <NA> 45000000 69000000
2: 2 2021-02-15 ab72ba19-d00f-49e2-8d0d-c6836f030758 apartment <NA> 0 48000000
3: 3 2021-02-06 24bbb050-2ecb-4078-a8dc-65e968f72f43 apartment <NA> 150000000 200000000
4: 4 2021-02-06 f7d87e6e-0f24-4d9e-ae82-2a448d6290bf apartment <NA> 2000000 29000000
5: 5 2021-02-14 71ea3cc4-5326-4bbe-a2ff-20dbae0d9aa8 apartment <NA> 200000000 400000000
(truncated).
However, the key detail there is the i64 column type, which indicates that these are 64-bit integers.
sapply(dt, function(z) class(z)[1])
# V1 ds search_id property_type property_subtype
# "integer" "IDate" "character" "character" "character"
# price_huf_low price_huf_high area_size_low area_size_high lot_size_low
# "integer64" "integer64" "integer" "integer" "integer"
# lot_size_high total_room_count_low total_room_count_high district
# "integer" "integer" "integer" "character"
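A likely explanation for the zero variance, sketched on a made-up vector rather than the real data: bit64 keeps integer64 values in double storage and relies on the class attribute to interpret the bits. When that attribute is dropped (as the gather() warning above reports), the same bits are read as ordinary doubles, which come out vanishingly small. The vector `x` and the name `raw_bits` are illustrative only.

```r
library(bit64)

# Hypothetical prices, stored as 64-bit integers
x <- as.integer64(c(45000000, 0, 150000000, 2000000))

# integer64 lives in double storage; only the class says "treat as int64"
typeof(x)   # "double"

# Dropping the class reinterprets the bit patterns as plain doubles
raw_bits <- unclass(x)

# Variance of the reinterpreted values collapses to (effectively) zero
var(raw_bits)

# A proper conversion keeps the real magnitudes and a sensible variance
var(as.numeric(x))
```

This also explains why keep(is.numeric) lets these columns through: they are double-typed underneath, so is.numeric() returns TRUE for them.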
You can fix this in one of two ways:
Fix it when you read it in (recommended):
dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv",
integer64 = "numeric")
Fix it with data in your environment:
### data.table (since you used `fread`)
dt[, c("price_huf_low", "price_huf_high") := lapply(.SD, as.numeric),
.SDcols = c("price_huf_low", "price_huf_high")]
### or dplyr
dt %>%
mutate(across(starts_with("price"), as.numeric)) %>% # ... rest of your pipe
### if more than 'price_*' columns:
dt %>%
mutate(across(where(~ inherits(., "integer64")), as.numeric)) %>% # ...
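For a self-contained check of both approaches, here is a minimal sketch on inline toy data. The column names and values are hypothetical, chosen only so that some values exceed the 32-bit integer range and `fread` picks integer64; it assumes the bit64 package is installed.

```r
library(data.table)
library(bit64)  # provides the integer64 class and its as.numeric method

# Hypothetical toy CSV: values above 2^31 - 1 force integer64 on read
csv <- "id,price_low,price_high\n1,45000000000,69000000000\n2,0,48000000000\n"

# Default read: the price columns come back as integer64
toy <- fread(text = csv)
sapply(toy, function(z) class(z)[1])

# Option 1: convert while reading
fixed_on_read <- fread(text = csv, integer64 = "numeric")

# Option 2: find every integer64 column and convert it in place
i64_cols <- names(toy)[sapply(toy, inherits, "integer64")]
toy[, (i64_cols) := lapply(.SD, as.numeric), .SDcols = i64_cols]

# Both routes should now report plain numeric columns
sapply(toy, function(z) class(z)[1])
sapply(fixed_on_read, function(z) class(z)[1])
```

Detecting the columns by class rather than by name avoids having to list them by hand when more than the price columns are affected.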
Either way, once those two columns are converted to numeric, they can be plotted with your original code:
dt %>%
keep(is.numeric) %>%
gather() %>% na.omit() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Upvotes: 2