Li Deborah Jia

Reputation: 35

Plotting histograms gives error "missing value where TRUE/FALSE needed"

Update:

It turned out the error was caused by the variables having different classes.

Many thanks to @r2evans, who solved this issue by converting integer64 to numeric when reading the data. His method is effective, but what is worth studying further is his problem-solving logic.

I deleted the data for confidentiality reasons.

Below is the original question.

I plotted histograms of all numeric columns in my data table.

head(dt) %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

I used head() because the data table is too large.

Then I got this error:

Error in if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < : missing value where TRUE/FALSE needed

Then I ran

eg <- head(dt)
write.csv2(head(dt), "eg.csv")

and saved eg.csv on GitHub.

Then I read it back and plotted again:

eg <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg.csv")

eg %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

This time I got the correct histograms!

What happened when I saved the data and read it again? Or is there a way to fix dt?

PS: dt itself was also created by saving a CSV and reading it back with fread. When I use

eg <- head(dt, 10000)

save it on GitHub, and read it again, the same error occurs.

Is it because my dt is too long (3 million rows) and has some bad rows?

Upvotes: 2

Views: 767

Answers (1)

r2evans

Reputation: 160492

The symptom is that two of your fields appear invariant. After downloading the full data dt:

dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv")
dt %>%
  keep(is.numeric) %>% 
  gather() %>%
  na.omit() %>%
  group_by(key) %>%
  summarize(v = var(value))
# Warning: attributes are not identical across measure variables;
# they will be dropped
# # A tibble: 9 x 2
#   key                         v
#   <chr>                   <dbl>
# 1 area_size_high        1.00e18
# 2 area_size_low         3.64e10
# 3 lot_size_high         8.76e17
# 4 lot_size_low          5.60e 5
# 5 price_huf_high        0.         ### problem!
# 6 price_huf_low         0.     
# 7 total_room_count_high 3.23e17
# 8 total_room_count_low  1.46e 0
# 9 V1                    8.33e 6

(Many plots tend to implode when the data is invariant.)

This is confusing, though, because head(dt) definitely shows different values (right side):

          V1         ds                            search_id property_type property_subtype price_huf_low price_huf_high
       <int>     <IDat>                               <char>        <char>           <char>         <i64>          <i64>
    1:     1 2021-02-15 ad2be212-0c25-4e3a-aabf-be089053beba         house             <NA>      45000000       69000000
    2:     2 2021-02-15 ab72ba19-d00f-49e2-8d0d-c6836f030758     apartment             <NA>             0       48000000
    3:     3 2021-02-06 24bbb050-2ecb-4078-a8dc-65e968f72f43     apartment             <NA>     150000000      200000000
    4:     4 2021-02-06 f7d87e6e-0f24-4d9e-ae82-2a448d6290bf     apartment             <NA>       2000000       29000000
    5:     5 2021-02-14 71ea3cc4-5326-4bbe-a2ff-20dbae0d9aa8     apartment             <NA>     200000000      400000000

(truncated).

However, the key detail there is the <i64> under the column names, which indicates that these columns are 64-bit integers.

sapply(dt, function(z) class(z)[1])
#                    V1                    ds             search_id         property_type      property_subtype 
#             "integer"               "IDate"           "character"           "character"           "character" 
#         price_huf_low        price_huf_high         area_size_low        area_size_high          lot_size_low 
#           "integer64"           "integer64"             "integer"             "integer"             "integer" 
#         lot_size_high  total_room_count_low total_room_count_high              district 
#             "integer"             "integer"             "integer"           "character" 
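To see why 64-bit integers can end up looking invariant, here is a minimal sketch. It assumes the bit64 package is installed, uses made-up values rather than your data, and uses unclass() as a stand-in for the attribute dropping that gather() warns about above. As far as I understand it, integer64 keeps its values in a double vector plus a class attribute, so once the class is gone the same bits are read as ordinary doubles.

```r
library(bit64)  # provides the integer64 class

# Hypothetical values in the same range as the price_huf_* columns
x <- as.integer64(c(45000000, 0, 150000000))

# Dropping the class exposes the underlying storage: the 64-bit integer
# bit patterns reinterpreted as doubles, which come out vanishingly small
raw <- unclass(x)
raw
var(raw)                   # effectively zero, so the column looks invariant

# A proper conversion keeps the real values
converted <- as.numeric(x)
converted
var(converted)             # clearly non-zero
```

This is only an illustration of the mechanism; the exact path through gather() may differ in detail, but the variances of 0 in the summary above are consistent with it.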

You can fix this in one of two ways:

  1. Fix it when you read it in (recommended):

    dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv",
                integer64 = "numeric")
    
  2. Fix it with data in your environment:

    ### data.table (since you used `fread`)
    dt[, c("price_huf_low", "price_huf_high") := lapply(.SD, as.numeric),
       .SDcols = c("price_huf_low", "price_huf_high")]
    
    ### or dplyr
    dt %>%
      mutate(across(starts_with("price"), as.numeric)) %>% # ... rest of your pipe
    ### if more than 'price_*' columns:
    dt %>%
      mutate(across(where(~ inherits(., "integer64")), as.numeric)) %>% # ...
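
As a small sketch of why the 6-row sample read back fine while the larger file did not: as far as I know, fread only chooses integer64 for a column when it sees a whole number too large for a 32-bit integer, so a sample without such a value comes back as plain integer. The CSV text and column names below are made up for illustration, and the last step generalizes the data.table fix to every integer64 column (assuming data.table and bit64 are installed).

```r
library(data.table)
library(bit64)  # needed so as.numeric() converts integer64 correctly

# Hypothetical toy CSVs, not the original data
small_csv <- "id,price\n1,45000000\n2,69000000"
big_csv   <- "id,price\n1,45000000\n2,3000000000"  # exceeds 2^31 - 1

small <- fread(text = small_csv)
big   <- fread(text = big_csv)
fixed <- fread(text = big_csv, integer64 = "numeric")

class(small$price)  # plain integer: every value fits in 32 bits
class(big$price)    # integer64: one value does not fit
class(fixed$price)  # numeric, because of integer64 = "numeric"

# Convert every integer64 column in place, whatever its name
i64_cols <- names(big)[sapply(big, inherits, "integer64")]
big[, (i64_cols) := lapply(.SD, as.numeric), .SDcols = i64_cols]
class(big$price)    # now numeric as well
```

Selecting columns by class rather than by name avoids having to list them by hand if other columns also turn out to be integer64.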
    

Either way, once those two columns are converted to numeric, they can be plotted with your original code:

dt %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

[Figure: proper multi-variable histogram]

Upvotes: 2
