Issues loading a file in R

Question

I cannot successfully load the data contained in the Qld+20-34+Age+Groups.zip file which can be found at...

https://github.com/SuperSi2217/datasample

I've opened the file in a txt editor and deleted the unwanted header and tail rows. I've tried various read_csv and read.csv combinations to import it, but it always pulls in an extra column at the end of the data set which is populated with NAs. I've tried converting it to a text file and using read_delim and read.table, and still get the same issues.

df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", col_names = FALSE, quote = "\"", na = c("", "NA"), delim = ",")
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_integer(),
  X4 = col_integer(),
  X5 = col_integer(),
  X6 = col_integer(),
  X7 = col_character()
)
Warning: 1 parsing failure.
row 
# A tibble: 1 x 5 
col       row   col  expected    actual expected     
            
actual 1 1423530   7 columns 6 columns file 
# ... with 1 more variables: file 

df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", delim = ",", col_names = FALSE, quote = "\"", na = c("", "NA"))
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_integer(),
  X4 = col_integer(),
  X5 = col_integer(),
  X6 = col_integer(),
  X7 = col_character()
)
|========================================================| 100%   29 MB

df <- read_csv("C:/qldtest.csv", col_names = TRUE)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X6 = col_integer()
)

The above imports the data but with an extra column. When I try to work on it, it does weird things - see below. To get it down to the three columns I need I use...

df <- df %>% 
       select(X1, X2, X6)

Ultimately I need the data to look something like...

X1    | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD        |0
na        |4033,QLD        |0
na        |4034,QLD        |10
na        |4035,QLD        |0
Smith Town|4032,QLD        |0
na        |4033,QLD        |220
na        |4034,QLD        |0
na        |4035,QLD        |0

I then run...

transform(df, X1 = na.locf(Suburb))

...to fill down the last know value in the first column so it becomes...

X1    | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD        |0
Abbotsbury|4033,QLD        |0
Abbotsbury|4034,QLD        |10
Abbotsbury|4035,QLD        |0
Smith Town|4032,QLD        |0
Smith Town|4033,QLD        |220
Smith Town|4034,QLD        |0
Smith Town|4035,QLD        |0

This works OK but with the following warning...

+ transform(df, X1 = na.locf(df))
Warning messages:
1: In is.na(object) :
  is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(object[1L]) :
  is.na() applied to non-(list or vector) of type 'NULL'

That said, the data looks right.

However, when I run the following to select only those records where the X6 column is > 0, R visibly adds in another four columns, yet the variable count in the global environment still says 3??

df1 <- df %>%
        filter(X6 > 0)

...these look like

X1.X1.X1  |X1.X1.X2|X1.X1.X6|X1.X2   |X1.X6|X2      |X6
----------|--------|--------|--------|-----|--------|--
Abbotsbury|4613,QLD|3       |4613,QLD|3    |4613,QLD|3

What am I doing wrong? Any help appreciated.

The first few rows of the file look like the attached image.

clemens · Accepted Answer

If you open the file in a text editor like Sublime, you will see that there is a comma after every row:

This is why there is an extra column.

I would assume that you don't need the information above the data, so I would suggest to use skip = 11 to read the data. Since there is a disclaimer below the data, you could use n_max to exclude it by limiting the number of rows read.

library(readr)
file <- "C:/Qld 20-34 Age Groups Clean.txt"
df <- read_delim(file, col_names = FALSE, quote = "\"", na = c("", "NA"), 
                 delim = ",", skip = 11, n_max = 1423540)
df$X7 <- NULL
head(df, n = 5)
# A tibble: 5 x 6
     X1        X2    X3    X4    X5    X6
               
1 Abbeywood 4000, QLD     0     0     0     0
2       4005, QLD     0     0     0     0
3       4006, QLD     0     0     0     0
4       4007, QLD     0     0     0     0
5       4008, QLD     0     0     0     0

In order to replace NAs with the most recent non-NAs, you can use

df <- df %>% 
    mutate(X1 = na.locf(df$X1))

head(df, n = 5)
# A tibble: 5 x 6
         X1        X2    X3    X4    X5    X6
               
1 Abbeywood 4000, QLD     0     0     0     0
2       4005, QLD     0     0     0     0
3       4006, QLD     0     0     0     0
4       4007, QLD     0     0     0     0
5       4008, QLD     0     0     0     0

Issues loading a file in R

Answers (2)

Related Questions