Reputation: 37
I cannot successfully load the data contained in the Qld+20-34+Age+Groups.zip file which can be found at...
https://github.com/SuperSi2217/datasample
I've opened the file in a txt editor and deleted the unwanted header and tail rows. I've tried various read_csv
and read.csv
combinations to import it, but it always pulls in an extra column at the end of the data set which is populated with NA
s. I've tried converting it to a text file and using read_delim
and read.table
, and still get the same issues.
df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", col_names = FALSE, quote = "\"", na = c("", "NA"), delim = ",")
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer(),
X7 = col_character()
)
Warning: 1 parsing failure.
row
# A tibble: 1 x 5
col row col expected actual expected
<int> <chr> <chr> <chr>
actual 1 1423530 <NA> 7 columns 6 columns file
# ... with 1 more variables: file <chr>
df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", delim = ",", col_names = FALSE, quote = "\"", na = c("", "NA"))
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer(),
X7 = col_character()
)
|========================================================| 100% 29 MB
df <- read_csv("C:/qldtest.csv", col_names = TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X6 = col_integer()
)
The above imports the data but with an extra column. When I try to work on it, it does weird things - see below. To get it down to the three columns I need I use...
df <- df %>%
select(X1, X2, X6)
Ultimately I need the data to look something like...
X1 | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD |0
na |4033,QLD |0
na |4034,QLD |10
na |4035,QLD |0
Smith Town|4032,QLD |0
na |4033,QLD |220
na |4034,QLD |0
na |4035,QLD |0
I then run...
transform(df, X1 = na.locf(Suburb))
...to fill down the last know value in the first column so it becomes...
X1 | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD |0
Abbotsbury|4033,QLD |0
Abbotsbury|4034,QLD |10
Abbotsbury|4035,QLD |0
Smith Town|4032,QLD |0
Smith Town|4033,QLD |220
Smith Town|4034,QLD |0
Smith Town|4035,QLD |0
This works OK but with the following warning...
+ transform(df, X1 = na.locf(df))
Warning messages:
1: In is.na(object) :
is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(object[1L]) :
is.na() applied to non-(list or vector) of type 'NULL'
That said, the data looks right.
However, when I run the following to select only those records where the X6 column is > 0, R visibly adds in another four columns, yet the variable count in the global environment still says 3??
df1 <- df %>%
filter(X6 > 0)
...these look like
X1.X1.X1 |X1.X1.X2|X1.X1.X6|X1.X2 |X1.X6|X2 |X6
----------|--------|--------|--------|-----|--------|--
Abbotsbury|4613,QLD|3 |4613,QLD|3 |4613,QLD|3
What am I doing wrong? Any help appreciated.
The first few rows of the file look like the attached image.
Upvotes: 0
Views: 499
Reputation: 6813
If you open the file in a text editor like Sublime, you will see that there is a comma after every row:
This is why there is an extra column.
I would assume that you don't need the information above the data, so I would suggest to use skip = 11
to read the data. Since there is a disclaimer below the data, you could use n_max
to exclude it by limiting the number of rows read.
library(readr)
file <- "C:/Qld 20-34 Age Groups Clean.txt"
df <- read_delim(file, col_names = FALSE, quote = "\"", na = c("", "NA"),
delim = ",", skip = 11, n_max = 1423540)
df$X7 <- NULL
head(df, n = 5)
# A tibble: 5 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD 0 0 0 0
2 <NA> 4005, QLD 0 0 0 0
3 <NA> 4006, QLD 0 0 0 0
4 <NA> 4007, QLD 0 0 0 0
5 <NA> 4008, QLD 0 0 0 0
In order to replace NAs with the most recent non-NAs, you can use
df <- df %>%
mutate(X1 = na.locf(df$X1))
head(df, n = 5)
# A tibble: 5 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD 0 0 0 0
2 <NA> 4005, QLD 0 0 0 0
3 <NA> 4006, QLD 0 0 0 0
4 <NA> 4007, QLD 0 0 0 0
5 <NA> 4008, QLD 0 0 0 0
Upvotes: 1
Reputation: 11772
What about just skipping the first 9 rows and using the normal header of the file?
Something like this:
jnk <-
read.csv('~/Downloads/Qld 20-34 Age Groups.csv', skip=9, stringsAsFactors=FALSE)
You can have a look with
summary(jnk)
and for example use your df %>% filter(X6 > 0)
command which would look like this
head(jnk %>% filter(Total > 0))
Or did I miss some important point in the question?
Upvotes: 0