Reputation: 23
The house price dataset has a large number of variables, a few of which have many missing values.
I want to find the number of missing values for each variable.
But with so many variables, the output is hard to scan by eye.
(Below is just a sample of the output. The actual dataset has about 80 variables.)
> sapply(filtered_data, function(x) sum(is.na(x)))
Id Building_Class Zoning_Class
0 0 0
Lot_Extent Lot_Size Property_Shape
259 0 0
Garage Garage_Built_Year Garage_Finish_Year
81 81 81
Garage_Size Garage_Area Garage_Quality
0 0 81
Garage_Condition Pavedd_Drive W_Deck_Area
81 0 0
Screen_Lobby_Area Pool_Area Fence_Quality
0 0 1178
Hence I want to write a small function that prints each column name along with its NA count, only for columns that have missing values.
I tried the following:
for (x in filtered_data){
  if (sum(is.na(x)>0)){
    print(sum(is.na(x)))
    print(colnames(x))
  }
}
However the result is:
[1] 259
NULL
[1] 8
NULL
[1] 8
NULL
[1] 37
NULL
[1] 37
NULL
[1] 38
NULL
[1] 37
NULL
Is there a way to print something like:
Lot_Extent: 259
Garage: 81
Garage_Built_Year: 81
and so on...
Upvotes: 0
Views: 664
Reputation: 1810
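# Count NAs per column; sapply keeps the column names on the result
namedCounts <- sapply(filtered_data, function(x) sum(is.na(x)))
# Keep only the columns with at least one missing value
namedCounts <- namedCounts[namedCounts > 0]
# Build one "name: count" string per remaining column
print(paste0(names(namedCounts), ": ", unname(namedCounts)))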
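To show what the sapply-based approach produces, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are assumptions for illustration only, not the real `filtered_data`. It uses `cat()` with `sep = "\n"` instead of `print()` so that each column comes out on its own line without quotes or an index prefix.

```r
# Toy stand-in for filtered_data (hypothetical values, not the real dataset)
toy <- data.frame(
  Id         = 1:4,
  Lot_Extent = c(65, NA, NA, 80),
  Garage     = c("Attchd", NA, "Detchd", "Attchd"),
  Pool_Area  = c(0, 0, 0, 0)
)

# Count NAs per column; sapply keeps the column names on the result
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Keep only the columns that have at least one missing value
na_counts <- na_counts[na_counts > 0]

# Build one "name: count" string per column and print one per line
report <- paste0(names(na_counts), ": ", na_counts)
cat(report, sep = "\n")
```

On this toy data only `Lot_Extent` and `Garage` are reported, because `Id` and `Pool_Area` have no missing values.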
Upvotes: 1
Reputation: 389175
Here is one vectorised option:
data <- colSums(is.na(filtered_data))
cat(paste(names(data), data, sep = ' : ', collapse = '\n'))
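For comparison, the loop in the question prints NULL because `for (x in filtered_data)` iterates over the columns themselves as plain vectors, and a plain vector has no column names for `colnames()` to return. A minimal sketch of a loop that iterates over the column names instead is below; the data frame `toy` and its values are made up for illustration.

```r
# Toy stand-in for filtered_data (hypothetical values, not the real dataset)
toy <- data.frame(
  Lot_Extent = c(65, NA, NA),
  Garage     = c("Attchd", NA, "Detchd"),
  Pool_Area  = c(0, 0, 0)
)

# Loop over column names so the name is available inside the loop body
lines_out <- character(0)
for (col in names(toy)) {
  n_missing <- sum(is.na(toy[[col]]))
  # Report only columns that actually contain missing values
  if (n_missing > 0) {
    lines_out <- c(lines_out, paste0(col, ": ", n_missing))
  }
}

cat(lines_out, sep = "\n")
```

The vectorised versions are shorter, but this form may be easier to extend if more per-column logic is needed.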
Upvotes: 1