Reputation: 11
I have a set of datapoints relating to different research studies (represented in rows), with columns containing information on country, number of participants etc.
I want to clean up the country data as there are some minor errors e.g. superfluous spaces, different spellings/acronyms etc.
To do so, I created a new variable, country_mod, from the existing Country variable so that I could preserve the original data. I checked the levels of this new variable and used them to write some replacement statements (as in the code below). When I run them, there is no error message, but checking the levels again suggests that nothing has changed and the values haven't been recoded.
For example, in the code below I was expecting the "Australia " values to be replaced with "Australia", but nothing seems to have happened.
This is a really basic function but I can't for the life of me work out why it's not working - I would really welcome any suggestions as to where I am going wrong.
I've had a look online and can't find any answers to this issue.
Here is my code. The dataset is called studies; the original variable is called Country; the new variable is called country_mod.
#Create new, modified variable for country
studies$country_mod <- studies$Country
#Check what the different levels are
levels(studies$country_mod)
'Australia' 'Australia ' 'Belgium' 'Canada' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece' 'Hong Kong' 'Hungary' 'Ireland' 'Israel' 'Italy' 'Japan' 'multiple' 'Netherlands' 'New Zealand' 'Norway' 'Poland' 'Portugal' 'Scotland' 'South Korea' 'Spain' 'Spain ' 'Sweden' 'Switzerland' 'Taiwan' 'UK' 'United Kingdom' 'United States' 'United States (Puerto Rico)' 'Uruguay' 'US Virgin Islands' 'USA' 'USA - Puerto Rico'
# Duplicate values for Australia - one has a space in it. Let's recode it.
studies$country_mod[studies$country_mod=="Australia "] <- "Australia"
levels(studies$country_mod)
'Australia' 'Australia ' 'Belgium' 'Canada' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece' 'Hong Kong' 'Hungary' 'Ireland' 'Israel' 'Italy' 'Japan' 'multiple' 'Netherlands' 'New Zealand' 'Norway' 'Poland' 'Portugal' 'Scotland' 'South Korea' 'Spain' 'Spain ' 'Sweden' 'Switzerland' 'Taiwan' 'UK' 'United Kingdom' 'United States' 'United States (Puerto Rico)' 'Uruguay' 'US Virgin Islands' 'USA' 'USA - Puerto Rico'
Upvotes: 0
Views: 30
Reputation: 1378
TL;DR: the recoding did work, but it does not change the levels of your factor country_mod. If you call table(studies$country_mod), you will see a table whose names are all the levels of the factor (every value an observation could take), each with a count of how many rows actually take that value. After your recode, that table will show a 0 beneath "Australia ". Similarly, levels(studies$country_mod) still prints every level the factor was created with, whether or not any observation currently takes that value.
Once you have finished cleaning up the country_mod entries, drop the unused levels with droplevels(), or reassign the level labels with levels(). The output of levels() will then match what I believe you were expecting from your recoding.
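A minimal sketch of the behaviour described above, using a small made-up data frame in place of your studies data (the toy values and the cleaned variable are assumptions for illustration only). It shows that the recode changes the observations but not the level set, and two possible ways to tidy the levels afterwards:

```r
# Hypothetical toy data standing in for the real `studies` data frame
studies <- data.frame(
  Country = factor(c("Australia", "Australia ", "Spain", "Spain ", "UK"))
)
studies$country_mod <- studies$Country

# Recode the values: the observations change, the set of levels does not
studies$country_mod[studies$country_mod == "Australia "] <- "Australia"
table(studies$country_mod)    # "Australia " is still listed, with a count of 0
levels(studies$country_mod)   # still includes "Australia "

# Option 1: drop levels that no observation uses any more
studies$country_mod <- droplevels(studies$country_mod)
levels(studies$country_mod)   # "Australia " is gone; "Spain " remains (not yet recoded)

# Option 2: clean the level labels directly; levels that end up with
# the same label are merged, so no value-by-value recoding is needed
cleaned <- studies$Country
levels(cleaned) <- trimws(levels(cleaned))
levels(cleaned)
```

Option 2 only handles stray whitespace. Different spellings such as "UK" and "United Kingdom" would still need to be mapped to one label, for example by assigning the same label to both entries of levels().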
Upvotes: 1