ejt43

Reputation: 11

Replacing values using conditional

I have a set of datapoints relating to different research studies (represented in rows), with columns containing information on country, number of participants etc.

I want to clean up the country data as there are some minor errors e.g. superfluous spaces, different spellings/acronyms etc.

To do so, I created a new variable, country_mod so I could preserve the original data, using the existing country variable. I checked the levels in this new variable and used this to write some replace functions (as per code below). When I run them, there is no error message, but checking the levels again suggests that nothing has changed and the values haven't been recoded.

For example, in the code below I expected the "Australia " values to be replaced with "Australia", but nothing seems to have happened.

This is a really basic function but I can't for the life of me work out why it's not working - I would really welcome any suggestions as to where I am going wrong.

I've had a look online and can't find any answers to this issue.

Here is my code below - dataset is called studies; original variable is called Country; new variable is called country_mod.

#Create new, modified variable for country
studies$country_mod <- studies$Country

#Check what the different levels are

levels(studies$country_mod)

 'Australia' 'Australia ' 'Belgium' 'Canada' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece' 'Hong Kong' 'Hungary' 'Ireland' 'Israel' 'Italy' 'Japan' 'multiple' 'Netherlands' 'New Zealand' 'Norway' 'Poland' 'Portugal' 'Scotland' 'South Korea' 'Spain' 'Spain ' 'Sweden' 'Switzerland' 'Taiwan' 'UK' 'United Kingdom' 'United States' 'United States (Puerto Rico)' 'Uruguay' 'US Virgin Islands' 'USA' 'USA - Puerto Rico' 

# Duplicate values for Australia - one has a space in it. Let's recode it.

studies$country_mod[studies$country_mod=="Australia "] <- "Australia"

levels(studies$country_mod)

 'Australia' 'Australia ' 'Belgium' 'Canada' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece' 'Hong Kong' 'Hungary' 'Ireland' 'Israel' 'Italy' 'Japan' 'multiple' 'Netherlands' 'New Zealand' 'Norway' 'Poland' 'Portugal' 'Scotland' 'South Korea' 'Spain' 'Spain ' 'Sweden' 'Switzerland' 'Taiwan' 'UK' 'United Kingdom' 'United States' 'United States (Puerto Rico)' 'Uruguay' 'US Virgin Islands' 'USA' 'USA - Puerto Rico'

Upvotes: 0

Views: 30

Answers (1)

Dij

Reputation: 1378

TL;DR: the recoding did work, but it does not change the levels of your factor country_mod.

If you call table(studies$country_mod), you will see every level of the factor alongside a count of how many rows currently take that value, so after your recode there will be a 0 beneath "Australia ". Similarly, levels(studies$country_mod) keeps printing every level the factor was created with, whether or not any observation still takes that value.

Once you have finished cleaning up the country_mod entries, use the levels() function to change the set of levels the factor can take (or call droplevels() to discard the unused ones). That should give you the result you were expecting from your recoding.
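To make this concrete, here is a minimal sketch on a toy data frame. The five rows are made up for illustration and only stand in for the real studies data, but the column names match the question. It shows the unused level lingering after the recode, and two possible ways to get rid of it.

```r
# Toy stand-in for the asker's data (hypothetical values, not the real dataset)
studies <- data.frame(
  Country = factor(c("Australia", "Australia ", "Spain", "Spain ", "UK"))
)
studies$country_mod <- studies$Country

# Recode the values: the rows change, but the old level stays in the level set
studies$country_mod[studies$country_mod == "Australia "] <- "Australia"
table(studies$country_mod)   # "Australia " is still listed, now with a count of 0
levels(studies$country_mod)  # still includes "Australia "

# Option 1: drop any level that no observation uses any more
studies$country_mod <- droplevels(studies$country_mod)

# Option 2: rename the level itself; giving two levels the same name merges them,
# so the rows are recoded and the stray level disappears in one step
levels(studies$country_mod)[levels(studies$country_mod) == "Spain "] <- "Spain"

levels(studies$country_mod)
```

If stray whitespace is the only problem, something like levels(studies$country_mod) <- trimws(levels(studies$country_mod)) should handle all such levels at once. The different spellings (UK vs United Kingdom, etc.) would still need renaming by hand as in Option 2.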

Upvotes: 1
