Scott Davis

Reputation: 993

Problems fixing character vectors in R with gsub()

I have a dataset whose headers contain "_", ".", and "..." symbols. I tried using the gsub function in R to remove these symbols from the header names, but they still remain.

Here's the function:

#Load dataset
Smallstore1 <- read.csv("/Users/scdavis6/Documents/Work/TowerData/TowerData/Smallclient1.csv", 
                   na.strings = "", head = TRUE)
#Convert csv to data.frame
frame <- as.data.frame(Smallstore1, stringsAsFactors = FALSE)
#Clean up titles of data.frame
gsub("_", "...", ".", Smallstore1)

Here's the names of the dataset:

> names(Smallstore1)
 [1] "user_id"               "email"                 "Age"                   "Gender"               
 [5] "Household.Income"      "Marital.Status"        "Presence.of.Children"  "Home.Owner.Status"    
 [9] "Home.Market.Value"     "Occupation"            "Education"             "Zip.Code"             
[13] "High.Net.Worth"        "Length.of.Residence"   "Arts...Crafts"         "Automotive"           
[17] "Baby.Product.Buyer"    "Beauty"                "Blogging"              "Books"                
[21] "Business"              "Charitable.Donors"     "Cooking"               "Discount.Shopper"     
[25] "Health...Wellness"     "High.End.Brand.Buyer"  "Home...Garden"         "Home.Improvement"     
[29] "Luxury.Goods"          "Magazine.Buyer"        "News...Current.Events" "Outdoor...Adventure"  
[33] "Pets"                  "Sports"                "Technology"            "Travel" 

Please let me know if I can provide more information.

EDIT: I tried two solutions suggested in the comments, but they did not fix the character vectors.

> names(Smallstore1) <- gsub("_|\\.\\.\\.|\\." , "" , names(Smallstore1))
> gsub("_|\\.\\.\\.|\\." , "" , names(Smallstore1))

Instead, I got character vectors with the same numbers.

> names(Smallstore1)
[1] "c(12945, 12947, 12990, 13160, 13195, 13286, 13464, 13501, 13532, 13613, 13660, 13668, 13719, 13776, 13821, 13834, 13858, 13915, 13953, 13977, 14078, 14133, 14157, 14174, 14181, 14187, 14191, 14204, 14276, 14334, 14382, 14439, 14473, 14497, 14507, 14538, 14548, 14555, 14565, 14595, 14620, 14705, 14731, 14752, 14810, 14824, 14827, 14864, 14875, 14983, 14994, 15048, 15096, 15147, 15194, 15234, 15269, 15334, 15381, 15405, 15449, 15453, 15462, 15625, 15646, 15666, 15687, 15708, 15731, 15782, 15823, 15914, \n15935, 16014, 16065, 16095, 16173, 16269, 16289, 16339, 16374, 16408, 16445, 16465, 16527, 16547, 16561, 16581, 16609, 16646, 16677, 16768, 16779, 16792, 16830, 16839, 16849, 17064, 17071, 17149, 17159, 17261, 17346, 17377, 17427, 17428, 17448, 17652, 17737, 17765, 17768, 17808, 17897, 17907, 17910, 17961, 17999, 18122, 18159, 18175, 18397, 18434, 18583, 18635, 18683, 18685, 18713, 18754, 18825, 18839, 18900, 18913, 19040, 19063, 19091, 19144, 19199, 19233, 19308, 19315, 19335, 19366, 19417, 19533, \n19539, 19546, 19553, 19604, 19658, 19669, 19689, 19767, 19791, 19825, 19869, 19998, 20032, 20046, 20107, 20168, 20175, 20287, 20457, 20464, 20481, 20590, 20634, 20647, 20651, 20753, 20783, 20794, 20872, 20967, 21001, 21046, 21110, 21114, 21117, 21191, 21199, 21246, 21253, 21327, 21358, 21409, 21412, 21420, 21480, 21494, 21497, 21508, 21522, 21633, 21637, 21675, 21684, 21698, 21729, 21831, 21847, 21868, 21916, 21950, 21984, 22018, 22021, 22092, 22242, 22249, 22259, 22323, 22364, 22453, 22582, 22606, \n22616, 22619, 22623, 22629, 22630, 22633, 22698, 22776, 22793, 22827, 22891, 22905, 22973, 23010, 23014, 23038, 23052, 23106, 23163, 23173, 23191, 23377, 23388, 23401, 23409, 23466, 23520, 23568, 23670, 23677, 23762, 23823, 23847, 23908, 23925, 23936, 23939, 24017, 24177, 24289, 24313, 24316, 24340, 24395, 24401, 24421, 24480, 24548, 24602, 24718, 24731, 24778, 24833, 24840, 24843, 24877, 24908, 24969, 24990, 25061, 25064, 25224, 25254, 25258, 25268, 25275, 25296, 
25367, 25418, 25438, 25445, 25496, \n25553, 25588, 25653, 25707, 25751, 25945, 25989, 26002, 26023, 26057, 26132, 26166)"

Upvotes: 0

Views: 419

Answers (2)

Rich Scriven

Reputation: 99351

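You can use a bracket expression such as "[.]" to match the dots literally. Most punctuation placed inside "[]" is treated literally and does not need escaping (a few characters, such as "]", "^", and "-", still have special meaning depending on their position). With gsub, a single "." in the pattern is enough to remove all of the dots in a string, because gsub replaces every match (sub replaces only the first occurrence).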

> txt 
#  [1] "user_id"              "email"                "Age"                 
#  [4] "Gender"               "Household.Income"     "Marital.Status"      
#  [7] "Presence.of.Children" "Home.Owner.Status"    "Home.Market.Value"   
# [10] "Occupation"           "Education"            "Zip.Code"            
# [13] "High.Net.Worth"       "Length.of.Residence"  "Arts...Crafts"       
# [16] "Automotive"  
> gsub("[.]|[_]", " ", txt)
#  [1] "user id"              "email"                "Age"                 
#  [4] "Gender"               "Household Income"     "Marital Status"      
#  [7] "Presence of Children" "Home Owner Status"    "Home Market Value"   
# [10] "Occupation"           "Education"            "Zip Code"            
# [13] "High Net Worth"       "Length of Residence"  "Arts   Crafts"       
# [16] "Automotive"      

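You could also use a single bracket expression, "[._]", for the matching. I originally wrote it as "[(.)(_)]" because I find it easier to read, but note that parentheses inside brackets are matched literally, so that version would also match "(" and ")". That makes no difference here, since none of the names contain parentheses.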
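To make the difference between sub and gsub concrete, here is a small sketch on a made-up vector (the name hdr is hypothetical, not from the question). The last call uses "[._]+", which is my assumption about what you might prefer: it treats a run such as "..." as one match, so it collapses to a single space.

```r
# Hypothetical sample of headers, modelled on names(Smallstore1)
hdr <- c("user_id", "Home.Owner.Status", "Arts...Crafts")

# sub() replaces only the first match in each element
first_only <- sub("[._]", " ", hdr)

# gsub() replaces every match in each element
all_matches <- gsub("[._]", " ", hdr)

# "[._]+" treats a run of dots/underscores as a single match
collapsed <- gsub("[._]+", " ", hdr)

print(first_only)
print(all_matches)
print(collapsed)
```

None of these calls changes hdr itself; each returns a new character vector.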

Link to quality R text processing wiki


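I just noticed that you have gsub("_", "...", ".", Smallstore1) in your code. gsub takes one pattern, one replacement, and then the character vector to work on, in that order, so this call treats "_" as the pattern, "..." as the replacement, and "." as the text to search; Smallstore1 lands in the fourth argument, ignore.case. Best to have a read of ?gsub.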
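As a rough illustration of the argument order, here is a minimal sketch on a small stand-in data frame (df and its values are hypothetical, not your data). Naming the arguments makes the roles explicit, and the result is assigned back because gsub returns a new vector rather than modifying anything in place.

```r
# Stand-in for Smallstore1 with a few of its column names (made-up values)
df <- data.frame(user_id = 1:2,
                 Zip.Code = c("10001", "94105"),
                 Arts...Crafts = c(TRUE, FALSE))

# pattern first, replacement second, the vector to clean third;
# operate on names(df), not on the data frame itself
names(df) <- gsub(pattern = "[.]|[_]", replacement = " ", x = names(df))

print(names(df))
```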

Upvotes: 1

rsoren

Reputation: 4216

If you want to clean up the variable names, you need to do something like this:

names(Smallstore1) <- gsub("_|\\.\\.\\.|\\." , "", names(Smallstore1))

Note that this won't affect the frame data frame, because you created it before changing the variable names.
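Here is a small sketch of that point, using a made-up two-column stand-in for your data (the values are hypothetical). The copy called frame keeps the old names until you rename it too, or create it after the clean-up.

```r
# Small stand-in for Smallstore1 (hypothetical data)
Smallstore1 <- data.frame(user_id = 1:2, Zip.Code = c("10001", "94105"))

# Copy made before the names are cleaned, as in the question
frame <- as.data.frame(Smallstore1, stringsAsFactors = FALSE)

# Clean the names of the original only
names(Smallstore1) <- gsub("_|\\.\\.\\.|\\.", "", names(Smallstore1))

frame_names_before <- names(frame)
print(names(Smallstore1))  # cleaned names
print(frame_names_before)  # still the old names

# One way to bring the copy in line: reuse the cleaned names
names(frame) <- names(Smallstore1)
```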

You also used gsub incorrectly. To find out what parameters a function takes, type str(gsub) into the console. help(gsub) gives a fuller explanation.

Upvotes: 0
