Reputation: 27
I have a string variable in Stata with entries like the following
var1 12.34 var2 23.45 var3 34.56
var2 54.32 var3 67.32
var1 32.32 var3 23.45
I want to create three variables with the above string variable by extracting only the numbers (variable names var1
, var2
and var3
). However, not all entries have all three variables. So, if I simply extract the numbers by parsing, entry 2 will have var2
's value under var1
's value and var3
's value under var2
's value.
Is there a way to make all the entries uniform such that if var2
is missing, the entry needs to be replaced with "var2 0"
in between var1
and var3
and if var1
is missing, the entry needs to be replaced with "var1 0"
before var2
.
Upvotes: 0
Views: 455
Reputation: 37278
I am ignoring the question as inserting words (really two words) isn't needed at all to parse your input as you wish. Indeed, inserting words would be awkward as you need to find which words are present before you can determine which to insert (and where to insert them).
Here's one way to do it. You will need to change 3 wired into this example code if you really have more variables than 3 packed inside your data variable.
* Example generated by -dataex-. For more info, type help dataex
clear
input str32 data
"var1 12.34 var2 23.45 var3 34.56"
"var2 54.32 var3 67.32 "
"var1 32.32 var3 23.45"
end
split data, destring
* initialize to 0
forval j = 1/3 {
gen var`j' = 0
}
local K = r(k_new) - 1
* look in variables 1, 3, 5, ... for where the values go
* look in variables 2, 4, 6, ... for what the values are
forval k = 1(2)`K' {
local kP1 = `k' + 1
forval j = 1/3 {
replace var`j' = data`kP1' if data`k' == "var`j'"
}
}
. l data1-var3
+-----------------------------------------------------------------------+
| data1 data2 data3 data4 data5 data6 var1 var2 var3 |
|-----------------------------------------------------------------------|
1. | var1 12.34 var2 23.45 var3 34.56 12.34 23.45 34.56 |
2. | var2 54.32 var3 67.32 . 0 54.32 67.32 |
3. | var1 32.32 var3 23.45 . 32.32 0 23.45 |
+-----------------------------------------------------------------------+
Here is another way to do it that doesn't wire in the number of variables or what they are named.
* Example generated by -dataex-. For more info, type help dataex
clear
input str32 data
"var1 12.34 var2 23.45 var3 34.56"
"var2 54.32 var3 67.32 "
"var1 32.32 var3 23.45"
end
split data
drop data
gen long id = _n
reshape long data, i(id)
gen which = data[_n-1] if real(data) < .
keep if !missing(which)
drop _j
reshape wide data, i(id) j(which) string
rename (data*) (*)
ds id, not
local vars `r(varlist)'
destring `vars', replace
mvencode `vars', mv(0)
list
Upvotes: 1
Reputation: 3255
Here is a similar but more generalizable solution where the number of variables does not need to be set manually, and the names do not need to have an iterable naming convention.
clear
input str32 data
"aaa 12.34 var 23.45 yyy 34.56"
"var 54.32 yyy 67.32 "
"aaa 32.32 yyy 23.45"
end
* Update this local to the name of the variable with data in strings
local datavar data
* Split data in string
split `datavar', destring
local k = r(k_new)
* Get a list of all variables used in any row
local datavars ""
forvalue i = 1(2)`k' {
levelsof `datavar'`i', local(vars) clean
local datavars : list datavars | vars
}
* Loop over all the variables to be created
foreach var of local datavars {
* Initiate variable
gen `var' = 0
* Copy values for that value
forvalue i = 1(2)`k' {
local j = `i' + 1
replace `var' = `datavar'`j' if `datavar'`i' == "`var'"
}
}
Upvotes: 1