Reputation: 1097
This question is a follow-up question from Stata: replace, if, forvalues. Consider this data:
set seed 123456
set obs 5000
g firmid = "firm" + string(_n) /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
The first 10 observations are:
+--------------------------------------+
| firmid nw lat lon |
|--------------------------------------|
1. | firm1 81 39.915526 -75.505018 |
2. | firm2 35 39.548523 -75.201567 |
3. | firm3 10 39.657866 -75.17988 |
4. | firm4 83 39.957938 -75.898837 |
5. | firm5 56 39.575881 -75.169157 |
6. | firm6 73 39.886184 -75.857255 |
7. | firm7 27 39.33288 -75.724665 |
8. | firm8 75 39.165549 -75.96502 |
9. | firm9 64 39.688819 -75.232764 |
10. | firm10 76 39.012228 -75.166272 |
+--------------------------------------+
I need to calculate the distances between firm 1 and all other firms. So, the vincenty command looks like:
. scalar theLat = 39.915526
. scalar theLon = -75.505018
. vincenty lat lon theLat theLon, hav(distance_km) inkm
The vincenty command creates the distance_km variable that has distances between each observation and firm 1. Here, I manually copy and paste the two numbers that are 39.915526 and -75.505018.
Question 1: What's the syntax that extracts those numbers?
Now, I can keep observations where distances_km <= 2. And,
. egen near_nw_sum = sum(nw)
will create the sum of workers within 2 kilometers of the firm 1. (Or, the collapse command may do the job.)
Question 2: I have to do this for all firms, and the final data should look like:
+-----------------------------------------------------------------+
| firmid nw lat lon near_nw_sum |
|-----------------------------------------------------------------|
1. | firm1 81 39.915526 -75.505018 (# workers near firm1) |
2. | firm2 35 39.548523 -75.201567 (# workers near firm2) |
3. | firm3 10 39.657866 -75.17988 (# workers near firm3) |
4. | firm4 83 39.957938 -75.898837 (# workers near firm4) |
5. | firm5 56 39.575881 -75.169157 (# workers near firm5) |
6. | firm6 73 39.886184 -75.857255 (# workers near firm6) |
7. | firm7 27 39.33288 -75.724665 (# workers near firm7) |
8. | firm8 75 39.165549 -75.96502 (# workers near firm8) |
9. | firm9 64 39.688819 -75.232764 (# workers near firm9) |
10. | firm10 76 39.012228 -75.166272 (# workers near firm10) |
+-----------------------------------------------------------------+
Creating the near_nw_sum variable is my final goal. I need your help here for my weak data management skill.
Upvotes: 1
Views: 9657
Reputation: 11102
The following is basically the same strategy found here and is based on your "final goal". Again, it can be useful depending on the size of your original dataset.joinby
creates observations so you may exceed the Stata limit. However, I believe it does what you want.
clear all
set more off
set seed 123456
set obs 10
g firmid = _n /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
gen dum = 1
list
* joinby procedure
tempfile main
save "`main'"
rename (firmid lat lon nw) =0
joinby dum using "`main'"
drop dum
* Pretty print
sort firmid0 firmid
order firmid0 firmid
list, sepby(firmid0)
* Uncomment if you do not want to include workers in the "base" firm.
*drop if firmid0 == firmid
* Compute distance
vincenty lat0 lon0 lat lon, hav(distance_km) inkm
keep if distance_km <= 40 // an arbitrary distance
list, sepby(firmid0)
* Compute workers of nearby-firms
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
list
What it does is form pairwise combinations of firms to compute distances and sum workers of nearby-firms. No need here to extract scalars as asked in Question 1. Also, no need to complicate the variable firmid
converting to string.
The following overcomes the problem of the Stata limit on number of observations.
clear all
set more off
* Create empty database
gen x = .
tempfile results
save "`results'", replace
* Create input for exercise
set seed 123456
set obs 500
g firmid = _n /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
gen dum = 1
*list
* Save number of firms
local size = _N
display "`size'"
* joinby procedure
tempfile main
save "`main'"
timer clear 1
timer clear 2
timer clear 3
timer clear 4
quietly {
timer on 1
forvalues i=1/`size'{
timer on 2
use "`main'" in `i', clear // assumed sorted on firmid
rename (firmid lat lon nw) =0
joinby dum using "`main'", unmatched(using)
drop _merge dum
order firmid0 firmid
timer off 2
timer on 3
vincenty lat0 lon0 lat lon, hav(dist) inkm
timer off 3
keep if dist <= 40 // an arbitrary distance
timer on 4
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
append using "`results'"
save "`results'", replace
timer off 4
}
timer off 1
}
use "`results'", clear
sort firmid0
drop x
list
timer list
However inefficicent, some testing using timer
shows that most of the computation time goes into the vincenty
command which you won't be able to escape. The following is the time (in seconds) for 10,000 observations with an Intel Core i5 processor and a conventional hard drive (not SSD). Timer 1 is the total while 2, 3, 4 are the components (approx.). Timer 3 corresponds to vincenty
:
. timer list
1: 1953.99 / 1 = 1953.9940
2: 169.19 / 10000 = 0.0169
3: 1669.95 / 10000 = 0.1670
4: 94.47 / 10000 = 0.0094
Of course, note that in both codes duplicate computations of distances are made (e.g. both the distances between firm1-firm2 and firm2-firm1 are computed) and this you can probably avoid. As it stands, for 110,000 observations it will take a long time. On the positive side, I noticed this second setup demands very little RAM as compared to the same amount of observations in the first setup. In fact, my 4GB machine freezes with the latter.
Also note that even though I use the same seed as you do, data is different because I create different numbers of observations (not 5000), which makes a difference in the variable creation process.
(By the way, if you wanted to save the value as a scalar you could use subscripting: scalar latitude = lat[1]
).
Upvotes: 2