Reputation: 17
I have a dataset of individuals in a certain group. Each has a unique numeric ID. We ask each to give the IDs of their friends in the group.
The problem is the IDs of any individual's friends have been coded in a single string.
Here's an example:
clear
input int user_id strL coop_friends_list_1
79 "81, 80, 93, 92, 87, 94, 89, 88, 83, 84, 97"
80 "82, 83, 89, 88, 93, 92, 87, 81, 97, 84"
81 "82, 89, 93, 92, 87, 88, 79, 84, 80, 97, 83"
82 "80, 81, 87, 92, 93, 97"
83 "92, 80, 87, 81"
84 "92, 97, 82, 87, 88, 93, 89, 80, 79, 83"
85 "95, 98, 94, 91, 86, 90, 96"
86 "94, 96, 85, 91, 98, 95, 90"
87 "83, 81, 92, 88, 89, 93, 82, 80, 79, 84, 94"
88 "80, 81, 84, 87, 89, 92, 93, 94"
end
So for the first line, person #79 has given #81, 80, 93, 92, 87, 94, 89, 88, 83, 84 and 97 as their friends.
What I would like to do is :
My main issue is that the number of friends is not the same for everyone, so splitting the string will create one variable for each friend, but some individuals will have missing values if they have fewer friends than others.
Upvotes: 0
Views: 53
Reputation: 37183
On #1 what do you want that is different? (Full disclosure: putative author of split here.)
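For reference, a minimal sketch of what split does on the data example. The stub name friend is an arbitrary choice for illustration, and the parse string assumes every comma is followed by exactly one space, as in the example; it also assumes split accepts the variable as stored.

```stata
* Sketch only: one new variable per listed friend, converted to numeric.
* "friend" is an illustrative stub; parse(", ") assumes comma-plus-space separators.
split coop_friends_list_1, parse(", ") generate(friend) destring

* Individuals with shorter lists get missing values in the later variables.
list user_id friend1 friend2 friend3 in 1/3
```

Missing values in the trailing variables are the expected result here, since the number of variables created is set by the longest list.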
On #2 the number of identifiers is the number of commas plus 1.
gen wanted = 1 + strlen(coop) - strlen(subinstr(coop, ",", "", .))
The number of commas is counted from the reduction in length if they were removed.
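As a quick check of the arithmetic, here is the same expression applied to the literal string for user 83, which lists four friends. The string has 14 characters, and 11 once its three commas are removed.

```stata
* Worked example on a literal string: 1 + 14 - 11
display 1 + strlen("92, 80, 87, 81") - strlen(subinstr("92, 80, 87, 81", ",", "", .))
* displays 4
```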
Another way to do it is to replace commas by spaces and get a word count. If your commas are always followed by spaces, a word count might work directly.
EDIT: wordcount() works fine on the data example, as words are defined as whatever spaces separate (subject to quotation marks binding tighter than spaces, a qualification that doesn't bite here). If there were any doubt about commas always being followed by spaces, then replacing "," with ", " or indeed " " would ensure correct parsing.
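A minimal sketch of the word-count route, with commas replaced by spaces first so that the result does not depend on how the commas are spaced. The variable name n_friends is an illustrative choice, not part of the original data.

```stata
* Sketch only: count identifiers as space-separated words.
* Replacing every comma with a space guards against commas not followed by a space.
generate n_friends = wordcount(subinstr(coop_friends_list_1, ",", " ", .))

list user_id n_friends
```

On the data example this should agree with the comma-counting expression above; for instance, user 79 lists 11 friends.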
Upvotes: 1