String variable contains multiple entries, number of entries differs by observations. Can I turn it into byte and sum the number of entries?

I have a dataset of individuals in a certain group. Each has a unique numeric ID. We ask each to give the IDs of their friends in the group.

The problem is the IDs of any individual's friends have been coded in a single string.

Here's an example:

clear
input int user_id strL coop_friends_list_1
79 "81, 80, 93, 92, 87, 94, 89, 88, 83, 84, 97"
80 "82, 83, 89, 88, 93, 92, 87, 81, 97, 84"
81 "82, 89, 93, 92, 87, 88, 79, 84, 80, 97, 83"
82 "80, 81, 87, 92, 93, 97"
83 "92, 80, 87, 81"
84 "92, 97, 82, 87, 88, 93, 89, 80, 79, 83"
85 "95, 98, 94, 91, 86, 90, 96"
86 "94, 96, 85, 91, 98, 95, 90"
87 "83, 81, 92, 88, 89, 93, 82, 80, 79, 84, 94"
88 "80, 81, 84, 87, 89, 92, 93, 94"
end

So for the first line, person #79 has given #81, 80, 93, 92, 87, 94, 89, 88, 83, 84 and 97 as their friends.

What I would like to do is:

  1. Transform the string into numeric entries. I can do so using split with its destring option, which I already did for another variable.

My main issue is that the number of friends is not the same for everyone, so split will create one variable for each friend, and individuals with fewer friends than the maximum will have missing values in the later variables.

  2. Count the number of friends each person has. I do not want the sum of the IDs: I want the number of individual IDs cited in each observation, stored in a final variable.

Upvotes: 0

Views: 53

Answers (1)

Nick Cox

Reputation: 37183

On #1 what do you want that is different? (Full disclosure: putative author of split here.)
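For reference, a minimal sketch of what split with destring produces on two rows of the data example, and how the resulting missing values can themselves be used to count friends. The stub name friend and the variable n_friends are illustrative choices, not from the question, and the list is stored here as str60 rather than strL to keep the example simple.

```stata
clear
input int user_id str60 coop_friends_list_1
82 "80, 81, 87, 92, 93, 97"
83 "92, 80, 87, 81"
end

* One new variable per entry: friend1, friend2, ...
* split trims surrounding spaces by default; destring converts the pieces to numeric
split coop_friends_list_1, parse(",") generate(friend) destring

* Shorter lists leave missing values in the later variables,
* so the number of nonmissing values in the row is the number of friends
egen byte n_friends = rownonmiss(friend*)

list user_id friend* n_friends
```

Here user 83 ends up with missing values in friend5 and friend6, which is the expected result of a list of unequal length rather than an error.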

On #2 the number of identifiers is the number of commas plus 1.

 gen wanted = 1 + strlen(coop) -  strlen(subinstr(coop, ",", "", .)) 

The number of commas is counted from the reduction in length if they were removed.
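One caveat with this count is that an empty list still yields 1. A hedged sketch of a guard follows, assuming that an empty or all-blank string means "no friends named"; the observation with user_id 99 is hypothetical and added only to show that case, and n_friends is an illustrative name.

```stata
clear
input int user_id str60 coop_friends_list_1
79 "81, 80, 93"
83 "92, 80, 87, 81"
99 ""
end

* Commas + 1, except that an empty list counts as 0 friends
* (the /// line continuation assumes this is run from a do-file)
generate byte n_friends = cond(trim(coop_friends_list_1) == "", 0, ///
    1 + strlen(coop_friends_list_1) - strlen(subinstr(coop_friends_list_1, ",", "", .)))

list
```

On these three rows the counts are 3, 4 and 0.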

Another way to do it is to replace commas by spaces and get a word count. If your commas are always followed by spaces, a word count might work directly.

EDIT: wordcount() works fine on the data example, as words are defined as whatever spaces separate (subject to quotation marks binding tighter than spaces, a qualification that doesn't bite here). If there were any doubt about commas always being followed by spaces, then replacing "," with ", " or indeed " " would ensure correct parsing.
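A short sketch of the word-count route, checked against the comma count. The second row, written without spaces after the commas, is a hypothetical variant added to show why replacing commas with spaces first is the safer form; n_commas and n_words are illustrative names.

```stata
clear
input int user_id str60 coop_friends_list_1
82 "80, 81, 87, 92, 93, 97"
83 "92,80,87,81"
end

* Count via commas: number of commas plus 1
generate byte n_commas = 1 + strlen(coop_friends_list_1) - strlen(subinstr(coop_friends_list_1, ",", "", .))

* Count via words: turn every comma into a space, then count words
generate byte n_words = wordcount(subinstr(coop_friends_list_1, ",", " ", .))

* The two counts should agree in every observation
assert n_commas == n_words
```

Calling wordcount() directly on the second row would treat "92,80,87,81" as a single word, which is the case the substitution guards against.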

Upvotes: 1
