user15535192
user15535192

Reputation:

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort ID, and set a temporary variable to hold the integer or real value that I found, and then egen max() to find the max value for all records found, then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (but not longitudinal, maybe cross-sectional or regression, ANOVA, etc.)

Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:

cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp) 
bysort id (visitdate): gen ldl3 = ldl3max if _n==1 

Suppose there are initials for the lab technician or phlebotomist that did the blood draw. How can I percolate a string value to record _n==1 that's associated with the greatest ldl value among the subset of records for the 3rd year of the trial? String values can't be sorted, so I am guessing the answer might be to eliminate records for which ldl is not the greatest value in year 3, then the string will be in that record?

In this case, how can I find out what _n is for the maximum value? If I know that, I could use

bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl 

Upvotes: 2

Views: 146

Answers (1)

user15535192
user15535192

Reputation:

Here is how to find the record number associated with the greatest ldl value within 4 quarterly ldl values in year 3 of a trial. The result is a variable called recmax, which will only be filled in for the specific record where the greatest value was found (among all records for each subject).

cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3  
bysort id (visitdate): egen maxldl3 = max(tmpldl3) 
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.

You can then analyze all the other data (such as string data) in that record cross-sectionally (ANOVA, correlation, regression) by specifying if recmax!=. in the trailing if statement for any analysis command. If you are careful, you could also drop all other records with extraneous ldl values not of interest by using the command drop if recmax!=. providing you realize you dropped data and if you save, save to a filename with "_reduced" or "_dropped" in it.

Upvotes: 2

Related Questions