user177196
user177196

Reputation: 728

Compute recursion for one column conditionally on values in another columns

I was given dataset named Temp.dat with 2 columns (Dataset here). I initially formed structure named structure data_t data[100] so that I could arrange the columns in an increasing order based on the first column (Column 0 = min(failure time, censored time), Column 1 indicates 1 = Death observation, 0 = censored observation). A portion of the structured dataset has the following form

0.064295 1 
0.070548 1 
0.070850 1 
0.071508 0 
0.077981 1 
0.086628 1 
0.088239 1 
0.090754 1 
0.093260 0 
0.094090 1 
0.094367 1 
0.097019 1 
0.099336 1 
0.103765 1 
0.103961 1 
0.111674 0 
0.122609 0 
0.123730 1 

Now, I want to write the C code to form different time periods whose endpoints always end with entry 1in the 2nd column. Looks like the following:

Expected output - 3rd column (Time Interval) added

0.064295 1 [0 0.064295)        
0.070548 1 [0.064295 0.070548) 
0.070850 1 [0.070548 0.070850) 
0.071508 0 [0.070850 0.077891) ---> Skip 0.071508 here because of 0 in column 1 
0.077981 1 [0.070850 0.077981)
0.086628 1 [0.077981 0.086628) 
0.088239 1 [0.086628 0.088239) 
0.090754 1 [0.088239 0.090754) 
0.093260 0 [0.090754 0.094090) 
0.094090 1 [0.090754 0.094090) 
0.094367 1 [0.094090 0.094367) 
0.097019 1 [0.094367 0.097019) 
0.099336 1 [0.097019 0.099336) 
0.103765 1 [0.099336 0.103765) 
0.103961 1 [0.103765 0.103961) 
0.111674 0 [0.103961 0.123730) 
0.122609 0 [0.103961 0.123730) 
0.123730 1 [0.103961 0.123730)

So far, I am unable to write the code to perform this. So if anyone could help on this step, I would sincerely appreciate it.

Next, I wrote up the following code to get the output shown below. Note that column 2 is not what I want, but this is the best thing so far I could get.

  double array[8][MAX];
  double total = 100;
  for(int i = 0; i < MAX; i++) { 
    double start = 0;
    double count = 0;
    if(i) start = data[i - 1].x; 
    array[0][i] = data[i].x; 
    array[1][i] = data[i].y; 
    array[2][i] = start; 
    array[3][i] = data[i].x;
    array[4][0] = count;
    array[5][0] = count;
    array[6][0] = total;
    array[7][0] = 1;
    /*keep track of number of deaths and censors at each time t_i*/
    if (fmod(arr[1][i], 2.0) == 1)
      {arr[4][i+1]  = count + 1.0;
       arr[5][i+1]  = count;
      }
    else {arr[4][i+1] = count;
          arr[5][i+1] = count + 1.0;
         }

  return(0);
}

Sample Output

0.064295 1 [0.060493 0.064295) 1.000000 0.000000 191.000000 0.950000
0.070548 1 [0.064295 0.070548) 1.000000 0.000000 190.000000 0.945000
0.070850 1 [0.070548 0.070850) 1.000000 0.000000 189.000000 0.940000
0.071508 0 [0.070850 0.071508) 1.000000 0.000000 188.000000 0.940000
0.077981 1 [0.071508 0.077981) 0.000000 1.000000 187.000000 0.935000
0.086628 1 [0.077981 0.086628) 1.000000 0.000000 186.000000 0.929973
0.088239 1 [0.086628 0.088239) 1.000000 0.000000 185.000000 0.924946
0.090754 1 [0.088239 0.090754) 1.000000 0.000000 184.000000 0.919919
0.093260 0 [0.090754 0.093260) 1.000000 0.000000 183.000000 0.919919

Column 7 stands for KM estimator of survival distribution function. It was computed based on the following rules: 1. If the i-th entry in column 1 is 0, simply save the corresponding i-th entry in column 6 equal to the previous (i-1)th- entry in the same column. 2. If the i-th entry in column 1 is 1 but one or multiple successive entries before it is 0 (for example, the last entry of column 1 is followed right before by two 0s), we compute the corresponding i-th entry in column 6 with the formula: (i-1)-th entry*(1- 1/(j-th entry in column 5)) where the j-th entry in column 5 corresponds to the nearest entry 1 in column 1 (for example, the last 4 rows of column 1 has 1 0 0 1 in it, which implies the last entry in column 6 would be computed as 0.890096*(1-1/177) where 177 = the first entry in column 5 which has the corresponding entry in column 1 = 1 (rather than 0).

Task left to finish: First, I need to form the right column 2 so that for a random input t in the range of column 0, the code would give the corresponding result in column 6.

Second, I want to compute the variance of KM estimator, using this formula: S(t)^2*(summation over t_i <= t) d_i/(r_i*(r_i-d_i)),

where S(t) = the KM estimator computed at time t (column 7 above), d_i is the total number of deaths up to index i (so, sum of entries up to d_i of column 5 above), r_i = i-th entry in column 6. For example, if t = 0.071, then t_i only has 3 possible values based on Column 0 (t_i would be 0.064295, 0.070548 and 0.070850). I came up with the following working code (not sure if the output was the correct ones)

  N = [an integer]; #define size of array here
  double sigma[N];
  sigma[0] = 0;
  double sum[N];
  sum[0] = 0;
  for(int i=1; i< N; i++){
     sum[i] = sum[i-1] + (float)(arr[4][i]/(arr[6][i-1]*(arr[6][i])));
     sigma[i] = pow(arr[7][i],2)*sum[i];
     printf("%.0lf", sigma[i]);
  }

Sample Output

0.004775
0.004750
0.004725
0.004700
0.004675
0.004700
0.004650
0.004625
0.004600
0.004575
0.004600
0.004550
0.004525
0.004500
0.004475
0.004450
0.004425
0.004450
0.004450
0.004400
0.004375
0.004350
0.004325
0.004300
0.004275
0.004250
0.004225
0.004200
0.004175
0.004149
0.004124
0.004150
0.004099
0.004074
0.004100
0.004049
0.004024
0.004051
0.003999
0.003974
0.004001
0.003949
0.003976
0.003923
0.003898
0.003926
0.003873
0.003848
0.003823
0.003797
0.003772
0.003747
0.003775
0.003722
0.003750
0.003696
0.003725
0.003671
0.003700
0.003646
0.003676
0.003621
0.003595
0.003570
0.003544
0.003519
0.003549
0.003494

Upvotes: 3

Views: 116

Answers (1)

Barmak Shemirani
Barmak Shemirani

Reputation: 31599

This is a partial answer. First, lets declare the array as arr[MAX][8], that means you have MAX rows and 8 columns. This makes it easier to sort the data.

Next, lets create dummy data 0.100, 0.101, ... that's easier to look at it.

To find the 5th column, you can use an additional loop (for(int j = i; j < count; j++){...}) to find the next non-zero value.

We have to keep track of total dead counts (dead_count) and increment each time arr[i][1] is zero.

Kaplan-Meier formula is taken as 1 - (double)dead_count/(double)count

MCVE would look like:

#include <stdlib.h>
#include <stdio.h>

int compare_2d_array(const void *pa, const void *pb)
{
    double a = *(double*)pa;
    double b = *(double*)pb;
    if(a > b) return 1;
    if(a < b) return -1;
    return 0;
}

int main(void)
{
    double arr[][8] =
    {
        { 0.100, 1, 0, 0, 0, 0, 0 , 0 }, //initialize columns
        { 0.101, 1 }, // we can skip adding the zeros, it's done automatically
        { 0.102, 1 },
        { 0.103, 0 },
        { 0.104, 1 },
        { 0.105, 1 },
        { 0.106, 1 },
        { 0.107, 1 },
        { 0.108, 0 },
        { 0.109, 1 },
        { 0.110, 1 },
        { 0.111, 1 },
        { 0.112, 1 },
        { 0.113, 1 },
        { 0.114, 1 },
        { 0.115, 0 },
        { 0.116, 0 },
        { 0.117, 1 },
    };

    int count = sizeof(arr)/sizeof(*arr);

    //sort
    qsort(arr, count, sizeof(arr[0]), compare_2d_array);

    int dead_count = 0;
    for(int i = 0; i < count; i++)
    {
        double start = i ? arr[i - 1][0] : 0;
        double end = arr[i][0]; //<- I don't know what to use as default value!

        //if arr[i][1] is zero, then end should equal the next non-zero value
        double end;
        for(int j = i; j < count; j++)
        {
            end = arr[j][0];
            if(arr[j][1])
                break;
        }

        arr[i][2] = start;
        arr[i][3] = end;
        arr[i][4] = arr[i][1];
        arr[i][5] = !arr[i][1];

        if(!arr[i][1])
            dead_count++;

        printf("%3d %.6lf %.0lf [%.6lf %.6lf) %.0lf %.0lf %3d %.6lf\n", 
            i, 
            arr[i][0], 
            arr[i][1], 
            start,
            end, 
            arr[i][4], 
            arr[i][5], 
            count - i, 1 - (double)dead_count/(double)count );
    }

    return 0;
}

Output:

  0 0.100000 1 [0.000000 0.100000) 1 0  18 1.000000
  1 0.101000 1 [0.100000 0.101000) 1 0  17 1.000000
  2 0.102000 1 [0.101000 0.102000) 1 0  16 1.000000
  3 0.103000 0 [0.102000 0.104000) 0 1  15 0.944444
  4 0.104000 1 [0.103000 0.104000) 1 0  14 0.944444
  5 0.105000 1 [0.104000 0.105000) 1 0  13 0.944444
  6 0.106000 1 [0.105000 0.106000) 1 0  12 0.944444
  7 0.107000 1 [0.106000 0.107000) 1 0  11 0.944444
  8 0.108000 0 [0.107000 0.109000) 0 1  10 0.888889
  9 0.109000 1 [0.108000 0.109000) 1 0   9 0.888889
 10 0.110000 1 [0.109000 0.110000) 1 0   8 0.888889
 11 0.111000 1 [0.110000 0.111000) 1 0   7 0.888889
 12 0.112000 1 [0.111000 0.112000) 1 0   6 0.888889
 13 0.113000 1 [0.112000 0.113000) 1 0   5 0.888889
 14 0.114000 1 [0.113000 0.114000) 1 0   4 0.888889
 15 0.115000 0 [0.114000 0.117000) 0 1   3 0.833333
 16 0.116000 0 [0.115000 0.117000) 0 1   2 0.777778
 17 0.117000 1 [0.116000 0.117000) 1 0   1 0.777778

Upvotes: 1

Related Questions