The Bootstrap Method in SAS in a nutshell



Author: Ruurd Bennink, Sr. Analyst/Programmer, OCS Life Sciences

The idea behind the bootstrap method is to obtain as accurate an estimate as possible of the standard error of the mean (SE) when only limited data are available.

More detailed information about the bootstrap method can be found at https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

For example, suppose an interim analysis is performed with just 10 subjects and the primary parameter is a change from baseline. With so few subjects there is not enough information to determine an accurate estimate of the standard error of the mean. The bootstrap method randomly picks 10 observations from this dataset with replacement and repeats that 10, 20, 100 or 1000 times. The more often this is repeated, the more stable the resulting estimate of the standard error of the mean becomes.

Below is the actual SAS program, with comments included:


/* Create a dataset _ORIG with 10 random numbers */

DATA _orig;
  DROP j;
  DO j = 1 TO 10;
    x = Ranuni(1); /* Random numbers ranging from 0 to 1 from the uniform
                      distribution. Other distributions, e.g. the normal
                      distribution, are also fine. Because the seed is > 0,
                      each run creates the same dataset as the previous run */
    z = j; /* Marker for the jth observation. This makes it easier to
              identify which observations occur more than once in the
              dataset BOOT&j in the next DATA step */
    OUTPUT;
  END;
RUN;
%macro bootstrap;
  %do j=1 %to 1000; /* Create 1000 datasets, each with 10 observations sampled
                       with replacement from dataset _ORIG */
    DATA boot&j;
      /* Randomly pick an observation number 10 times, with replacement,
         from the dataset _ORIG */
      DO obsi = 1 TO 10;
        k = Ceil(ranuni(0)*10);
          /* Use the CEIL function. If the ROUND function is used, 0 is also
             a possible outcome and the 0th observation does not exist,
             but no ERROR/WARNING will appear in the SAS log! In addition,
             the 10th observation would then be selected with a probability
             of only 5% instead of 10%, so the selection is no longer uniform.
             Using seed = 0 means that the time of day is used to initialize
             the seed stream. If a seed > 0 is used, all datasets BOOT&j will
             be identical!
             Use only the uniform distribution here, so that the observation
             numbers 1 to 10 have equal probability of being selected. */

        SET _orig Point=k;
          /* The POINT= option reads the kth observation directly.
             Because k is a new random number from 1 to 10 on every
             iteration of the 'obsi' loop, some observations may be selected
             more than once, which reflects the resampling 'with replacement'
             element of the bootstrap method. */

        seqnum = &j; /* Identifies the dataset; used later as BY
                        variable for e.g. PROC MEANS */

        OUTPUT;
      END;

      STOP; /* Mandatory with the POINT= option; without it the DATA step
               never reaches an end of file and loops forever */

    RUN;
  %end;

%mend bootstrap;

%bootstrap;
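
As an aside, the same resampling with replacement can probably be done in a single step with PROC SURVEYSELECT instead of 1000 separate DATA steps. The sketch below is an alternative, not the method used in the program above. It assumes SAS/STAT is licensed and that the dataset _ORIG from above exists. The output dataset name BOOT_ALL and the seed value are arbitrary choices made for illustration.

```sas
/* Alternative sketch: draw all bootstrap samples in one step.
   BOOT_ALL and the seed 12345 are hypothetical choices. */
PROC SURVEYSELECT DATA=_orig OUT=boot_all
     METHOD=URS    /* unrestricted random sampling = with replacement */
     SAMPRATE=1    /* each sample is as large as _ORIG (10 observations) */
     REPS=1000     /* number of bootstrap samples */
     OUTHITS       /* write one row per selection, so that an observation
                      picked twice also appears twice */
     SEED=12345    /* fixed seed > 0 makes the run reproducible */
     NOPRINT;
RUN;
```

In the output dataset the variable Replicate numbers the samples from 1 to 1000, so it plays the same role as seqnum in the macro approach.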

As a next step, PROC MEANS can be used to estimate the mean of each dataset BOOT&j. The standard deviation of those means is then the bootstrap estimate of the standard error of the mean. Another possibility is to append the datasets BOOT1 to BOOT1000 and use PROC MEANS with a BY statement (BY seqnum;):

%macro append;
  /* Generate the dataset list BOOT1 BOOT2 ... BOOT1000 */
  %do n=1 %to 1000; boot&n %end;
%mend append;

DATA total;
  SET %append;
RUN;
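
The PROC MEANS step described above could look roughly like the sketch below. It assumes the dataset TOTAL created above, with the variables x and seqnum. The names BOOT_MEANS and mean_x are hypothetical and chosen only for this illustration.

```sas
/* Step 1: one mean per bootstrap sample.
   TOTAL is already ordered by seqnum, because BOOT1 to BOOT1000 were
   appended in that order, so no PROC SORT is needed before the BY. */
PROC MEANS DATA=total NOPRINT;
  BY seqnum;
  VAR x;
  OUTPUT OUT=boot_means MEAN=mean_x; /* BOOT_MEANS, mean_x: hypothetical names */
RUN;

/* Step 2: the standard deviation of the 1000 bootstrap means is the
   bootstrap estimate of the standard error of the mean. */
PROC MEANS DATA=boot_means N MEAN STD;
  VAR mean_x;
RUN;
```

For comparison, running PROC MEANS with the STDERR option on _ORIG gives the classical estimate (standard deviation divided by the square root of n), which can be set next to the bootstrap result.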