Breaking downs SAS merges to simplify derivations
using EPOCH derivation as an example
By Bas van Bakel
INTRODUCTION
In this newsletter I would like to talk about a programming technique to combine data which I’ve found particularly useful in certain situations. The technique itself is not complex (it is mainly a different way to approach data merges than you may be used to), but it is very useful to break-up a complex merge into several non-complex steps.
A well-known example of a more complex merge with many imputations and assumptions is the assignment of the ELEMENT in which an adverse event started. This requires Clinical Programmers to combine and compare information from the Adverse Events and Subject Element datasets (e.g., comparing the AESTDTC with the SESTDTC and SEENDTC), often requiring them to make several assumptions and imputations in the programs/macros they generate. For example, which ELEMENT needs to be selected if the adverse event start date can fall in more than one ELEMENT? How are partial dates handled? Etcetera…
Using the assignment of ELEMENTs to Adverse Events as an example, I will illustrate how this programming technique can be used. You will see that the technique is easy to understand and follows a systematic, modular, approach that allows you to easily add/change or rearrange pieces of the derivations.
PREPARATION
The Subject Elements domain (SE) contains the EPOCHs/ELEMENTs of all subjects. As defined in the SDTM guidelines there are no gaps between the elements (meaning that every single point in time the subject is in the trial can be assigned to at least one element).
In order to ensure that all timepoints (from zero to infinity) are available, we can pre-process the SE domain and add PRE-STUDY and POST-STUDY elements for each subject:
Picture 1: Addition of PRE-STUDY and POST-STUDY elements
Now that the preprocessing on the SE data is done the starting point would be to apply a [Cartesian product (https://en.wikipedia.org/wiki/Cartesian_product)] (per subject) on the AE and SE datasets. This means that, per adverse event in the AE dataset, all elements in SE for the same subject will be attached:
Picture 2: Result of Cartesian product of SE and AE datasets.
To illustrate: each colored box in the picture above was a single input record in AE, but it is being duplicated for each element in the SE dataset. The variable UNIQUE is being added to keep track of which records are unique source records.
TECHNIQUE
After the above preparation steps are done, we get to the core of the problem: we need to determine which of the ELEMENTs attached to the AE record is the one to be assigned to the Adverse Event itself. In order words: for each unique adverse event we want to keep one and only one record.We will do this by adding conditions and rules to systemically drop records from the above Cartesian product until only one record remains per unique AE source record (using the variable UNIQUE):
Picture 3: Decision tree
In order to determine there is only one record remaining a small macro like below can be used:
The conditions themselves may differ per project or per source dataset. I will give some examples below, but they can be changed, deleted, rearranged and/or new conditions can be added. In all cases the process remains the same: the conditions are applied until only one record remains.
For example, we can have five rules ([A] up to [E]) as mentioned below:
[A]A first condition can be to compare the AESTDTC versus the SESTDTC and SEENDTC dates and remove the records where the AESTDTC cannot fall in the period defined by SESTDTC and SEENDTC (also taking partial dates into account):
Applying the above code on the example of picture 2 would already reduce the number of records to one because all AESTDTC dates can fall in only one ELEMENT (when comparing against SESTDTC and SESTDTC).
Picture 4: Result of applying rule A (date check) on the Cartesian product of SE and AE datasets. Please note that applying the add_doneflag to the resulting dataset results in DONEFLAG = 1 values.
In some cases, further processing needs to be done (doneflag = 0), because the AESTDTC is partial or is equal to a SESTDTC. In the below example you can see that the AESTDTC itself is not conclusive because more than one record remains per value in UNIQUE:
Picture 5: Example of data where more conditions are to be applied.
A next condition can be to check the AESTRTPT and AESTTPT variables and use them to delete the non-appropriate records. In picture 5 the information in these variables can be used to drop the element SCREENING, because the AESTDTC was marked as being AFTER VACCINATION).
[C]If dates and other timing variables are not conclusive and there can still be multiple elements assigned to an AESTDTC after the previous steps, one can prioritize certain elements. In this case we might consider dropping PRE-STUDY and POST-STUDY elements if there are other elements left. I.e., it is assumed that the event occurred within the trial period then.
[D]Similarly as in condition [C], we can let treatment elements take precedence. If there are both treatment and non-treatment elements, the non-treatment elements will be removed.
In picture 5, for subject 1017, the BACTERIAL INFECTION will therefore be assigned to the first element shown there, not to the follow-up element.
[E]If there can still be multiple elements assigned to an AESTDTC after the previous steps, the user can decide to take the first or the last record that is remaining.
You can add more conditions, (F, G, etc.), remove any of the conditions and/or change the order of the conditions as you see fit. There are a few things that need to be taken into account:
- Ensure that you start with the conditions that exclude situations that CANNOT happen (condition [A] and [B] in the example);
- Then add the conditions that prioritize certain records over others (condition [C] and [D] in the example);
- Ensure that you end with a condition that picks only ONE alternative (condition [E] in the example);
- When processing is done for all records, check that each unique source record has exactly 1 output record.
If you are in the situation where you need to apply similar merges and the set of rules may differ or are not yet known, try the above technique. It may prove to be very useful indeed.