Clinical Applications (Supplemental)
Introduction
The general purpose of this document is to provide clinicians with the basic tools to begin understanding the implications of the proposed Bayesian model. The document is intended to be very practically oriented, with full code examples and general code templates so that clinicians who are new to Bayesian models, R, and Stan can still utilize the model. Toward this end, excerpts of code are provided throughout this document with comments, which are lines of code that begin with the # symbol. These code excerpts can be copied directly into an R environment, and the comments can be retained or deleted without affecting the end result. For clinicians who are unfamiliar with R, it is highly recommended that RStudio be used instead of base R.
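As a quick illustration of how comments work, the two lines below can be pasted into the R console; everything after the # symbol is ignored by R, so the result is the same whether the comments are kept or deleted.
#this entire line is a comment and is ignored by R
2 + 2 #everything after the # symbol is also ignored, so this line simply returns 4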
Setting Up the Model
The very first thing that needs to be done is to make sure that all of the needed R packages are ready to use in the R environment. If this is the very first time that these packages are being used, either because R has just been downloaded or because R has been updated, then these libraries have to be installed first. These libraries can be installed as follows:
install.packages(c("rstan", "bayesplot", "shinystan", "bayestestR", "see"), dependencies = TRUE)
After the packages have been installed, they just need to be loaded every time that a new R session is started (including the session in which the packages were downloaded). Loading the packages is similarly easy:
library(rstan)
library(bayesplot)
library(shinystan)
library(bayestestR)
library(see)
Once the appropriate packages are installed and loaded, the model can be prepared and run.
Creating the Stan Environment
It is recommended that users consider using a single R environment for every use of the model. The primary reason for this is that RStan will have to compile the model into C++ every time that a new R environment is used. Compiling the model does not take a long time, but it does typically take a few minutes. If the model will be used routinely or even semi-regularly, those minutes of compiling time can add up. So, if you want to have a standing file that has the RStan model already defined and ready for use, then the following code only needs to be run once. If you do not want to or cannot save the file and are fine with re-compiling the code every time, then the following code would need to be run every time that a new R environment is opened.
Stan models begin with a text string. Note that R is an object-based language, meaning that we perform various functions and save them as named objects that can then be called later. Objects in R are defined either by <- or =, but for consistency, the <- arrow indicator will be used throughout this code. The model text is as follows:
Model <- '
data {
int<lower=1> pop; // number of populations
int<lower=0> dim; // number of dimensions
vector[dim] raw_x; // obtained scores
row_vector[dim] ref_mn[pop]; // sample means
row_vector<lower=0>[dim] ref_stdv[pop]; // sample SDs
matrix[dim, dim] cor_mat[pop]; // intercorrelation matrix
real<lower=1> ref_N[pop]; // input sample size
real<lower=0> eta[pop]; // LKJ prior eta
vector[pop] theta; // pre-test probabilities
}
transformed data{
real<lower=1> nu[pop]; // degrees of freedom
cholesky_factor_corr[dim] L[pop]; // correlation matrix
for(i in 1:pop)
nu[i] = ref_N[i] - 1; // compute degrees of freedom
for(i in 1:pop)
L[i] = cholesky_decompose(cor_mat[i]); // define the correlation matrix from input
}
parameters {
row_vector[dim] mu[pop]; // population mean
row_vector<lower=0>[dim] sigma2[pop]; // population variance
}
transformed parameters {
row_vector[dim] sigma[pop]; // population standard deviation
row_vector[dim] SEmn[pop]; // standard error of mean
row_vector[dim] means[pop]; // compute scaled means
row_vector[dim] Zscore[pop]; // scaled score
vector[pop] log_theta = log(theta); // log pre-test probabilities (updated to post-test below)
for(p in 1:pop)
for(j in 1:dim)
sigma[p, j] = sqrt(((ref_N[p]-1)*(ref_stdv[p, j]^2))/sigma2[p, j]); // compute standard deviation
for(p in 1:pop)
for(j in 1:dim)
SEmn[p, j] = sigma[p, j]/sqrt(ref_N[p]); // compute standard error of the mean
for(p in 1:pop)
for(j in 1:dim)
means[p, j] = (mu[p, j]*SEmn[p, j]) + ref_mn[p, j]; // rescale means vector
for(p in 1:pop)
for(j in 1:dim)
Zscore[p, j] = (raw_x[j] - means[p, j]) / sigma[p, j]; // compute scaled scores
for(p in 1:pop)
log_theta[p] += multi_normal_cholesky_lpdf(Zscore[p, ] | to_vector(rep_array(0.0, dim)), L[p]); // likelihood
}
model {
for(p in 1:pop)
target += chi_square_lpdf(sigma2[p, ] | nu[p]); // estimated population SD
for(p in 1:pop)
target += normal_lpdf(mu[p, ] | 0, 1); // estimated population means
for(p in 1:pop)
target += lkj_corr_cholesky_lpdf(L[p] | eta[p]); // correlation matrix
target += log_sum_exp(log_theta);
}
generated quantities {
vector[dim] rScore[pop];
vector[pop] postProb;
for(o in 1:pop)
rScore[o] = multi_normal_rng(means[o, ], quad_form_diag(multiply_lower_tri_self_transpose(L[o]), sigma[o, ])); // simulate raw scores
postProb = softmax(log_theta);
}
'
Note that at the top left of this code is the word “Model” followed by <- and then a large chunk of text between two single quotes (' ... text ... '). The text in quotes is the actual model, and the arrow indicates that this text will be saved as the object “Model” in the R environment. To demonstrate what this means, the following code would print out what is contained in this new “Model” object:
#print out what is contained in the Model object
cat(Model)
For convenience, there are comments in the model text. In Stan, these comments are marked by // instead of the # used in R. The text to the right of // is a short description of what each line of code is doing, but these descriptions are entirely unnecessary for the model itself.
Once the model text is added, it is now time to convert this text to a compiled C++ object, which fortunately RStan does for us. The following code compiles this model:
#convert syntax into Stan model (this will take a couple minutes)
stan_model <- stan_model(model_code = Model, model_name="stan_model")
In this case, we are creating an object called “stan_model” that is the result of a special function from RStan that is itself called stan_model. This function compiles the actual model (the model is named “stan_model” per the model_name="stan_model" argument). Note that there is no inherent need for naming the text object “Model” or the compiled Stan model “stan_model.” These names could be changed to anything. Readers are cautioned to be mindful of such name changes, however. All of the following code will call on these objects as named above, so if a user changes these names, then later code will need to be adjusted to match the new names.
After a few minutes, the model should finish compiling. If you intend on using the same R environment every single time, then you will not need to repeat these steps again.
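If the compiled model is going to be reused across sessions, it can also help to save it so that the compilation step is skipped next time. The sketch below shows two common ways of doing this with base R functions; the file names are placeholders and can be anything.
#save the entire R environment (every object, including the compiled model)
save.image("bayes_model_workspace.RData")

#or save only the compiled model object
saveRDS(stan_model, "compiled_stan_model.rds")

#in a later R session, either of these restores the compiled model without recompiling
#load("bayes_model_workspace.RData")
#stan_model <- readRDS("compiled_stan_model.rds")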
Inputting Data
In the above text for the model, there is a block that defines the data (data{ ... text ... }). This block is aptly named the data block and tells Stan what the input data should look like. Starting at the top of this block and going down that column of text, the model expects a positive integer of at least 1 (int<lower=1>) telling it the number of populations to compare, a non-negative integer (int<lower=0>) corresponding to the number of tests (or dimensions) that are being entered, a vector of obtained scores, a list of row vectors of sample means corresponding to these tests, a list of row vectors of sample standard deviations corresponding to the tests, a list of correlation matrices for these tests, the number of individuals used in these samples, the prior that will be used for the correlation matrix, and the pre-test probabilities for each population.
To contextualize these values, say that it is desired to examine WAIS scores. Say as well that we are interested in 5 primary diagnostic possibilities, so in this case, we would need to specify pop <- 5. Assuming we administer the core subtests only, there are 10 tests of interest, so when we run the model, we will need to specify that dim <- 10. It is difficult to interpret WAIS scores if there are no scores, so clearly an individual has been administered the WAIS and obtained certain scaled scores as a result. These specific results will be entered as a vector, for example raw_x <- c(10, 12, 8, 13, 9, 12, 11, 10, 8, 10). Let's say that the first reference sample that we are interested in evaluating is just the normative sample. In this case, the reference sample means are all 10 with standard deviations of 3: ref_mn <- c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10) and ref_stdv <- c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3). There would then need to be 4 other reference means and standard deviations to enter since we defined the number of populations as 5 (i.e., normal + 4 other possibilities). In the event that the reference means or standard deviations are the same for every test (as in this particular case), an easier way of specifying this is to use the rep function, which is short for repeat. In this case, for example, we could do the following: ref_mn <- rep(10, dim) and ref_stdv <- rep(3, dim). These statements tell R to repeat the values 10 and 3, respectively, however many times dim has been defined as (as shown earlier in this paragraph, dim = 10 in this case). The next required piece of information is the correlation matrix for these 10 tests. This information is usually available in the test manual. A full entry is shown later, so for the sake of space, it is not repeated here. The last piece of information about the reference sample being used is just its N. For illustration purposes, we will assume that our particular age-based cell in the normative manual was based on 100 individuals: ref_N <- 100. The next data requirement is unrelated to the sample and individual. Currently, there is not sufficient evidence on what \(\eta\) values may be best to use in which circumstances, so it is recommended that this just be defined as 1 routinely: eta <- 1. Finally, if there is any pre-test probability information to include, then this can be specified by passing a vector to the theta argument. In this case, let's say that we have no a priori reason to believe that any single diagnosis is more likely than the others. We can specify this uniform pre-test probability as follows: theta <- rep(1/5, 5).
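Collected in one place, the example assignments from the preceding paragraph look like the sketch below. This covers only the normative reference sample; the correlation matrix and the values for the other four reference samples would still need to be entered (using the template that follows) before the model could actually be run.
pop <- 5                                        #number of diagnostic populations
dim <- 10                                       #number of WAIS core subtests
raw_x <- c(10, 12, 8, 13, 9, 12, 11, 10, 8, 10) #obtained scaled scores
ref_mn <- rep(10, dim)                          #normative sample means
ref_stdv <- rep(3, dim)                         #normative sample SDs
ref_N <- 100                                    #normative sample size
eta <- 1                                        #LKJ prior
theta <- rep(1/5, 5)                            #uniform pre-test probabilities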
The following code provides an empty template for entering in data as described above. Comments are present to help clarify what information would be added. These comments should be deleted and replaced with the needed values.
pop <- #number of diagnostic populations
dim <- #number of tests
raw_x <- c(#score on test 1, #score on test 2, #score on test 3, ...)
ref_mn <- rbind(
  c(#sample mean on test 1 in Population 1, #sample mean on test 2 in Population 1, ...),
  c(#sample mean on test 1 in Population 2, #sample mean on test 2 in Population 2, ...),
  ...)
ref_stdv <- rbind(
  c(#sample SD on test 1 in Population 1, #sample SD on test 2 in Population 1, ...),
  c(#sample SD on test 1 in Population 2, #sample SD on test 2 in Population 2, ...),
  ...)
cor_mat <- list(matrix(c(#1, A, B, C,
                         #A, 1, D, E,
                         #B, D, 1, F,
                         #C, E, F, 1),
                       nrow = dim,
                       ncol = dim,
                       byrow = TRUE),
                matrix(c(#1, G, H, I,
                         #G, 1, J, K,
                         #H, J, 1, L,
                         #I, K, L, 1),
                       nrow = dim,
                       ncol = dim,
                       byrow = TRUE),
                ...)
ref_N <- c(#sample size for Sample 1, #sample size for Sample 2, ...)
eta <- rep(1, pop)
theta <- rep(1/pop, pop) #for uniform pre-test probabilities
Since these data need to be declared for every single reference sample that we wish to use for comparison, it can be a bit difficult to get data ready in R. Clinicians are likely to have certain tests, studies, and norms that they refer to often. As a result, it could be helpful to have a series of objects that hold these reference data (i.e., means, standard deviations, and correlation matrices) ready for future use. In this case, a clinician may put in some upfront time to enter these data from a variety of resources, but then these objects can be called up without needing to re-enter them later, which may save time in the long run. In these cases, instead of generic names like ref_mn, objects may be named in such a way as to be meaningful to the clinician. Below is an example where information about the RBANS normative sample is saved as a set of objects that could then be called at any time without having to re-enter the data:
RBANS_dim <- 6
RBANS_mn <- rep(100, RBANS_dim)
RBANS_sd <- rep(15, RBANS_dim)
RBANS_cor <- matrix(c(1.00, 0.29, 0.37, 0.47, 0.64, 0.78,
                      0.29, 1.00, 0.29, 0.31, 0.34, 0.63,
                      0.37, 0.29, 1.00, 0.38, 0.32, 0.66,
                      0.47, 0.31, 0.39, 1.00, 0.40, 0.72,
                      0.64, 0.34, 0.32, 0.40, 1.00, 0.76,
                      0.78, 0.63, 0.66, 0.72, 0.76, 1.00),
                    nrow = RBANS_dim,
                    ncol = RBANS_dim,
                    byrow = TRUE,
                    dimnames = list(c("IM", "VC", "AT", "LA", "DM", "TS"), c("IM", "VC", "AT", "LA", "DM", "TS")))
#the dimnames argument specifies the names of the rows and columns
#the first character string given by c("text" ... "text") is the row names
#the second string given by c("text" ... "text") is the column names
RBANS_N <- 540
Using the above as an example, whenever the RBANS normative sample is wanted as one of the comparisons, then rather than retyping all of the above information, these objects would already be in the R environment (assuming the environment was saved and is being used again) and could therefore be called directly.
An important caveat about data entry is that it is assumed that all of the reference data are entered in the same order of tests as the client’s raw scores are entered. In the case of the RBANS, if the raw scores were entered for the Immediate Memory Index, then the Visuospatial/Construction, then the Attention, then Language, then Delayed Memory, and lastly Total Scale, then the reference means, standard deviations, and correlation matrix need to match that exact same order from left to right. There is no way that R will know that the tests are out of order, so the results could be significantly impacted if the order of these entries is not exactly the same.
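One simple safeguard, sketched below with made-up RBANS index scores, is to label the obtained scores with the same abbreviations used for the stored reference data and confirm that the ordering matches before running the model (the names attached to the vector are ignored by Stan, so they are purely for the clinician's benefit).
#hypothetical obtained RBANS index scores, labeled in the intended test order
raw_x_RBANS <- c(IM = 85, VC = 94, AT = 98, LA = 90, DM = 82, TS = 86)

#confirm that the ordering matches the stored correlation matrix before running the model
identical(names(raw_x_RBANS), colnames(RBANS_cor)) #should return TRUE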
Running the Model
Now that the data are defined, these data objects can be passed to Stan in order to run the model.
It is recommended that users run multiple chains of sampling. These multiple runs are not necessary for the model to work; however, they are useful for checking whether or not the sampling worked as needed. In the future, sufficient research on the stability of the model may render this unnecessary entirely, but since the model is still being developed, it is better to be cautious. Four chains are considered to be the minimum number of chains to run, but users could add more if so desired (note that there is no accuracy or performance improvement to adding more chains). To speed up the posterior sampling over these chains, it is recommended to run these chains in parallel across multiple processors. If you know how many processors your computer has available, then it is possible to specify this number directly; however, the following code detects the number of available cores and automates the determination of how to handle parallelization of the chains.
#make sure model runs as quickly as possible
options(mc.cores = parallel::detectCores())
If the model is run on 4 chains, then specifying more than 4 cores (i.e., options(mc.cores = 12) versus options(mc.cores = 4)) does not result in a speed increase. Running the model is relatively straightforward. The following code is the blank template that can be customized for each reference sample of interest.
stan_sample <- sampling(stan_model, data = list(pop = pop, dim = dim, raw_x = raw_x, ref_mn = ref_mn, ref_stdv = ref_stdv, cor_mat = cor_mat, ref_N = ref_N, eta = eta, theta = theta))
Assuming that the data are entered using the general template provided above, there is no need to change any of the names of objects in this code. In the event that a user has a special set of frequently used reference samples, the names can be fairly easily replaced. The following is an example that continues with the RBANS reference sample illustrated earlier.
stan_RBANS <- sampling(stan_model, data = list(pop = pop, dim = RBANS_dim, raw_x = raw_x, ref_mn = RBANS_mn, ref_stdv = RBANS_sd, cor_mat = RBANS_cor, ref_N = RBANS_N, eta = eta, theta = theta))
In R, there is no warning that an object is about to be overwritten. Operating systems like Windows make us acutely aware anytime that we try to save a document with the same name as an existing one, but R has no such good graces. As a result, if we have multiple reference samples but enter each of these samples' means under the same object name ref_mn, then whichever sample was entered last will be what is saved as ref_mn, and there will be no trace of any of the previous ones. This is important for case-by-case use since if one client's data are overwritten by another's, then there is no way to go back to the old data unless they are re-entered. This is particularly relevant for defining common tests/batteries or diagnostic contrasts since it could result in a lot of redundant work if ref_mn is re-defined each time versus giving it a special name (e.g., RBANS_mn). Similarly, if a client's data need to be run again or checked later, then there is no way to cycle back to them unless the results were saved as a specific object in the R environment (e.g., `9023856_10-05-2019` <- sampling(...), where the backticks allow a non-standard object name, the first digits may refer to an MRN, and the assessment date follows).
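If keeping every fitted object in the R environment is not desirable, a simple alternative is to write results to disk under a client-specific file name and read them back in only when needed. The sketch below uses base R's saveRDS() and readRDS(); the file name is a hypothetical example.
#save a fitted object to disk under a client-specific name
saveRDS(stan_sample, file = "client_9023856_2019-10-05.rds")

#read the saved results back in during a later session
stan_sample_reviewed <- readRDS("client_9023856_2019-10-05.rds")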
Over time, it is possible to accumulate large numbers of objects stored in the R environment, particularly depending on one's needs and naming conventions. There is nothing inherently wrong with this, but if the models are run repeatedly in the same R environment, then old and new objects may quickly pile up, causing the file to become very large and the environment messy. A good practice therefore would be to retain a few commonly used objects but clear out any that are not going to be used anymore. Objects can be removed/deleted from R with the function rm(). Here's an example:
rm(ref_mn1, ref_mn2, ref_mn3,
ref_stdv1, ref_stdv2, ref_stdv3,
cor_mat1, cor_mat2, cor_mat3, ref_N1, ref_N2, ref_N3)
Another option for keeping the environment tidy is to condense commonly used reference samples into lists and then delete the individual objects. Here's an example of this option:
#create new lists and then delete the old objects
Means <- list("RBANS" = RBANS_mn, "WAIS" = WAIS_mn, ...)
StDvs <- list("RBANS" = RBANS_sd, "WAIS" = WAIS_sd, ...)
CorMats <- list("RBANS" = RBANS_cor, "WAIS" = WAIS_cor, ...)
SampleSize <- list("RBANS" = RBANS_N, "WAIS" = WAIS_N, ...)

rm(RBANS_mn, WAIS_mn, RBANS_sd, WAIS_sd, RBANS_cor, WAIS_cor, RBANS_N, WAIS_N)

#extract a value from a list
Means$RBANS #equivalent to having called RBANS_mn
StDvs$RBANS #equivalent to having called RBANS_sd

#add new object to an existing list
Means[["CVLT"]] <- CVLT_mn
rm(CVLT_mn)

#add new reference without first creating a new object
Means[["CVLT"]] <- c(#mean Trial 1 scaled score, ...)
Clinical Interpretations
Every Stan object returned from the sampling function will contain the z-scores estimated for each obtained score based on whatever reference sample was used. In order to see these z-scores, you need to request a summary of the object.
summary(stan_sample)
If the model was run with multiple chains, then this will provide a summary for each chain and then for the combined chains. Generally speaking, there is no reason to look at the values for each individual chain, so the following simplifies the output. First, it returns just the combined chains results. Second, it returns the estimates for only the z-scores as opposed to the other parameters that the model estimates (e.g., the population mean) as these are not usually relevant for clinical interpretations. Third, it shows the point estimate for each z-score, the standard deviation of the posterior distributions, the lower and upper bounds for the 95% credible interval, and some model diagnostics (see next section for details).
summary(stan_sample, pars = "Zscore")$summary[, c(1, 3:4, 8:10)]
It can also be helpful sometimes to visualize the distribution of these scores. A plot for the posterior distribution of the z-scores can be called with the following code:
plot(stan_sample, show_density = TRUE, ci_level = 0.80, outer_level = 0.95, pars = "Zscore")
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
You will note in these plots that the Zscore parameter is indexed by two values, which in this case are 1 and then another integer between 1 and 10. In cases where more than one population is fit at a time, the first number in the bracket will index the z-scores from that population while the second number in the bracket always indexes the test number. So, for illustration, the Zscore[1, 5] data corresponds to the z-score for the 5th test in the first population.
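If only one of these estimates is of interest, it can be pulled out of the summary by its row name rather than scanning the full table. A brief sketch (the specific index here is just an example):
#pull the summary row for the 5th test in the first population by its name
summary(stan_sample, pars = "Zscore")$summary["Zscore[1,5]", c(1, 3:4, 8:10)]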
Checking the Model
Using the above code example, the final two columns reported will be n_eff and Rhat, which correspond to effective sample size and a convergence diagnostic, respectively. With regard to effective sample size, the larger the number the better. The effective sample size should be interpreted in the context of the number of samples run in total. Using the values specified in the Stan sampling template, the posterior contains 4,000 samples, so we should see similarly large effective sample sizes. When the effective sample size is small, this suggests that the model did not run efficiently. This would most likely be a sign that the data were entered incorrectly, but it could also occur in a case in which the model is severely misspecified. Most users should not see an issue with effective sample size using this model, and such an error would likely be the product of a data entry error.
The \(\hat{R}\) value is a measure of the way in which the chains have mixed. If only a single chain was run, then there will be no estimate here (and the user will receive an out of bounds error message from R if the summary code given earlier is used since the 10th column corresponds to this statistic). When there is good mixing of the chains, the value for \(\hat{R}\) should be 1. When \(\hat{R}\) exceeds 1.01, the chains are considered to have not mixed properly. Failure for the chains to mix suggests that the posterior space was not adequately characterized, which threatens the validity of the model’s results.
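As a quick programmatic check, the diagnostic columns can be pulled out and screened against rough rules of thumb; the effective sample size cutoff of a few hundred used below is only a heuristic, not a value taken from the main text.
#pull the diagnostic columns for the z-scores
fit_checks <- summary(stan_sample, pars = "Zscore")$summary[, c("n_eff", "Rhat")]

#flag any z-score whose chains mixed poorly or whose effective sample size is small
fit_checks[fit_checks[, "Rhat"] > 1.01 | fit_checks[, "n_eff"] < 400, , drop = FALSE]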
\(\hat{R}\) is a simple indicator of mixing. It is important to also check the trace plots of the chains. These plots should look like fuzzy caterpillars or very tightly constricted seismometer graphs. An example of this plot for the z-scores is shown below.
plot(stan_sample, plotfun = "trace", pars = "Zscore")
In the event that there is evidence that the model may not have stabilized (i.e., had good mixing), the results from the model should not be interpreted. Additionally, the package shinystan provides a wide variety of very helpful diagnostic plots and tables to help further identify potential issues with the model. A Stan object can be viewed in this package as follows: shiny <- as.shinystan(stan_sample) and then launch_shinystan(shiny).
Pre-test Probabilities
A powerful advantage of Bayesian methods in diagnostic modeling is the ability to utilize pre-test probabilities to arrive at individualized post-test probabilities of diagnosis/disease. In the best case scenario, a client may present with a criterion-based pre-test risk. For example, the individual may have been administered a cognitive screening test like the MoCA and failed it at a certain threshold. This information could be converted to a pre-test probability for cognitive impairment, or in the presence of additional information, for certain etiologies of impairment. Alternatively, an individual’s probability of belonging to one of two diagnostic groups may be estimated using a logistic regression before the neuropsychological tests are interpreted. The results of testing could then be combined with the pre-test probability estimate from the logistic regression. In the simplest case, however, the pre-test probability is just the prevalence of the diagnosis given the relevant demographic traits of the individual.
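In that simplest case, the pre-test probabilities can be built directly from base rates. The sketch below uses made-up prevalence values for three diagnostic possibilities purely for illustration; the only requirement is that the resulting theta vector sums to 1.
#hypothetical base rates for cognitively normal, MCI, and AD given the client's demographics
prevalence <- c(normal = 0.70, MCI = 0.20, AD = 0.10)

#rescale so the pre-test probabilities sum to 1 and pass the result as theta
theta <- as.numeric(prevalence / sum(prevalence))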
Specifying pre-test probabilities is very simple in the model syntax as it involves just changing the theta argument: theta <- c(1/3, 1/3, 1/3) versus theta <- c(1/2, 1/4, 1/4). Where this becomes particularly relevant, however, is when we must “chain” several tests in the battery together. In the simulation study, we had data for the RBANS and the WAIS across 3 shared conditions (normal cognition, MCI, and AD), but there wasn't a single data source for the RBANS + WAIS in all three conditions. As a result, we had to examine the WAIS in those 3 conditions first, extract the new post-test probabilities from that analysis, and then examine the RBANS in those 3 conditions. In this case, pre-test probabilities based on the prevalence of the various diagnoses given the “client's” age were updated by the WAIS subtests (resulting in a post-test probability based on WAIS performance), and those post-test probabilities were then treated as pre-test probabilities for the RBANS, which were in turn updated into the final set of post-test probabilities. This chaining of pre-/post-test probabilities and updating of diagnostic beliefs can be repeated an arbitrary number of times. The following code demonstrates how to do this within R.
#run the sampler and estimate the posteriors
#(assumes uniform pre-test probability over 3 populations)
stan_sample <- sampling(stan_model, data = list(..., theta = c(1/3, 1/3, 1/3)))

#update theta
theta <- summary(stan_sample, pars = "postProb")$summary[, 1]

#run next iteration of the sampler with new pre-test probabilities
stan_sample <- sampling(stan_model, data = list(..., theta = theta))
Posterior Hypothesis Testing
The model text at the very beginning of this document ends with a block for generating posterior predictive distributions. This block is called the generated quantities block, and in this case, there are only two values being estimated: postProb and rScore. The former was discussed above as it reflects the post-test probability estimates from the model. The only additional point to mention here is that there is a distribution of post-test probabilities, meaning that we can conduct inference on these data (e.g., test whether a post-test probability is significantly different from zero). Caution should be used when making inferences about the post-test probabilities in cases where models are linked/chained together, as only point estimates (and not the full distribution) of the post-test probabilities are passed forward, meaning that the uncertainty in the post-test probabilities is not propagated through the model at each step.
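Because postProb is estimated with a full posterior distribution, it can be summarized and examined just like any other parameter. A brief sketch:
#point estimates and 95% credible intervals for the post-test probabilities
summary(stan_sample, pars = "postProb")$summary[, c(1, 4, 8)]

#or work with the full set of posterior draws directly
postProb_draws <- as.data.frame(extract(stan_sample, pars = "postProb"))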
The latter value being estimated, rScore, corresponds to the model's estimate of what the test scores look like in the population. The other R Markdown document for this study concludes with some visualizations of the posterior distribution for reference. These posterior estimates can be extracted from the Stan object and then used to explore additional questions. For example, it is often of interest whether or not the difference between two scores is significant and how common such a difference is. The posterior distribution samples can provide some information about such questions. The following code demonstrates how to visualize the difference between two test scores using the simulated WAIS data.
#extract the raw score posterior estimates
posterior_dist <- as.data.frame(extract(stan_sample, pars = "rScore"))
#view a histogram of the difference between Similarities and Vocabulary
hist(c(round(posterior_dist[, 2]) - round(posterior_dist[, 5])), main = "Difference between Similarities and Vocabulary", xlab = "Scaled Score Difference")
As is apparent from the plot of the difference, a fair amount of the probability mass lies at no difference between the scores; however, on the whole, it is more likely that the Vocabulary scaled score is higher than the Similarities scaled score. It is also possible to determine the percentile ranks of differences. For example, the following code finds the empirical percentile for differences less than or equal to -3, 0, and 3 points:
#convert the differences into percentiles
posterior_contrast <- round(posterior_dist[, 2]) - round(posterior_dist[, 5])
#find the percentile for a difference of -3
ecdf(posterior_contrast)(-3)
## [1] 0.12075
#find the percentile for a difference of 0
ecdf(posterior_contrast)(0)
## [1] 0.56675
#find the percentile for a difference of 3
ecdf(posterior_contrast)(3)
## [1] 0.923
If one is not interested in the percentiles but instead wants to know critical values for certain contrasts, then these can also be requested. For example, if one wants to know whether a score is within the 95% credible interval, then the following code can be used:
quantile(posterior_contrast, c(0.025, 0.975))
## 2.5% 97.5%
## -4 5
Using the results above, if someone had a Similarities and Vocabulary discrepancy of -5, then this would be considered statistically significant at the .05 level. Just for transparency, the WAIS critical values are not based on the sampling distributions of the statistics but instead on the measurement distributions (i.e., they take into account the standard error of measurement). As a result, these critical values likely do not equate to those reported in the manual as they are answering a different question. The model does not currently take into consideration measurement error.
Using these methods of inference on the posterior distribution, it is possible to arrive at a general idea of how rare certain differences may be in the population. This method has unique potential, but there is additional research needed to clarify whether the posterior predictive distributions convey the same information in clinical data as they have in simulated normal distributions.
Another possibility that may be of interest to clinicians is formally testing the amount of evidence for one hypothesis over another. So far, we have focused on point estimates of the raw scores; however, better inference can be conducted on the z-scores since the model produces a distribution of credible z-scores, reflecting uncertainty about the true value of this metric. We will highlight two different examples of how to conduct inference on these z-scores. The first is a more traditional statistical inference where we are concerned simply with whether certain values occur and their probability of occurring. The second is more Bayesian-specific and allows for quantification of evidence for certain hypotheses. To illustrate how to do this inference, we first need to check what the z-scores for the model that we have been examining actually are. The code below will provide a simple summary (mean + 95% credible interval bounds) of the estimated z-scores for every population entered into the model (in this example, only a single population was used, so only one set of z-scores is returned):
summary(stan_sample, pars = "Zscore")$summary[, c(1, 4, 8)]
## mean 2.5% 97.5%
## Zscore[1,1] 0.971674508 0.7321868 1.2144992
## Zscore[1,2] -0.373599971 -0.5706510 -0.1778445
## Zscore[1,3] 0.069104929 -0.1258885 0.2643925
## Zscore[1,4] -0.508473758 -0.7113047 -0.2962591
## Zscore[1,5] -0.326399214 -0.5221226 -0.1298804
## Zscore[1,6] -0.008147266 -0.2017064 0.1848201
## Zscore[1,7] 1.238107437 0.9742185 1.4961545
## Zscore[1,8] 0.684942524 0.4683809 0.8976411
## Zscore[1,9] -1.438513220 -1.7165940 -1.1611550
## Zscore[1,10] 1.042644724 0.8059517 1.2853731
This particular simulated “individual” had a markedly poorer performance on Information than on the other subtests, so we will focus on this subtest for our inferences. The first thing to notice is that the 95% credible interval for Information does not include 0, so we can probably safely reject the null hypothesis that the performance is “average” (i.e., a z-score of exactly zero). We can, however, go even further in our inferences since we have a full posterior distribution to examine. One thing we may ask is the probability that the score is zero or something close to zero. This usually involves defining something called the region of practical equivalence (or ROPE). ROPEs make for more sensible hypothesis testing because it is rarely of interest what the probability of an exact value is (e.g., the probability that a score is exactly zero). In this case, let us say that a z-score in the range of -0.10 to 0.10 is practically equivalent to zero. The following code will return the probability that the z-score for Information is practically no different from zero.
#get posterior distribution of z-score estimates
posterior_dist <- as.data.frame(extract(stan_sample, pars = "Zscore"))
#compute probability that Information (Zscore #9) is between -0.10 and 0.10
mean(posterior_dist[, 9] >= -0.10 & posterior_dist[, 9] <= 0.10)
## [1] 0
This result indicates an estimated probability of zero, though a probability can truly never be zero. It can sometimes be helpful to visualize these estimates, and the combination of the bayestestR and see packages facilitates some fairly easy plotting methods. The following code produces visualizations of the probability of direction (i.e., the probability that a parameter is either positive or negative) and a few different ways of evaluating the ROPE.
#probability of direction plot
plot(p_direction(posterior_dist))
#practical significance plot
plot(p_significance(posterior_dist))
#ROPE plot
plot(rope(posterior_dist, ci = c(0.90, 0.95)))
These plots all help to explore various aspects of the data and performance. The probability of direction is closely related to a traditional p-value as it reflects the probability that a parameter is not zero (i.e., that it has a direction, either positive or negative). The practical significance plot helps to visualize the point estimate of zero (the dotted vertical line) but also how much of the posterior distribution is within the ROPE. This particular test visualizes just a one-sided test based on the probability of direction. So, if the majority of the posterior distribution is negative, then only that negative portion is tested against the ROPE. Finally, the ROPE plot shows the band corresponding to the ROPE (by default this is -0.10 to 0.10), and the plot also automatically color codes the requested credible intervals (in this case, the 90% and 95% were requested via the argument ci = c(0.90, 0.95)). As can be seen, this plot is useful for visualizing varying levels of certainty about an effect. For example, looking at the z-score posterior distributions for tests 3 and 6 (Digit Span and Arithmetic, respectively), the ROPE falls well within the distributions' regions of highest probability, suggesting a high probability that these z-scores are zero or practically zero. In contrast, consider the distribution for test 4's z-score (Matrix Reasoning): in this case, the ROPE falls just beyond the 95% credible interval. This result can be interpreted directly in terms of probability: the probability that this z-score is not practically zero is >0.95.
It is likely that clinicians will not frequently use these tests over such narrow ROPEs as it is not entirely common to need to test against a specific score; instead, it may be more helpful to think about the probability that a score falls within a particular qualitative range. In this context, we might consider a score to be “average” if it is between 90 and 109, or about +/- 1 SD. In the same manner as before, we can compute the probability of being in this ROPE as follows:
#compute probability that Information (Zscore #9) is between -1 and 1
mean(posterior_dist[, 9] >= -1.00 & posterior_dist[, 9] <= 1.00)
## [1] 0.00175
#visualize the new ROPE for all tests
plot(rope(posterior_dist, range = c(-1, 1), ci = c(0.90, 0.95)))
Once again, we see that there is a very small probability that the Information score is in the “average” range as defined by the rough +/- 1 SD rule. The ROPE visualizations also confirm this, and based on this view in particular, it becomes clear that the Information subtest is likely a relative weakness for this simulated “individual.” While working with probabilities is helpful, it can sometimes be more useful to think about clinical hypotheses in terms of null and alternative hypotheses. For example, the null hypothesis might be that a score is “average” relative to a target population (in this case, just the normative population was used), and an alternative might be the observed z-score. A method for comparing the evidence for a null hypothesis to the alternative hypothesis in the Bayesian framework is the aptly named Bayes factor. A simple computation for this is the ratio of the odds of the null hypothesis to the odds of the alternative. Conveniently, the probabilities computed previously can be converted directly to odds via the formula \(\frac{prob}{1-prob}\). The following code shows how to compute the Bayes factor in two parameterizations: one showing the evidence favoring the null and the other showing the evidence favoring the alternative.
#define probabilities
pN <- mean(posterior_dist[, 9] >= -1.00 & posterior_dist[, 9] <= 1.00)
pA <- 1-pN

#convert probabilities to odds
oN <- pN/(1-pN)
oA <- pA/(1-pA)

#BF01 (evidence for the null relative to the alternative)
BF01 <- oN/oA

#BF10 (evidence for the alternative relative to the null)
BF10 <- oA/oN

#show the results
BF01
## [1] 3.073247e-06
BF10
## [1] 325388.8

#show that the results are inversely related
1/BF01 == BF10
## [1] TRUE
There are two things to observe regarding these analyses. First, the Bayes factor for evidence favoring the null hypothesis will always equal the inverse of the Bayes factor for evidence favoring the alternative, so it is only ever necessary to compute one Bayes factor. Second, when a Bayes factor is greater than one, this is taken as positive evidence for that hypothesis over the other. The larger this number, the stronger the evidence for the hypothesis. In this way, the Bayes factor can be thought of as an effect size for a hypothesis, with very large values reflecting robust evidence that the hypothesis of interest is true. A Bayes factor could be clinically relevant for a couple of reasons. For one, it may be of interest to show that a score on a particular test (or combination of tests) is within a certain range. This might occur when specific score ranges are needed for diagnoses or if performance from prior testing is used for comparison (e.g., is current testing within the same interval as prior testing?). A strength of the Bayes factor is that it is able to quantify support for a null hypothesis, which is not possible with standard frequentist methods; performing this kind of equivalence testing in a frequentist framework requires specialized methods (e.g., TOSTs). Another potential application of Bayes factors might arise in validity testing. When validity testing is highly disparate from expectations, it would be possible to test the amount of evidence favoring an abnormal performance over a normal one. A final application that readily comes to mind is in the case of score discrepancies. One might imagine comparing a predicted score (and its prediction interval) to observed scores to test whether there is a difference between them.
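As a small illustration of the prior-testing idea, the sketch below computes the probability (and the corresponding posterior odds) that the current Information z-score falls within an interval observed at a hypothetical earlier evaluation; the interval itself is made up for the example.
#hypothetical range of Information z-scores from a prior evaluation
prior_interval <- c(-1.8, -1.2)

#probability that the current Information z-score falls within that range
p_within <- mean(posterior_dist[, 9] >= prior_interval[1] & posterior_dist[, 9] <= prior_interval[2])

#convert the probability to odds
p_within / (1 - p_within)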
These Bayes factors are a posterior-driven estimate as they include limited prior information. We can improve the Bayes factor estimates by specifying a prior distribution. Since these are z-scores, a reasonable prior distribution to use for this purpose is a standard normal distribution (i.e., a normal distribution with mean of 0 and standard deviation of 1). Using this prior, we can get more informative Bayes factors using the following code via the bayestestR package:
#compute Bayes factors given the assumed prior
bayesfactor_parameters(posterior_dist[, 9], prior = distribution_normal(20000, 0, 1), null = c(-1, 1))
## Bayes Factor (Null-Interval)
##
## BF
## --------
## 1.25e+03
##
## * Evidence Against The Null: [-1.000, 1.000]
#plot the Bayes factor
plot(bayesfactor_parameters(posterior_dist[, 9], prior = distribution_normal(20000, 0, 1), null = c(-1, 1)))
As can be seen from this updated Bayes factor, the evidence favoring the alternative hypothesis is still very strong but much more conservative than the posterior-driven estimate. In all cases, it is more accurate to specify a prior distribution, and this should be the default approach, particularly because we know that a good prior in this setting is the standard normal distribution. The plot is also very helpful in understanding what the Bayes factor is telling us. What we had calculated before assumed a uniform distribution ranging from -1 to 1 as the null hypothesis, but this ignores that a standard normal distribution (a) extends beyond this range and (b) does not place equal probability over all values between -1 and 1. With a better, more appropriate prior distribution assumed, the Bayes factor is now the ratio of the odds that a value between -1 and 1 occurs in the standard normal prior (null) distribution to the odds that a value is in that range in the posterior (alternative) distribution. This comparison is much more sensible and capitalizes on the fact that Bayesian methods allow us to make inferences directly on distributions rather than on point estimates and assumptions about sampling distributions. The plot of this Bayes factor shows the posterior (red) and prior (blue) distributions and then the range we are testing against (the shaded box between -1 and 1). Visually, it is clear that the values we are considering to be “null” are much more common in the prior distribution than in the posterior, meaning that we should be skeptical that the posterior estimate is equivalent to the null hypothesis. Put differently, we have strong evidence that the simulated “individual's” Information performance is significantly different from “average” when average is treated as scores within one standard deviation of the normative sample.
We will now apply these methods to some further exploration to highlight the ways in which the posterior can be used to make detailed inferences about scores. So far, we have been examining just single scores or displaying descriptive plots for all tests. One thing that clinicians may be interested in is extending these methods to combinations of tests such as what we had described earlier. To showcase these applications, we consider a test for whether the Information performance is significantly different from the average subtest performance. For this, we will compute the average for all subtests, including Information.
#compute mean and standard deviation of all subtests
SubMean <- mean(unlist(posterior_dist))
SubSD <- sd(unlist(posterior_dist))
#probability that the average subtest performance is greater than Information
mean(SubMean > posterior_dist[, 9])
## [1] 1
#compute Bayes factor (assuming that prior has mean = mean of all subtests and sd = sd of all subtests)
bayesfactor_parameters(posterior_dist[, 9], null = c(SubMean-0.1, SubMean+0.1), prior = distribution_normal(20000, SubMean, SubSD))
## Bayes Factor (Null-Interval)
##
## BF
## --------
## 5.01e+09
##
## * Evidence Against The Null: [0.035, 0.235]
#visualize Bayes factor for this analysis
plot(bayesfactor_parameters(posterior_dist[, 9], null = c(SubMean-0.1, SubMean+0.1), prior = distribution_normal(20000, SubMean, SubSD)))
In the case above, it is fairly clear that the Information subtest performance is significantly different from the average subtest performance. For this Bayes factor, a narrow ROPE is specified (i.e., the mean of subtests +/- 0.10), but this can be arbitrarily adjusted for clinical needs. In this case, the interest was just in whether or not the two distributions differed significantly, which would normally be something like a t-test comparing the mean of the prior distribution (the mean and standard deviation of all the subtests together) and the mean of the posterior distribution (the estimated z-score for Information). For this reason, a more traditional ROPE made sense.
There are many ways in which the posterior distributions can be utilized to explore and describe test performance, but we have tried to select a few applications that might be more common. Similarly, there are many additional opportunities within R for visualizing these tests or further analyzing the posterior distributions. Since we wanted this supplemental file in particular to be beginner-friendly, we selected only a handful of packages that utilize targeted functions; however, clinicians familiar with R and/or Bayesian methods are encouraged to consider these alternatives. Readers are also reminded that in a typical call, the model will produce parameter estimates for every test in every population. These analyses have focused on a single population, so the code used will not be exact for cases where more than one population's parameters are estimated. The accompanying R code from the simulation studies includes examples of how to run the model on single populations (as well as multiple populations). For certain purposes, it may be most convenient to run the model on various populations to get the post-test probabilities of diagnoses before then running the model on specific populations of interest. For example, it might be of interest to compute certain performance descriptives relative to the normative sample, such as what was highlighted in this document, so one may follow up the “diagnostic” run of the model with a “descriptive” run where the only population entered is the normative one. Similarly, if the post-test probabilities were to return a diagnosis of amnestic MCI as the most likely diagnosis, then it could be helpful for recommendations and prognosis to run a descriptive model on just this single population to compare a client's performance to the expected population of individuals with amnestic MCI. These separate descriptive runs are not strictly needed since all the same information is returned for each population; however, this is mentioned as an alternative since (a) the models run very quickly, so there is little time-associated cost to running an extra model, and (b) when there are multiple populations being fit in the model, a clinician will need to be more careful about which effects they are selecting to describe. When just the one population is estimated, the posterior samples can be extracted and used more easily.
Just to highlight what we mean with regard to care in analyzing the posteriors when multiple populations are fit, consider a case where the model has been run using reference samples for 6 different populations. Say that the test battery includes 12 tests. When the code posterior_dist <- as.data.frame(extract(stan_sample, pars = "rScore")) is used, there will be 12 × 6 = 72 columns that correspond to the posterior predictions of all 12 tests in each of the 6 populations. So, if we wanted to do posterior inference on the 7th test in the 3rd population (let's say that this population is the normative population), then we would need to request just that specific column like this: posterior_dist[, 31] or equivalently posterior_dist[, (12*2)+7]. What the second call makes explicit is that the columns are in order over the 12 tests for each of the populations, so to get to a specific test within a specific population, we have to keep track of how many columns come before it (here, the 24 columns for the first 2 populations). When just a single population is fit, this is not a concern. To get the 7th test when there is just a single population is as simple as calling posterior_dist[, 7]. Clearly this is easier and less prone to error from potentially indexing the wrong column (e.g., getting results for a different test or from a different population than intended). This is ultimately, however, a matter of preference.