BayesTraits Wrapper (btw) version 1
IMPORTANT NOTE: This is an out-of-date version of btw
. For the up-to-date version (which includes Windows compatibility, support for BayesTraitsV3, and a much simpler approach to specifying BayesTraits commands), see here.
This is an R package for running BayesTraits from R (Mac OS and BayesTraits V2 only). The functions work by using system
to run BayesTraits on your system and delete the output files after importing them into R.
BayesTraits was developed by Mark Pagel and Andrew Meade, and is available from their website as an executable file that can be run from a command line program such as Terminal. Consult the BayesTraits documentation before using btw
.
I think there are two major advantages to using btw
rather than running BayesTraits from the command line. First, if you are comfortable with R, btw
makes it easy to incorporate BayesTraits into R scripts. Second, btw
automatically pulls the output of btw
into R in a ready-to-use format, which greatly simplifies your workflow if you want to analyze your BayesTraits output in R.
Below, I demonstrate usage of btw
.
Set-up
If you have the devtools
package installed, you can intall version 1 of btw
from GitHub like this (note that the ref="v1"
argument specifies version 1 of btw
:
library(devtools)
install_github("rgriff23/btw", ref="v1")
library(btw)
Before using any of the functions that call BayesTraitsV2, you have to download Version 2 of BayesTraits. You then have to tell btw
where to find it on your computer by defining the hidden variable .BayesTraitsPath
. You have to do this every time you start a new R session. Be sure to specify the FULL path.
.BayesTraitsPath <- "~/Desktop/GitHub/btw/BayesTraitsV2"
The package includes some examples of tree and data files. You can load them into your R session after you load btw
:
data(primates)
Some important formatting points that apply to the data for the Discrete
and Continuous
functions:
- Phylogenies must be of class
phylo
ormultiPhylo
- The first column of your data file must contain species names
- Species names must match exactly between the tree and data (but order doesn’t matter)
- You cannot have spaces in your species names
- Discrete characters should be represented with characters or factors (not integers!) between 0 and 9
- A discrete character/factor with more than 1 digit is interpreted as ambiguous (e.g., “01”) indicates that the species is equally likely to be in state “0” or “1” (now you see why the discrete characters can’t be of class ‘integer’, because 01 would just be seen as 1)
- If testing for correlated evolution in discrete characters, use 0 and 1 as the character states
- The only valid way to represent missing data is with a “-“ character
Now let’s use btw
to run some BayesTraits analyses.
Using Discrete in ‘Multistate’ mode
Multistate is used when you have data on a single categorical trait. An example of this sort of data is `primate.discrete1’, which contains data on trait that takes 3 discrete values (0, 1, and 2). Viewing the data reveals one missing value (Colobus_polykomos) and one ambiguous value (Hylobates_agilis).
primate.discrete1
species trait
1 Cercocebus_torquatus 1
2 Cercopithecus_mona 0
3 Cercopithecus_nictitans 0
4 Colobus_angolensis 1
5 Colobus_guereza 2
6 Colobus_polykomos -
7 Homo_sapiens 0
8 Hylobates_agilis 01
More than two character states can be modeled, but keep in mind that the number of model parameters increases rapidly with the number of character states. Specifically, for n states there are $n^2$ - n rate parameters, such that by the time you get to 4 states, 12 rates must be estimated!
Let’s fit a maximum likelihood model where we allow each of the 6 rates (0->1, 1->0, 1->2, 2->1, 0->2, 2->0) to be independent, and use a likelihood ratio test to compare it to a symmetric model where all rates must be equal to each other.
asymmetric <- Discrete(primate.tree1, primate.discrete1, "ML")
symmetric <- Discrete(primate.tree1, primate.discrete1, "ML", res="q01")
lrtest(symmetric, asymmetric)
model1.Lh model2.Lh LRstat pval
1 -4.572531 -4.572531 0 1
The likelihood ratio test is not significant, indicating that we should favor the simpler model. In this case, the symmetric model is simpler because it only estimates one parameter, while the asymmetric model estimates six.
Reversible jump hyper-prior model
Another way to get at the question of whether there is evidence for multiple rates of evolution among the 3 character states is to run a reversible jump model, which will sample across models in proportion to their likelihood and return a distribution of model structures. We will also take phylogenetic uncertainty into account by sampling across a block of 100 trees, primate.tree100
.
multirj <- Discrete(primate.tree100, primate.discrete1, "Bayesian", rj="uniform -100 100", bi=10000, it=1000000, sa=1000)
We can use the function rjmodels
to get a quick summary of the models sampled in the reversible jump analysis.
rjout <- rjmodels(multirj)
rjout
$NumModels
[1] 406
$TopTen
'0 Z 0 0 0 0 '0 0 0 Z 0 0 '0 0 0 0 0 0 '0 0 0 Z Z 0 'Z 0 0 0 0 0
0.03636364 0.02727273 0.02424242 0.01818182 0.01717172
'0 0 0 Z 0 Z '0 Z 0 0 0 Z 'Z 0 Z 0 0 0 '0 Z 0 0 Z 0 '0 0 0 0 1 0
0.01616162 0.01616162 0.01616162 0.01212121 0.01010101
Output indicates that an extremely large number of models were sampled in the posterior distribution (406!) and the top ten models represented an extremely small proportion of the total sample of models (the best model represented about 4% of all the models). This indicates that support for any one model over other possible models is weak, and in light of this, the simplest model should be chosen (i.e., all rates equal).
Fossilizing and reconstructing an ancestral state
Next, let’s compare an unconstrained model to a model where we fossilize the ancestor of Homo_sapiens and Hylobates_agilis to be 2. And let’s do it in a Bayesian framework.
fossilape <- Discrete(primate.tree1, primate.discrete1, "Bayesian", fo="2 Homo_sapiens Hylobates_agilis")
nofossil <- Discrete(primate.tree1, primate.discrete1, "Bayesian")
bf <- bftest(fossilape, nofossil)
bf
BayesFactor BetterModel
1 2.778082 Model 2
The Bayes factor is 2.778082 and the unconstrained model is preferred. We can also estimate the probabilities of different ancestral states for the Homo_sapiens/Hylobates_agilis ancestor. Let’s use maximum likelihood this time (the default setting).
reconstruct <- Discrete(primate.tree1, primate.discrete1, mrca="Homo_sapiens Hylobates_agilis")
reconstruct
Tree.No Lh q01 q02 q10 q12 q20 q21 Root.P.0.
1 1 -4.57253 0.030713 0 0 0.230283 0.144508 0.407705 0.329525
Root.P.1. Root.P.2. mrcaNode1.P.0. mrcaNode1.P.1. mrcaNode1.P.2.
1 0.334345 0.33613 0.439357 0.263306 0.297338
Results indicate that there is only a 29.7% chance that the Homo_sapiens/Hylobates_agilis ancestor was in state 2, which is consistent with our finding that the model where we fossilized that ancestor at 2 was significantly worse than the unconstrained model.
Using Discrete in ‘Discrete’ mode
Discrete can be used to test for correlated evolution between a pair of binary characters. An example of this data is primate.discrete2
, which contains a pair of binary traits.
Independent vs. dependent evolution test
Let’s do the significance test for correlated (i.e., dependent) evolution for the two traits in maximum likelihood mode.
nocorrD <- Discrete(primate.tree1, primate.discrete2)
corrD <- Discrete(primate.tree1, primate.discrete2, dependent=TRUE)
lrtest(corrD, nocorrD)
model1.Lh model2.Lh LRstat pval
1 -5.686009 -6.786591 2.201164 0.1379066
The difference is not significant, indicating that the simpler model (no correlation) should be preferred. The output from these models can be a little confusing, but the plotdiscrete
function allows visualization of the results.
plotdiscrete(nocorrD, main="Independent")
plotdiscrete(corrD, main="Dependent")
In this analysis, each pair of states for the two traits is treated as a separate state (00, 01, 10, 11). The independent model forces some pairs of rate parameters to be identical, because the idea is that transition rates for each trait should be the same, indepent of what state the other trait is in (this leads to 4 overall parameters). By contrast, the dependent model assumes that transition rates for each trait depend on the state of the other trait, such that different rates need to be estimated (leading to 8 overall parameters). Due to the different constraints imposed on the two models, their transition rate matrices look very different, but as the likelihood ratio test demonstrates, the difference is not significant.
Using Continuous
The Continuous
function can be used to compare models of evolution for one or more continuous traits. The format for the data files is the same as for discrete traits.
Phylogenetic signal test
Continuous supports the estimation of several parameters that correspond to different models of evolution. Perhaps the most widely used of these is lambda, the phylogenetic signal parameter developed by Mark Pagel (1999). Let’s test the hypothesis that the maximum likelihood estimate of lambda is significantly different from 0 for a single continuous trait.
lambda0 <- Continuous(primate.tree1, primate.continuous1, lambda=0)
lambdaML <- Continuous(primate.tree1, primate.continuous1, lambda="ML")
lrtest(lambdaML, lambda0)
model1.Lh model2.Lh LRstat pval
1 1.115431 1.115431 0 1
The p-value is not significant, so there is no evidence that lambda is different from 0 (which corresponds to no phylogenetic signal in the data).
Correlated evolution test
We can test for a significant correlation between pairs of continuous traits. The tc
parameter fixes the correlation between two traits to be 0, and this can be used as a null model to test for a significant correlation.
nocorrC <- Continuous(primate.tree1, primate.continuous2, tc=TRUE)
corrC <- Continuous(primate.tree1, primate.continuous2)
lrtest(nocorrC, corrC)
model1.Lh model2.Lh LRstat pval
1 -6.18303 -5.736392 0.893276 0.3445911
The p-value is not significant, so there is no evidence for a correlation between these traits.
Regression model
Continuous can also be used to fit regression models with 1 or more predictor variables. The first variable in the dataframe is automatically treated as the response variable. Let’s fit a regression model with two predictor variables.
glm1 <- Continuous(primate.tree1, primate.continuous3, regression=TRUE)
glm1
Tree.No Lh Alpha Beta.2 Beta.3 Var R.2
1 1 -3.448363 0.05771658 -0.5048425 1.243642 0.008530046 0.4391672
SSE SST Error.Ratio s.e..Alpha s.e..Beta.2 s.e..Beta.3
1 0.06824037 0.1216769 0.4391672 0.344801 0.734077 0.570176
Kappa Delta Lambda OU
1 1 1 1 0
We can compare the standard errors of regression parameters to their estimated values to assess whether they are significant. In this model, the estimate of Beta.2 is -0.5 with a standard error of 0.73, while the estimate of Beta.3 is 1.24 with a standard error of 0.57. Thus, only Beta.3 is significantly different from 0 (because 0 lies outside the interval 1.24 +/- 0.57).
MCMC diagnostics
A convenient thing about running BayesTraits with btw
is that the output from Bayesian MCMC analyses are basically ready to be analyzed with the R package coda
for doing MCMC diagnostics. Here, I provide some examples of how coda
can be used to do health assessments for MCMC chains. First, let’s make some data (this won’t be pretty… the point is just to see how it works).
coda.demo <- Discrete(primate.tree100, primate.discrete1, "Bayesian", it=1000000)
Now load the coda
package (if it isn’t already installed, then install it!).
library(coda)
In one easy step, we can prepare the MCMC output for analysis in coda
. The key is that coda
functions work on objects of class mcmc
, so we need to convert columns from our dataframe into mcmc
objects before using coda
plotting functions. We can do this by wrapping the variable we want to plot with the mcmc
function.
Here is a density plot for the likelihood.
densplot(mcmc(coda.demo$Lh))
Here is a trace plot for the likelihood.
traceplot(mcmc(coda.demo$Lh))
Here is an autocorrelation plot for the likelihood.
autocorr.plot(mcmc(coda.demo$Lh))
This all just goes to say that btw
makes it easy for you to use the tools provided by coda
to do MCMC diagnostics. Check out the coda
documentation for more info!
References
Pagel, M. (1999). “Inferring the historical patterns of biological evolution.” Nature 401(6756): 877-884.