The aphylo
R package implements estimation and data imputation methods for Functional Annotations in Phylogenetic Trees. The core function consists of the log-likelihood computation of observing a given phylogenetic tree with functional annotation on its leaves and the probabilities associated to gain and loss of function, including probabilities of experimental misclassification. The log-likelihood is computed using peeling algorithms, which required developing and implementing efficient algorithms for re-coding and preparing phylogenetic tree data to be used with the package. Finally, aphylo
works smoothly with popular tools for analysis of phylogenetic data such as ape
R package, “Analyses of Phylogenetics and Evolution.”
The package is under MIT License and is developed by the Computing and Software Cores of the Biostatistics Division’s NIH Project Grant (P01) at the Department of Preventive Medicine at the University of Southern California.
citation(package="aphylo")
in publications use the following paper:
To cite aphylo
al. (2021)
Vega Yon GG, Thomas DC, Morrison J, Mi H, Thomas PD, et for automatic annotation of gene
Bayesian parameter estimation
functions using observational data and phylogenetic trees. PLOS17(2): e1007948.
Computational Biology ://doi.org/10.1371/journal.pcbi.1007948
https
:
And the actual R package
G (2022). _Statistical Inference of Annotated Phylogenetic
Vega Yon 0.3-2,
Trees_. R package version <https://github.com/USCBiostats/aphylo>.
in BibTeX format, use 'print(<citation>,
To see these entries bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.
This package depends on another on-development R package, the fmcmc
. So first, you need to install it:
devtools::install_github("USCbiostats/fmcmc")
Then you can install the aphylo
package
devtools::install_github("USCbiostats/aphylo")
: ape Loading required package
# This datasets are included in the package
data("fakeexperiment")
data("faketree")
head(fakeexperiment)
LeafId f1 f21,] 1 0 0
[2,] 2 0 1
[3,] 3 1 0
[4,] 4 1 1 [
head(faketree)
ParentId NodeId1,] 6 1
[2,] 6 2
[3,] 7 3
[4,] 7 4
[5,] 5 6
[6,] 5 7 [
O <- new_aphylo(
tip.annotation = fakeexperiment[,2:3],
tree = as.phylo(faketree)
)
O
4 tips and 3 internal nodes.
Phylogenetic tree with
:
Tip labels1, 2, 3, 4
:
Node labels5, 6, 7
Rooted; no branch lengths.
Tip (leafs) annotations:
f1 f21 0 0
2 0 1
3 1 0
4 1 1
:
Internal node annotations
f1 f25 9 9
6 9 9
7 9 9
as.phylo(O)
4 tips and 3 internal nodes.
Phylogenetic tree with
:
Tip labels1, 2, 3, 4
:
Node labels5, 6, 7
Rooted; no branch lengths.
# We can visualize it
plot(O)
plot_logLik(O)
set.seed(198)
dat <- raphylo(
50,
P = 1,
psi = c(0.05, 0.05),
mu_d = c(0.8, 0.3),
mu_s = c(0.1, 0.1),
Pi = .4
)
dat
50 tips and 49 internal nodes.
Phylogenetic tree with
:
Tip labels1, 2, 3, 4, 5, 6, ...
:
Node labels51, 52, 53, 54, 55, 56, ...
Rooted; no branch lengths.
Tip (leafs) annotations:
fun00001 1
2 0
3 0
4 1
5 0
6 0
...(44 obs. omitted)...
:
Internal node annotations
fun00001 1
2 1
3 1
4 1
5 1
6 0
...(43 obs. omitted)...
# Parameters and data
psi <- c(0.020,0.010)
mu_d <- c(0.40,.10)
mu_s <- c(0.04,.01)
eta <- c(.7, .9)
pi_root <- .05
# Computing likelihood
str(LogLike(dat, psi = psi, mu_d = mu_d, mu_s = mu_s, eta = eta, Pi = pi_root))
2
List of $ Pr:List of 1
$ : num [1:99, 1:2] 0.018 0.686 0.686 0.018 0.686 0.686 0.018 0.018 0.018 0.686 ...
..$ ll: num -40.4
# Using L-BFGS-B (MLE) to get an initial guess
ans0 <- aphylo_mle(dat ~ psi + mu_d + Pi + eta)
# MCMC method
ans2 <- aphylo_mcmc(
dat ~ mu_d + mu_s + Pi,
prior = bprior(c(9, 1, 1, 1, 5), c(1, 9, 9, 9, 5)),
control = list(nsteps=5e3, burnin=500, thin=10, nchains=2))
: While using multiple chains, a single initial point has been passed
Warning`initial`: c(0.9, 0.5, 0.1, 0.05, 0.5). The values will be recycled.
via
Ideally you would want to start each chain from different locations.
5500 steps. Gelman-Rubin's R: 1.0314. (500 final count of samples). Convergence has been reached with
ans2
ESTIMATION OF ANNOTATED PHYLOGENETIC TREE
: aphylo_mcmc(model = dat ~ mu_d + mu_s + Pi, priors = bprior(c(9,
Call1, 1, 1, 5), c(1, 9, 9, 9, 5)), control = list(nsteps = 5000,
burnin = 500, thin = 10, nchains = 2))
LogLik (unnormalized): -20.0599
: mcmc (5500 steps)
Method used# of Leafs: 50
# of Functions 1
# of Trees: 1
Estimate Std. Err.0.9093 0.0827
mu_d0 0.1608 0.0767
mu_d1 0.1015 0.0669
mu_s0 0.1022 0.0443
mu_s1 0.5318 0.1443 Pi
plot(
ans2,
nsample = 200,
loo = TRUE,
ncores = 2L
)
# MCMC Diagnostics with coda
library(coda)
gelman.diag(ans2$hist)
:
Potential scale reduction factors
Point est. Upper C.I.1.00 1.02
mu_d0 1.02 1.11
mu_d1 1.00 1.01
mu_s0 1.01 1.06
mu_s1 1.01 1.02
Pi
Multivariate psrf
1.03
plot(ans2$hist)
pred <- prediction_score(ans2, loo = TRUE)
pred
score (H0: Observed = Random)
Prediction
: 99
N obs.
: 0.71 ***
Observed : NA
Random P(<t) : 0.0000
--------------------------------------------------------------------------------
0 and 1, 1 being best.
Values scaled to range between
: *** p < .01, ** p < .05, * p < .10
Significance levels0.79.
AUC 0.29. MAE
plot(pred)
During the development process, we decided to allow the user to choose what ‘tree-reader’ function he would use, particularly between using either the rncl R package or ape. For such, we created a short benchmark that compares both functions here.