Skip to contents

The R package penetrance provides an estimation of age-specific penetrance for complex family-based studies in a format compatible with with the PanelPRO R package.

This vignette demonstrates the process of age imputation and penetrance estimation using thepenetrance package based on family data and user-specified inputs.

Data

The user must specify the pedigree argument as a data frame that contains the family data (see test_family_1). The family data must be in the correct format with the following columns:

  • ID: A numeric value representing the unique identifier for each individual. There should be no duplicated entries.

  • Sex: A numeric value where 0 indicates female and 1 indicates male. Missing entries are not currently supported.

  • MotherID: A numeric value representing the unique identifier for an individual’s mother.

  • FatherID: A numeric value representing the unique identifier for an individual’s father.

  • isProband: A numeric value where 1 indicates the individual is a proband and 0 otherwise.

  • CurAge: A numeric value indicating the age of censoring (current age if the person is alive or age at death if the person is deceased). Allowed ages range from 1 to 94.

  • isAff: A numeric value indicating the affection status of cancer, with 1 for diagnosed individuals and 0 otherwise. Missing entries are not supported.

  • Age: A numeric value indicating the age of cancer diagnosis, encoded as NA if the individual was not diagnosed. Allowed ages range from 1 to 94.

  • Geno: A column for germline testing or tumor marker testing results. Positive results should be coded as 1, negative results as 0, and unknown results as NA or left empty.

Age Imputation

The PenEsim also includes an option for automatic age imputation AgeImputation. The imputation of ages is performed based on the individual’s affected status (aff), sex (sex), and their degree of relationship to the proband, who is a carrier of the pathogenic variant (PV).

To calculate the degree of relationship to the proband, we use the kinship matrix from the pedigree data. The degree of relationship between two individuals is twice the kinship coefficient. For affected individuals (aff = 1), if a randomly drawn value is less than the relationship probability, the age is drawn from a Weibull distribution. The Weibull distribution parameters for males and females are (α,β,δ)(\alpha, \beta, \delta). The quantile function of the Weibull distribution is used to draw the ages (separately for males and females). If the random draw exceeds the relationship probability, the age is drawn from the SEER data using the inverse CDF method. In this case, the individual is assumed to be a non-carrier of the PV. To calculate the empirical density for non-affected individuals, filter the dataset to include only non-affected individuals and estimate the density of their ages. Then randomly draw an age from this distribution for non-affected individuals.

Estimation Approach

The estimation assumes a Mendelian model with a single biallelic locus. The main function penetrance implements a Bayesian approach to estimate the parameters of the carrier penetrance function. The estimation routine employs an Adaptive Metropolis Hastings (MH) algorithm based on Haario et al., (2001). We build on existing software implementations of the Elston-Stewart peeling algorithm in the R package clipp to efficiently compute likelihoods.

To model the penetrance curve we choose the Weibull distribution which is widely used in reliability engineering and survival analysis. To provide further flexibility we extend the standard Weibull distribution to four parameters with an additional threshold parameter (δ\delta) to move the distribution along the age axis and an asymptote parameter (0<γ<10 < \gamma < 1) which allows for incomplete penetrance and is interpreted as the lifetime probability of developing cancer. Hence, we have a vector of parameters θ=(α,β,γ,δ)\theta = (\alpha, \beta, \gamma, \delta). Specifically, we model the density ff and corresponding cumulative distribution functions FF as:

f(x;α,β,δ,γ)={γ(αβ(xδβ)α1e(xδβ)α)xδ0x<δ f(x; \alpha, \beta, \delta, \gamma) = \begin{cases} \gamma \left( \frac{\alpha}{\beta} \left( \frac{x - \delta}{\beta} \right)^{\alpha - 1} e^{ -\left( \frac{x - \delta}{\beta} \right)^\alpha } \right) & x \geq \delta \\ 0 & x < \delta \end{cases} The MCMC estimation approach allows us to sample from the posterior distribution in an efficient way in order to infer the estimates of the parameters from the Weibull distribution.

Prior Specification

To run the MCMC algorithm, the prior distributions for θ=(α,β,γ,δ)\theta = (\alpha, \beta, \gamma, \delta) need to be specified. The package provides the user with a flexible approach to prior specification, balancing customization with an easy-to-use workflow. The following settings for the prior distribution specification are available: - Default: The default setting employs pre-specified, uninformative priors. - Custom Parameters: The user can tune the parameters of the default setting directly by overwriting the parameters in the prior_params_default object. - Prior Elicitation from Existing Studies: The user can input data from previous penetrance studies to automatically elicit customized priors. The default setting is adjusted to reflect the additional information based on the provided inputs.

# The default prior 
prior_params_default <- list(
    asymptote = list(g1 = 1, g2 = 1),
    threshold = list(min = 15, max = 35),
    median = list(m1 = 2, m2 = 2),
    first_quartile = list(q1 = 6, q2 = 3)
)

distribution_data_test <- data.frame(
  row.names = c("min", "first_quartile", "median", "max"),
  age = c(NA, NA, NA, NA),
  at_risk = c(NA, NA, NA, NA)
)

Estimation using penetrance

There are a few ways in which a user can specify how the estimation approach is run. Available options are:

#' @param twins Identical twins or triplets in the family can be specifed. For example, to indicate that `ora024` and `ora027` are identical twins, and so are `aey063` and `aey064`, 
#' then we can use the following as the twins arguement: twins <- list(c("ora024", "ora027"), c("aey063", "aey064"))
#' @param n_chains Number of chains for parallel computation. Default is 1.
#' @param n_iter_per_chain Number of iterations for each chain. Default is 10000.
#' @param ncores Number of cores for parallel computation. Default is 6.
#' @param baseline_data Data for the baseline risk estimates (probability of developing cancer), such as population-level risk from a cancer registry. Default is the allele frequency for MLH1 from the PanelPRO database.
#' @param max_age Maximum age considered for analysis. Default is 94.
#' @param removeProband Logical, indicating whether to remove probands from the analysis. Default is FALSE.
#' @param ageImputation Logical, indicating whether to perform age imputation. Default is FALSE.
#' @param median_max Boolean indicating whether to use the baseline median age or max_age as an upper bound for the median proposal. Default is TRUE.
#' @param BaselineNC Boolean indicating that the non-carrier penetrance is assumed to be the baseline penetrance. Default is TRUE.
#' @param var Vector of variances for the proposal distribution. Default is c(0.1, 0.1, 2, 2, 5, 5, 5, 5).
#' @param burn_in Fraction of results to discard as burn-in (0 to 1). Default is 0 (no burn-in).
#' @param thinning_factor Factor by which to thin the results. Default is 1 (no thinning).
#' @param distribution_data Data for generating prior distributions.
#' @param af Allele frequency for the risk allele. Default is 0.0001.
#' @param max_penetrance Maximum penetrance considered for analysis. Default is 1.
#' @param sample_size Optional sample size for distribution generation.
#' @param ratio Optional ratio parameter for distribution generation.
#' @param prior_params Parameters for prior distributions.
#' @param risk_proportion Proportion of risk for distribution generation.
#' @param summary_stats Boolean indicating whether to include summary statistics in the output. Default is TRUE.
#' @param rejection_rates Boolean indicating whether to include rejection rates in the output. Default is TRUE.
#' @param density_plots Boolean indicating whether to include density plots in the output. Default is TRUE.
#' @param plot_trace Boolean indicating whether to include trace plots in the output. Default is TRUE.
#' @param penetrance_plot Boolean indicating whether to include penetrance plots in the output. Default is TRUE.
#' @param penetrance_plot_pdf Boolean indicating whether to include PDF plots in the output. Default is TRUE.
#' @param probCI Probability level for confidence intervals in penetrance plots. Default is 0.95.
#' @param sex_specific Logical, indicating whether to use sex-specific parameters in the analysis. Default is TRUE.

Example

To run the algorithm, we require the user to input:

  • A single chain with 10k iterations of the adaptive MH algorithm. The number of cores for the paralellization is set to ncores = 4.
  • The pedigree data are the test family 1.
  • The allele frequency for the pathogenic variant is set at af = 0.0001.
  • The penetrance function takes baseline age-specific probabilitie of developing cancer as as input baseline_data. In the default setting with BaselineNC = TRUE this baseline is assumed to reflect the non-carrier penetrance. For rare mutations this is considered a reasonable assumption. The baseline_data must be a data frame with baseline penetrance for females and males.
# This is exemplary SEER penetrance data for Colorectal Cancer. 
baseline_data_default <- data.frame(
  Age = 1:94,
  Female = c(
    2.8e-07, 9e-08, 1e-08, 3e-08, 5e-08, 7e-08, 9e-08, 7e-08, 5e-08, 4e-08, 2e-08, 4e-08, 3.2e-07, 6.3e-07, 9.5e-07, 1.27e-06,
    1.94e-06, 4.73e-06, 7.88e-06, 1.102e-05, 1.417e-05, 1.916e-05, 3.517e-05, 5.303e-05, 7.088e-05, 8.873e-05, 0.00010961,
    1.486e-04, 1.906e-04, 0.00023261, 0.00027461, 0.00032101, 0.00039385, 0.00047109, 0.00054832, 0.00062554, 7.166e-04,
    0.00089081, 0.00107885, 0.00126684, 0.00145479, 0.00163827, 0.00179526, 0.00194779, 0.00210026, 0.00225264, 0.00239638,
    0.00248859, 0.00257215, 0.00265562, 0.00273901, 0.00281841, 0.00287439, 0.00292639, 0.00297831, 0.00303014, 0.00309486,
    0.00323735, 0.00339265, 0.00354775, 0.00370266, 0.00385852, 0.00402115, 0.0041847, 0.00434799, 4.511e-03, 0.00466219,
    0.00474384, 0.00481372, 0.00488337, 0.0049528, 0.00500362, 0.00494413, 0.00486620, 0.00478821, 0.00471016, 0.00462597,
    0.00450511, 0.00437818, 0.00425130, 0.00412450, 0.00399836, 0.00387577, 0.00375380, 0.00363187, 0.00351002, 0.00338186,
    0.00321541, 0.00304280, 0.00287050, 0.00269853, 0.00253244, 0.00239963, 0.00227262
  ),
  Male = c(
    2.8e-07, 9e-08, 1e-08, 3e-08, 5e-08, 7e-08, 9e-08, 7e-08, 5e-08, 4e-08, 2e-08, 4e-08, 3.2e-07, 6.3e-07, 9.5e-07, 1.27e-06,
    1.94e-06, 4.73e-06, 7.88e-06, 1.102e-05, 1.417e-05, 1.916e-05, 3.517e-05, 5.303e-05, 7.088e-05, 8.873e-05, 0.00010961,
    1.486e-04, 1.906e-04, 0.00023261, 0.00027461, 0.00032101, 0.00039385, 0.00047109, 0.00054832, 0.00062554, 7.166e-04,
    0.00089081, 0.00107885, 0.00126684, 0.00145479, 0.00163827, 0.00179526, 0.00194779, 0.00210026, 0.00225264, 0.00239638,
    0.00248859, 0.00257215, 0.00265562, 0.00273901, 0.00281841, 0.00287439, 0.00292639, 0.00297831, 0.00303014, 0.00309486,
    0.00323735, 0.00339265, 0.00354775, 0.00370266, 0.00385852, 0.00402115, 0.0041847, 0.00434799, 4.511e-03, 0.00466219,
    0.00474384, 0.00481372, 0.00488337, 0.0049528, 0.00500362, 0.00494413, 0.00486620, 0.00478821, 0.00471016, 0.00462597,
    0.00450511, 0.00437818, 0.00425130, 0.00412450, 0.00399836, 0.00387577, 0.00375380, 0.00363187, 0.00351002, 0.00338186,
    0.00321541, 0.00304280, 0.00287050, 0.00269853, 0.00253244, 0.00239963, 0.00227262
  )
)
  • The specified parameters for the prior distributions are as per the specification in the prior_params_default object (see above).
  • The burn-in is set to burn_in = 0.1, which means the first 1000 iterations of the chain will be discarded (given 10k iterations of the algorithm).
  • The median age of onset of the baseline (non-carrier) penentrance is set as the upper bound for the proposed median, given that median_max = TRUE.
  • The automatic age imputation AgeImputation is performed (see above), given that ageImputation = TRUE.
output <- penetrance(
    pedigree  = dat, twins = NULL, n_chains = 1, n_iter_per_chain = 20000, 
    ncores = 2, baseline_data = baseline_data_default, af  = 0.0001, 
    prior_params = prior_params_default, burn_in = 0.1, median_max = TRUE,  
    ageImputation = TRUE, sex_specific = TRUE
)

References

Andrieu C, Thoms J. A tutorial on adaptive MCMC. Stat Comput. 2008; 18(3):343–373. doi: 10.1007/s11222- 008-9110-y.

Golicher AR. clipp: Calculating Likelihoods by Pedigree Paring; 2023, https://CRAN.R-project.org/package=clipp, r package version 1.7.0.

Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001; 7(2):223–242.

Lange K, Elston RC. Extensions to pedigree analysis I. Likehood calculations for simple and complex pedigrees. Hum Hered. 1975;25(2):95-105.