Clustering Functions

In exploratory data analysis, one of the most common approaches is clustering: grouping similar elements in a collection. In pharmacometrics, clustering can usually be applied to subjects or to the variables (covariates, biomarkers, etc.) associated with them. Accordingly, DeepPumas.jl provides the functions cluster_subjects and cluster_variables, respectively. The ClusterResults type and the functions medoids, cluster_count and total_cost are provided as utilities.

DeepPumas.cluster_subjects Function
julia
cluster_subjects(
    popDF::DataFrame,
    variables::Union{Symbol, Vector{Symbol}},
    k::Integer;
    standardize,
    baseline,
    init,
    maxiter,
    tol,
    display,
)

Group subjects in population popDF into k clusters based on one or more variables (covariates, biomarkers, etc.) using K-Medoids clustering. The pairwise distance matrix is computed with dynamic time warping, which handles varying numbers of measurements. Variables must be numeric and finite, with no missing values. A ClusterResults object is returned. Other keyword arguments:

  • standardize = true: standardize each variable

  • baseline = falses(length(variables)): for each variable, whether only its baseline value should be used. The baseline value is taken from the first row associated with each subject

  • init = :kmpp: initialization of medoids. Can be a vector of k subject IDs, or a Symbol indicating a seeding algorithm. For more details see Clustering.kmedoids

  • maxiter = 200: maximum number of iterations

  • tol = 1e-8: convergence tolerance on the change in objective value between iterations

  • display = :none: verbosity. :none shows nothing. :final summarizes results after clustering. :iter shows the progress at each iteration.

The related function cluster_variables is used to cluster similar variables (e.g., covariates, biomarkers) in a population. See also medoids, cluster_count, total_cost.
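
To illustrate why dynamic time warping copes with varying numbers of measurements, here is a minimal sketch of the textbook algorithm in plain Julia. It is not DeepPumas code: the helper name dtw_distance and the choice of absolute difference as the local cost are assumptions made for illustration, and the distance DeepPumas actually computes may differ in its details.

```julia
# Textbook dynamic time warping between two numeric series that may have
# different lengths. `dtw_distance` is a hypothetical helper, not part of DeepPumas.
function dtw_distance(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    n, m = length(x), length(y)
    # D[i + 1, j + 1] holds the cheapest cost of aligning x[1:i] with y[1:j].
    D = fill(Inf, n + 1, m + 1)
    D[1, 1] = 0.0
    for i in 1:n, j in 1:m
        cost = abs(x[i] - y[j])  # local cost (assumed: absolute difference)
        # Extend the cheapest of the three admissible predecessor alignments.
        D[i + 1, j + 1] = cost + min(D[i, j + 1], D[i + 1, j], D[i, j])
    end
    return D[n + 1, m + 1]
end

a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # same shape as `a`, two extra measurements
c = [2.0, 2.0, 2.0]                      # flat profile with fewer measurements

d_ab = dtw_distance(a, b)
d_ac = dtw_distance(a, c)
```

Because repeated measurements in b can be matched to the same point in a, the two series are treated as identical in shape, while the flat series c ends up farther away.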

DeepPumas.cluster_variables Function
julia
cluster_variables(
    popDF::DataFrame,
    variables::Vector{Symbol},
    k::Integer;
    standardize,
    baseline,
    init,
    maxiter,
    tol,
    display,
    cluster_negative,
)

Group variables (covariates, biomarkers, etc.) in population popDF into k clusters using K-Medoids clustering. The pairwise distance matrix is computed with dynamic time warping, which handles varying numbers of measurements. Variables must be numeric and finite, with no missing values. A ClusterResults object is returned. Other keyword arguments:

  • standardize = true: standardize each variable

  • baseline = falses(length(variables)): for each variable, whether only its baseline value should be used. The baseline value is taken from the first row associated with each subject

  • init = :kmpp: initialization of medoids. Can be a vector of k variable indices, or a Symbol indicating a seeding algorithm. For more details see Clustering.kmedoids

  • maxiter = 200: maximum number of iterations

  • tol = 1e-8: convergence tolerance on the change in objective value between iterations

  • display = :none: verbosity. :none shows nothing. :final summarizes results after clustering. :iter shows the progress at each iteration

  • cluster_negative = false: whether negatively correlated variables should be clustered together.

The related function cluster_subjects clusters subjects according to the given variables. See also medoids, cluster_count, total_cost and ClusterResults.
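
The roles of init, maxiter and tol are easier to see in a small K-Medoids loop over a precomputed distance matrix. The sketch below is a generic alternating scheme in plain Julia, not the DeepPumas or Clustering.jl implementation: the function name kmedoids_sketch and its return fields are hypothetical, and seeding algorithms such as :kmpp are left out, so init is given as explicit indices.

```julia
# Generic alternating K-Medoids on a precomputed distance matrix `dist`.
# `kmedoids_sketch` is a hypothetical helper; `init` holds the starting medoid indices.
function kmedoids_sketch(dist::AbstractMatrix{<:Real}, init::Vector{Int};
                         maxiter::Int = 200, tol::Real = 1e-8)
    medoids = copy(init)
    n = size(dist, 1)
    assignments = zeros(Int, n)
    prev_cost = Inf
    converged = false
    iterations = 0
    while iterations < maxiter
        iterations += 1
        # Assignment step: each element joins the cluster of its nearest medoid.
        for i in 1:n
            assignments[i] = argmin([dist[i, m] for m in medoids])
        end
        # Update step: within each cluster, pick the member with the smallest
        # summed distance to the other members as the new medoid.
        for c in eachindex(medoids)
            members = findall(==(c), assignments)
            isempty(members) && continue
            within = [sum(dist[j, i] for i in members) for j in members]
            medoids[c] = members[argmin(within)]
        end
        # Objective: summed distance from each element to its cluster medoid.
        cost = sum(dist[i, medoids[assignments[i]]] for i in 1:n)
        if abs(prev_cost - cost) < tol
            converged = true
            break
        end
        prev_cost = cost
    end
    return (; medoids, assignments, iterations, converged)
end

# Six points on a line forming two obvious groups.
p = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = abs.(p .- p')
res = kmedoids_sketch(dist, [1, 2])
```

Starting from the poorly placed medoids 1 and 2, the loop settles on the middle point of each group, and the loop stops once the objective no longer changes by at least tol.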

DeepPumas.ClusterResults Type
julia
ClusterResults

Object returned by cluster_subjects and cluster_variables. Contains the following fields:

  • assignments: DataFrame with columns subject (or variable), cluster (assigned cluster), cost (distance from the element to its cluster medoid), cluster_center (medoid of the respective cluster)

  • iterations: number of iterations the algorithm ran for

  • converged: Boolean indicating whether the algorithm converged.

See also medoids, cluster_count, total_cost.

DeepPumas.medoids Function
julia
medoids(cr::ClusterResults)

Return the medoid of each cluster.

See also cluster_variables, cluster_subjects and DeepPumas.ClusterResults.

DeepPumas.cluster_count Function
julia
cluster_count(cr::ClusterResults)

Return the number of elements in each cluster.

See also cluster_variables, cluster_subjects and ClusterResults.

DeepPumas.total_cost Function
julia
total_cost(cr::ClusterResults)

Return the sum of distances from each element to the medoid of its cluster.

See also cluster_variables, cluster_subjects and ClusterResults.
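
As a rough illustration of what medoids, cluster_count and total_cost summarize, the sketch below recomputes the same quantities by hand from a mock table with the columns listed for the assignments field. The rows are made-up illustration data held in a plain vector of named tuples rather than a DataFrame, and the exact return types of the DeepPumas functions are not shown here, so treat this only as a reading aid.

```julia
# Mock rows mirroring the documented columns of `assignments`.
# The values are illustrative, not output from DeepPumas.
rows = [
    (subject = 1, cluster = 1, cost = 0.0, cluster_center = 1),
    (subject = 2, cluster = 1, cost = 1.5, cluster_center = 1),
    (subject = 3, cluster = 2, cost = 0.0, cluster_center = 3),
    (subject = 4, cluster = 2, cost = 0.5, cluster_center = 3),
    (subject = 5, cluster = 2, cost = 2.0, cluster_center = 3),
]

# Medoids: the distinct cluster centers.
centers = unique([r.cluster_center for r in rows])

# Cluster counts: number of elements assigned to each cluster.
counts = Dict{Int,Int}()
for r in rows
    counts[r.cluster] = get(counts, r.cluster, 0) + 1
end

# Total cost: summed distance from each element to the medoid of its cluster.
total = sum(r.cost for r in rows)
```

A medoid always has a cost of zero, since it is the element its own cluster is centered on.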