Clustering Functions

In exploratory data analysis, one of the most common approaches is clustering: grouping similar elements in a collection. In pharmacometrics, clustering is usually applied to subjects or to the variables (covariates, biomarkers, etc.) associated with them. Accordingly, DeepPumas.jl provides the functions cluster_subjects and cluster_variables, respectively. The ClusterResults type and the utility functions medoids, cluster_count and total_cost are also provided.

DeepPumas.cluster_subjects Function

julia

cluster_subjects(
    popDF::DataFrame,
    variables::Union{Symbol, Vector{Symbol}},
    k::Integer;
    standardize,
    baseline,
    init,
    maxiter,
    tol,
    display,
)

Group subjects in population popDF into k clusters based on one or more variables (covariates, biomarkers, etc.) using K-Medoids clustering. The pairwise distance matrix is computed with dynamic time warping, which handles subjects with differing numbers of measurements. Variables must be numeric and finite, with no missing values. A ClusterResults object is returned. Other keyword arguments:

standardize = true: standardize each variable
baseline = falses(length(variables)): per variable, whether only the baseline value should be used. Baseline values are taken from the first row associated with each subject
init = :kmpp: initialization of medoids. Can be a vector of k subject IDs, or a Symbol indicating a seeding algorithm. For more details see Clustering.kmedoids
maxiter = 200: maximum number of iterations
tol = 1e-8: convergence tolerance. The algorithm is considered converged once the change in objective value between iterations drops below this value
display = :none: verbosity. :none shows nothing. :final summarizes results after clustering. :iter shows the progress at each iteration.

The related function cluster_variables is used to cluster similar variables (e.g., covariates, biomarkers) in a population. See also medoids, cluster_count, total_cost.
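To make the role of dynamic time warping more concrete, the following is a minimal sketch of a DTW distance and the pairwise distance matrix built from it. It is not the DeepPumas implementation: the helper name dtw_distance, the absolute-difference local cost and the toy series are all assumptions chosen for illustration.

```julia
# Dynamic time warping distance between two numeric series.
# Hypothetical helper for illustration; not part of DeepPumas.
function dtw_distance(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    n, m = length(a), length(b)
    # cost[i + 1, j + 1] is the best cumulative cost of aligning a[1:i] with b[1:j]
    cost = fill(Inf, n + 1, m + 1)
    cost[1, 1] = 0.0
    for i in 1:n, j in 1:m
        d = abs(a[i] - b[j])  # local cost: absolute difference (an assumption)
        cost[i + 1, j + 1] = d + min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    end
    return cost[n + 1, m + 1]
end

# Three toy "subjects" with different numbers of measurements
series = [[1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], [5.0, 5.0]]

# Symmetric pairwise distance matrix, the kind of input K-Medoids operates on
D = [dtw_distance(x, y) for x in series, y in series]
```

The first two series differ only by a repeated measurement, so their DTW distance is zero even though their lengths differ, while the third series is far from both.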

DeepPumas.cluster_variables Function

julia

cluster_variables(
    popDF::DataFrame,
    variables::Vector{Symbol},
    k::Integer;
    standardize,
    baseline,
    init,
    maxiter,
    tol,
    display,
    cluster_negative,
)

Group variables (covariates, biomarkers, etc.) in population popDF into k clusters using K-Medoids clustering. The pairwise distance matrix is computed with dynamic time warping, which handles differing numbers of measurements. Variables must be numeric and finite, with no missing values. A ClusterResults object is returned. Other keyword arguments:

standardize = true: standardize each variable
baseline = falses(length(variables)): per variable, whether only the baseline value should be used. Baseline values are taken from the first row associated with each subject
init = :kmpp: initialization of medoids. Can be a vector of k variable indices, or a Symbol indicating a seeding algorithm. For more details see Clustering.kmedoids
maxiter = 200: maximum number of iterations
tol = 1e-8: convergence tolerance. The algorithm is considered converged once the change in objective value between iterations drops below this value
display = :none: verbosity. :none shows nothing. :final summarizes results after clustering. :iter shows the progress at each iteration.
cluster_negative = false: whether negatively correlated variables should be clustered together.

The related function cluster_subjects clusters subjects according to the given variables. See also medoids, cluster_count, total_cost and ClusterResults.
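For readers unfamiliar with K-Medoids, the sketch below shows one simple alternating variant running on a precomputed distance matrix, with explicit initial medoid indices and the same maxiter and tol defaults listed above. It is an illustrative assumption, not the algorithm used by DeepPumas or Clustering.kmedoids; simple_kmedoids and the toy data are hypothetical.

```julia
# Minimal alternating K-Medoids on a precomputed distance matrix.
# `simple_kmedoids` is a hypothetical name used for illustration only.
function simple_kmedoids(D::AbstractMatrix{<:Real}, init::Vector{Int};
                         maxiter = 200, tol = 1e-8)
    medoids = copy(init)
    n = size(D, 1)
    assignments = zeros(Int, n)
    prev_cost = Inf
    iterations = 0
    converged = false
    while iterations < maxiter
        iterations += 1
        # Assignment step: each element joins the cluster of its nearest medoid
        for i in 1:n
            assignments[i] = argmin(D[i, medoids])
        end
        cost = sum(D[i, medoids[assignments[i]]] for i in 1:n)
        # Stop once the objective no longer improves by at least `tol`
        if prev_cost - cost < tol
            converged = true
            break
        end
        prev_cost = cost
        # Update step: in each cluster, pick the member closest to all other members
        for c in eachindex(medoids)
            members = findall(==(c), assignments)
            isempty(members) && continue  # keep the old medoid for an empty cluster
            sums = [sum(D[m, members]) for m in members]
            medoids[c] = members[argmin(sums)]
        end
    end
    costs = [D[i, medoids[assignments[i]]] for i in 1:n]
    return (; medoids, assignments, costs, iterations, converged)
end

# Toy data: five points on a line forming two obvious groups
x = [0.0, 1.0, 2.0, 10.0, 11.0]
D = abs.(x .- x')

# Deliberately poor initial medoids (elements 1 and 2)
res = simple_kmedoids(D, [1, 2])
```

Starting from the poor initialization, the medoids move to elements 2 and 4, and the first three points end up in one cluster with the last two in the other. A seeding algorithm such as :kmpp exists to avoid starting points like this one.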

DeepPumas.ClusterResults Type

julia

ClusterResults

Object returned by cluster_subjects and cluster_variables. Contains the following fields:

assignments: DataFrame with columns subject (or variable), cluster (index of the assigned cluster), cost (distance from the element to its cluster medoid), cluster_center (medoid of the respective cluster)
iterations: number of iterations the algorithm ran for
converged: Boolean indicating whether the algorithm converged.

DeepPumas.medoids Function

julia

medoids(cr::ClusterResults)

Return medoids of clusters.

DeepPumas.cluster_count Function

julia

cluster_count(cr::ClusterResults)

Return number of elements in each cluster.

DeepPumas.total_cost Function

julia

total_cost(cr::ClusterResults)

Return sum of distances from each element to the medoid of its cluster.
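The quantities described by these utilities can be understood as simple summaries of the assignments table. The sketch below recomputes them from plain vectors that stand in for the cluster, cost and cluster_center columns. The values and identifiers are made up, and the actual return types of medoids, cluster_count and total_cost are not specified here, so treat this only as a reading aid.

```julia
# Hypothetical stand-ins for the `cluster`, `cost` and `cluster_center` columns
cluster = [1, 1, 1, 2, 2]
cost = [1.0, 0.0, 1.0, 0.0, 1.0]
cluster_center = ["s2", "s2", "s2", "s4", "s4"]

k = maximum(cluster)

# Number of elements in each cluster (what cluster_count is described as returning)
counts = [count(==(c), cluster) for c in 1:k]

# Sum of distances from each element to its medoid (what total_cost is described as returning)
total = sum(cost)

# One medoid per cluster (what medoids is described as returning)
centers = unique(cluster_center)
```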
