super_partition() implements the agglomerative data reduction method Partition for datasets with large numbers of features. It first 'super-partitions' the data into smaller clusters and then applies Partition to each cluster.

super_partition(
  full_data,
  threshold = 0.5,
  cluster_size = 4000,
  partitioner = part_icc(),
  tolerance = 1e-04,
  niter = NULL,
  x = "reduced_var",
  .sep = "_",
  verbose = TRUE,
  progress_bar = TRUE
)

Arguments

full_data

sample-by-feature data frame or matrix

threshold

the minimum proportion of information explained by a reduced variable; threshold sets a boundary for information loss because each reduced variable must explain at least threshold of the information, as measured by the metric.

cluster_size

maximum size of any single cluster; default is 4000

partitioner

a partitioner. See the part_*() functions and as_partitioner().

tolerance

a small tolerance around the threshold; if a reduction is within the threshold plus or minus the tolerance, the variables are reduced.

niter

the number of iterations. By default, it is calculated as 20% of the number of variables or 10, whichever is larger.

x

the prefix of the new variable names; must not be contained in any existing data names

.sep

a character vector that separates x from the number (e.g. "reduced_var_1").

verbose

logical for whether or not to display information about the super-partition step; default is TRUE

progress_bar

logical for whether or not to show progress bar; default is TRUE
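
As a rough illustration of two of the arguments above, the following base R sketch computes the default niter as described and builds reduced variable names from x and .sep. The helpers default_niter() and reduced_names() are hypothetical and not part of the package, and rounding the 20% figure up with ceiling() is an assumption, since the description does not say how fractions are handled.

```r
# Hypothetical helper: default number of iterations as described for `niter`,
# i.e. the larger of 10 and 20% of the number of variables.
# Rounding up with ceiling() is an assumption.
default_niter <- function(n_vars) {
  max(10, ceiling(0.2 * n_vars))
}

# Hypothetical helper: names built from the prefix `x`, the separator `.sep`,
# and a running number, e.g. "reduced_var_1".
reduced_names <- function(n, x = "reduced_var", .sep = "_") {
  paste(x, seq_len(n), sep = .sep)
}

default_niter(45)   # 20% of 45 is 9, so the minimum of 10 applies
default_niter(200)  # 20% of 200 is 40
reduced_names(3)
```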

Value

Partition object

Details

super_partition scales up partition with an approximation. It uses Genie, a fast hierarchical clustering algorithm with qualities similar to those of Partition, to first super-partition the data into ceiling(N/c) clusters, where N is the number of features in the full dataset and c is the user-defined maximum cluster size (cluster_size; default value = 4,000). Then, if any cluster from the super-partition is larger than c, Genie is applied again to that cluster until no cluster exceeds c. Finally, the Partition algorithm is applied to each of the super-partitions.
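
The control flow described above can be sketched in base R. This is only an illustration under stated assumptions: stats::hclust() with average linkage stands in for Genie, 1 - |correlation| is used as the dissimilarity between features, and super_partition_sketch() is a hypothetical name. None of these choices is taken from the package's implementation.

```r
# Sketch of the super-partition step using base R only.
# stats::hclust() stands in for Genie; this is an assumption for
# illustration, not the clustering used by super_partition().
super_partition_sketch <- function(full_data, cluster_size) {
  n_features <- ncol(full_data)
  if (n_features <= cluster_size) {
    return(list(colnames(full_data)))
  }

  # Number of clusters requested at this level: ceiling(N / c)
  k <- ceiling(n_features / cluster_size)

  # Cluster the features (columns) on 1 - |correlation| (assumed dissimilarity)
  d <- as.dist(1 - abs(cor(full_data)))
  membership <- cutree(hclust(d, method = "average"), k = k)
  groups <- split(colnames(full_data), membership)

  # Re-split any cluster that is still larger than cluster_size
  out <- list()
  for (g in groups) {
    if (length(g) > cluster_size) {
      sub_data <- full_data[, g, drop = FALSE]
      out <- c(out, super_partition_sketch(sub_data, cluster_size))
    } else {
      out <- c(out, list(g))
    }
  }
  out
}

set.seed(1)
x <- matrix(rnorm(100 * 50), nrow = 100,
            dimnames = list(NULL, paste0("f", 1:50)))
clusters <- super_partition_sketch(x, cluster_size = 20)
lengths(clusters)  # every entry is at most 20
```

In the real function, Partition would then be run separately on the columns of each element of such a list.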

Large super-partitions sometimes cannot be broken up easily with Genie because their features are highly similar. In that case, k-means is used to break up the cluster.
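
A minimal sketch of such a k-means fallback, using stats::kmeans() from base R, might look as follows. The helper split_with_kmeans() is hypothetical, and the scaling, the number of centers, and nstart are assumptions; the text does not say how super_partition() configures k-means.

```r
# Hypothetical helper: break an oversized cluster of features with k-means.
# Features are columns, so the data are transposed to make each feature a
# row to be clustered. Scaling, centers, and nstart are assumptions.
split_with_kmeans <- function(cluster_data, cluster_size) {
  k <- ceiling(ncol(cluster_data) / cluster_size)
  feature_rows <- t(scale(cluster_data))
  km <- kmeans(feature_rows, centers = k, nstart = 5)
  # k-means does not bound cluster sizes, so a piece may still exceed
  # cluster_size and need another split.
  split(colnames(cluster_data), km$cluster)
}

set.seed(2)
shared <- rnorm(100)
# 12 highly similar features built from one shared signal
similar <- sapply(1:12, function(i) shared + rnorm(100, sd = 0.1))
colnames(similar) <- paste0("s", 1:12)

pieces <- split_with_kmeans(similar, cluster_size = 6)
lengths(pieces)
```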

References

Barrett, Malcolm and Joshua Millstein (2020). partition: A fast and flexible framework for data reduction in R. Journal of Open Source Software, 5(47), 1991, https://doi.org/10.21105/joss.01991

Gagolewski, Marek, Maciej Bartoszuk, and Anna Cena. "Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm." Information Sciences 363 (2016): 8-23.

Millstein, Joshua, Francesca Battaglin, Malcolm Barrett, Shu Cao, Wu Zhang, Sebastian Stintzing, Volker Heinemann, and Heinz-Josef Lenz. 2020. “Partition: A Surjective Mapping Approach for Dimensionality Reduction.” Bioinformatics 36 (3): 676–81. https://doi.org/10.1093/bioinformatics/btz661.

Author

Katelyn Queen, kjqueen@usc.edu

Examples


set.seed(123)
df <- simulate_block_data(c(15, 20, 10), lower_corr = .4, upper_corr = .6, n = 100)

# don't accept reductions where information < .6
prt <- super_partition(df, threshold = .6, cluster_size = 30)
#> 2 super clusters identified. Beginning Partition.
#> Maximum cluster size: 30
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 9 reduced variables created from 32 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block1_x8, block1_x12}
#> reduced_var_2 = {block1_x3, block1_x6}
#> reduced_var_3 = {block2_x12, block2_x14}
#> reduced_var_4 = {block2_x6, block2_x15}
#> reduced_var_5 = {block3_x4, block3_x5}
#> reduced_var_6 = {block1_x4, block1_x10, block1_x15}
#> reduced_var_7 = {block3_x1, block3_x8, block3_x9}
#> reduced_var_8 = {block1_x1, block1_x2, block1_x7, block1_x11, block1_x14}
#> reduced_var_9 = {block2_x2, block2_x3, block2_x4, block2_x5, block2_x7, and 6 more variables}
#> 
#> Minimum information:
#> 0.601