partition() reduces data while minimizing information loss using an agglomerative partitioning algorithm. The partition algorithm is fast and flexible: at every iteration, partition() uses an approach called Direct-Measure-Reduce (see Details) to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set.
Usage

partition(
  .data,
  threshold,
  partitioner = part_icc(),
  tolerance = 1e-04,
  niter = NULL,
  x = "reduced_var",
  .sep = "_"
)
Arguments

.data: a data.frame to partition

threshold: the minimum proportion of information explained by a reduced variable; threshold sets a boundary for information loss because each reduced variable must explain at least as much as threshold, as measured by the metric.

partitioner: a partitioner. See the part_*() functions and as_partitioner().

tolerance: a small tolerance around the threshold; if a reduction is within threshold plus or minus tolerance, it is accepted.

niter: the number of iterations. By default, niter is 20% of the number of variables or 10, whichever is larger.

x: the prefix of the new variable names

.sep: a character vector that separates x from the number appended to it (e.g. "reduced_var_1")

Value

a partition object
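For instance, tolerance, x, and .sep together tune how strictly the threshold is applied and how the reduced variables are named. A minimal sketch (this assumes a data.frame df like the one simulated in the Examples below; the resulting names in the comment are illustrative):

# accept reductions that fall within .01 of the threshold, and name the
# new variables "rv.1", "rv.2", ... instead of "reduced_var_1", ...
prt <- partition(df, threshold = .6, tolerance = .01, x = "rv", .sep = ".")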
Details

partition() uses an approach called Direct-Measure-Reduce. Directors tell the partition algorithm what to reduce, metrics tell it whether there will be enough information left after the reduction, and reducers tell it how to reduce the data. Together, these components are called a partitioner. The default partitioner for partition() is part_icc(): it finds pairs of variables to reduce by finding the pair with the minimum distance between them, it measures information loss through the intraclass correlation coefficient (ICC), and it reduces data using scaled row means. Several other partitioners are available (the part_*() functions), and you can create custom partitioners with as_partitioner() and replace_partitioner(), as in the sketch below.
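As a sketch of how these pieces fit together, a partitioner like the default can be assembled by hand from the package's building-block constructors (this assumes the documented components direct_distance_pearson(), measure_icc(), and reduce_scaled_mean()):

# assemble a partitioner equivalent to part_icc() from its three parts:
# a director (what to reduce), a metric (how much information remains),
# and a reducer (how to combine the variables)
manual_icc <- as_partitioner(
  direct = direct_distance_pearson(),
  measure = measure_icc(),
  reduce = reduce_scaled_mean()
)

partition(df, threshold = .6, partitioner = manual_icc)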
References

Millstein, Joshua, Francesca Battaglin, Malcolm Barrett, Shu Cao, Wu Zhang, Sebastian Stintzing, Volker Heinemann, and Heinz-Josef Lenz. 2020. “Partition: A Surjective Mapping Approach for Dimensionality Reduction.” Bioinformatics 36 (3): 676–81. https://doi.org/10.1093/bioinformatics/btz661.

Barrett, Malcolm, and Joshua Millstein. 2020. “partition: A Fast and Flexible Framework for Data Reduction in R.” Journal of Open Source Software 5 (47): 1991. https://doi.org/10.21105/joss.01991.
Examples

library(partition)

set.seed(123)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)
# don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
#> Partitioner:
#> Director: Minimum Distance (Pearson)
#> Metric: Intraclass Correlation
#> Reducer: Scaled Mean
#>
#> Reduced Variables:
#> 2 reduced variables created from 5 observed variables
#>
#> Mappings:
#> reduced_var_1 = {block3_x1, block3_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3}
#>
#> Minimum information:
#> 0.627
# return reduced data
partition_scores(prt)
#> # A tibble: 100 × 9
#> block1_x1 block1_x2 block1_x3 block2_x4 block3_x2 block3_x3 block3_x4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.441 -0.327 0.503 -0.526 0.203 -0.907 -0.919
#> 2 -0.180 -0.584 0.490 -1.71 -0.249 -1.39 -0.398
#> 3 0.376 0.158 -0.0732 0.693 -0.554 -1.52 0.714
#> 4 1.10 1.54 0.564 -0.524 -0.585 -0.00592 0.299
#> 5 -1.66 -1.25 -1.44 0.189 -1.69 -1.43 0.140
#> 6 1.60 2.42 0.192 0.463 -1.26 -0.346 -1.86
#> 7 1.40 0.236 -0.354 -0.313 -0.223 -1.13 0.0716
#> 8 2.21 2.41 1.73 -0.521 1.72 2.19 1.04
#> 9 0.404 0.311 0.672 -0.572 -1.10 -0.0893 -1.55
#> 10 0.199 0.348 0.0455 -0.408 -0.192 -0.355 0.223
#> # ℹ 90 more rows
#> # ℹ 2 more variables: reduced_var_1 <dbl>, reduced_var_2 <dbl>
# access mapping keys
mapping_key(prt)
#> # A tibble: 9 × 4
#> variable mapping information indices
#> <chr> <list> <dbl> <list>
#> 1 block1_x1 <chr [1]> 1 <int [1]>
#> 2 block1_x2 <chr [1]> 1 <int [1]>
#> 3 block1_x3 <chr [1]> 1 <int [1]>
#> 4 block2_x4 <chr [1]> 1 <int [1]>
#> 5 block3_x2 <chr [1]> 1 <int [1]>
#> 6 block3_x3 <chr [1]> 1 <int [1]>
#> 7 block3_x4 <chr [1]> 1 <int [1]>
#> 8 reduced_var_1 <chr [2]> 0.656 <int [2]>
#> 9 reduced_var_2 <chr [3]> 0.627 <int [3]>
unnest_mappings(prt)
#> # A tibble: 12 × 4
#> variable mapping information indices
#> <chr> <chr> <dbl> <int>
#> 1 block1_x1 block1_x1 1 1
#> 2 block1_x2 block1_x2 1 2
#> 3 block1_x3 block1_x3 1 3
#> 4 block2_x4 block2_x4 1 7
#> 5 block3_x2 block3_x2 1 9
#> 6 block3_x3 block3_x3 1 10
#> 7 block3_x4 block3_x4 1 11
#> 8 reduced_var_1 block3_x1 0.656 8
#> 9 reduced_var_1 block3_x5 0.656 12
#> 10 reduced_var_2 block2_x1 0.627 4
#> 11 reduced_var_2 block2_x2 0.627 5
#> 12 reduced_var_2 block2_x3 0.627 6
# use a lower information threshold and a k-means partitioner
partition(df, threshold = .5, partitioner = part_kmeans())
#> Partitioner:
#> Director: <custom director>
#> Metric: <custom metric>
#> Reducer: <custom reducer>
#>
#> Reduced Variables:
#> 2 reduced variables created from 6 observed variables
#>
#> Mappings:
#> reduced_var_1 = {block2_x1, block2_x2, block2_x3, block2_x4}
#> reduced_var_2 = {block3_x1, block3_x5}
#>
#> Minimum information:
#> 0.59
# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(part_icc, reduce = as_reducer(rowMeans))
partition(df, threshold = .6, partitioner = part_icc_rowmeans)
#> Partitioner:
#> Director: Minimum Distance (Pearson)
#> Metric: Intraclass Correlation
#> Reducer: <custom reducer>
#>
#> Reduced Variables:
#> 2 reduced variables created from 5 observed variables
#>
#> Mappings:
#> reduced_var_1 = {block3_x1, block3_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3}
#>
#> Minimum information:
#> 0.627