Super Partitioning of Variables Based on Correlation
This function identifies highly correlated variables in a dataset and groups them into modules. Correlations are computed in a series of steps and stored across temporary files. A hierarchical, chunk-based processing scheme is used so that large datasets can be handled.
Usage
suppar(
  tmp,
  thresh = NULL,
  n.chunkf = 10000,
  B = 2000,
  compute.corr = TRUE,
  dist.thresh = NULL,
  dir.tmp
)
Arguments
- tmp
A data frame or matrix of data to be analyzed.
- thresh
Numeric vector; thresholds for defining the correlation strength necessary to consider two variables as connected or dependent. The function creates modules of variables that have correlations above these thresholds.
- n.chunkf
Integer; the number of features to process per chunk in the correlation analysis.
- B
Integer; the maximum size of a module. If a module reaches this size, it will not be merged with other modules even if its members are correlated with members of another module.
- compute.corr
Logical; if TRUE, the correlations are computed; if FALSE, pre-computed correlations are used.
- dist.thresh
Optional; a distance threshold to apply before computing correlations, allowing for spatial constraints on correlation computation.
- dir.tmp
Directory path where temporary correlation files are stored.
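A call might look like the following sketch. It assumes that variables are stored in named columns, that a single threshold of 0.8 is an acceptable value for thresh, and that dir.tmp may be any writable directory; none of this is confirmed on this page. The call is guarded so that the snippet also runs where suppar is not available.

```r
# Simulated data: 50 observations of 6 named variables
# (column-wise layout is an assumption).
set.seed(1)
x <- matrix(rnorm(50 * 6), nrow = 50,
            dimnames = list(NULL, paste0("v", 1:6)))
x[, 2] <- x[, 1] + rnorm(50, sd = 0.1)  # make v2 strongly correlated with v1

# A dedicated scratch directory for the temporary correlation files.
scratch <- file.path(tempdir(), "suppar_tmp")
dir.create(scratch, showWarnings = FALSE)

# Guarded call: only runs if a function named suppar() is available.
if (exists("suppar", mode = "function")) {
  res <- suppar(
    tmp = x,
    thresh = 0.8,        # assumed value, chosen for illustration
    n.chunkf = 10000,    # default shown in Usage
    B = 2000,            # default shown in Usage
    compute.corr = TRUE,
    dir.tmp = scratch
  )
  str(res)
}
```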
Value
A list containing two elements:
- A list of character vectors, where each vector contains the names of variables that form a module.
- A character vector of independent variables not included in any module.
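The shape of the return value can be pictured with a hand-built stand-in. The object below is constructed manually, not produced by suppar, and the element order (modules first, independent variables second) is taken from the description above. Positional access is used because element names are not documented here.

```r
# Hand-built stand-in for a suppar() result (illustrative values only).
res <- list(
  list(c("v1", "v2", "v5"), c("v3", "v7")),  # modules: one character vector each
  c("v4", "v6")                               # variables not in any module
)

modules     <- res[[1]]
independent <- res[[2]]

# Number of variables in each module.
module_sizes <- lengths(modules)

# Long-format lookup table: one row per variable, with its module index.
membership <- data.frame(
  variable = unlist(modules),
  module   = rep(seq_along(modules), times = module_sizes)
)
```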
Details
The suppar function starts by setting up the environment and preparing the data.
If compute.corr is TRUE, it computes the correlations and stores the results in
temporary files in dir.tmp. It then loads these files one by one, aggregates
correlated variables into groups using the partagg function, and finally
cleans up the temporary files.
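The store, load one by one, and clean up cycle described above can be sketched in base R. The file naming scheme, the use of saveRDS and readRDS, and the three-column pair table are assumptions made for illustration; the on-disk format that suppar actually uses is not described here.

```r
# Scratch directory standing in for dir.tmp.
dir_tmp <- file.path(tempdir(), "corr_chunks_demo")
dir.create(dir_tmp, showWarnings = FALSE)

# Pretend two chunks of above-threshold pairs have already been computed.
chunks <- list(
  data.frame(var1 = "v1", var2 = "v2", r = 0.93),
  data.frame(var1 = c("v3", "v2"), var2 = c("v4", "v5"), r = c(0.88, 0.91))
)

# Store each chunk in its own file (hypothetical naming scheme).
for (i in seq_along(chunks)) {
  saveRDS(chunks[[i]], file.path(dir_tmp, sprintf("corr_chunk_%03d.rds", i)))
}

# Load the files one at a time so only one chunk is in memory at once.
files <- list.files(dir_tmp, pattern = "^corr_chunk_.*\\.rds$", full.names = TRUE)
n_pairs <- 0
for (f in files) {
  pairs <- readRDS(f)
  n_pairs <- n_pairs + nrow(pairs)  # an aggregation step would go here
}

# Clean up the temporary files.
unlink(files)
files_left <- length(list.files(dir_tmp, pattern = "^corr_chunk_"))
```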
corfun1 and corfun2 are helper functions called within suppar. They compute the
correlations in chunks and save the results chunk by chunk, so that large
datasets can be processed without exhausting memory.
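Chunked correlation can be illustrated as follows. This is a generic base-R sketch, not the code of corfun1 or corfun2: the helper name chunked_pairs, its arguments, and the use of absolute correlation are all assumptions.

```r
# Hypothetical helper: correlate column blocks pairwise and keep only
# pairs whose absolute correlation exceeds `thresh`.
chunked_pairs <- function(x, chunk_size, thresh) {
  p <- ncol(x)
  starts <- seq(1, p, by = chunk_size)
  out <- list()
  for (a in starts) {
    ia <- a:min(a + chunk_size - 1, p)
    for (b in starts[starts >= a]) {
      ib <- b:min(b + chunk_size - 1, p)
      # Only one block of at most chunk_size x chunk_size values is held at a time.
      r <- cor(x[, ia, drop = FALSE], x[, ib, drop = FALSE])
      hit <- which(abs(r) > thresh, arr.ind = TRUE)
      # Keep each unordered pair once and drop self-correlations.
      hit <- hit[ia[hit[, "row"]] < ib[hit[, "col"]], , drop = FALSE]
      if (nrow(hit) > 0) {
        out[[length(out) + 1]] <- data.frame(
          var1 = colnames(x)[ia[hit[, "row"]]],
          var2 = colnames(x)[ib[hit[, "col"]]],
          r    = r[hit]
        )
      }
    }
  }
  do.call(rbind, out)  # NULL if no pair exceeds the threshold
}

# Small demonstration on simulated data.
set.seed(42)
x <- matrix(rnorm(200 * 5), nrow = 200,
            dimnames = list(NULL, paste0("v", 1:5)))
x[, 4] <- x[, 1] + rnorm(200, sd = 0.05)  # v4 tracks v1 closely
pairs <- chunked_pairs(x, chunk_size = 2, thresh = 0.9)
```

In a file-based workflow, each block's data frame could be written to its own file instead of being collected in `out`.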
partagg, an Rcpp function, efficiently processes and aggregates variables into
modules based on the correlation data read from the temporary files. It ensures
that the size of any module does not exceed the B parameter.
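Size-capped aggregation can be sketched in plain R as a greedy merge over correlated pairs. This is not the partagg implementation, which is written with Rcpp and is not shown here; the function name aggregate_pairs, the pair-table input, and the rule of skipping a merge whenever the combined module would exceed B are assumptions.

```r
# Hypothetical greedy aggregation: merge the modules of each correlated pair
# unless the merged module would have more than B members.
aggregate_pairs <- function(pairs, vars, B) {
  module_of <- setNames(seq_along(vars), vars)          # each variable starts alone
  members   <- setNames(as.list(vars), seq_along(vars)) # module id -> member names
  for (k in seq_len(nrow(pairs))) {
    m1 <- module_of[[as.character(pairs$var1[k])]]
    m2 <- module_of[[as.character(pairs$var2[k])]]
    if (m1 == m2) next                                  # already in the same module
    merged <- c(members[[as.character(m1)]], members[[as.character(m2)]])
    if (length(merged) > B) next                        # respect the size cap
    members[[as.character(m1)]] <- merged
    members[[as.character(m2)]] <- NULL
    module_of[merged] <- m1
  }
  sizes <- lengths(members)
  list(
    modules     = unname(members[sizes > 1]),
    independent = as.character(unlist(members[sizes == 1], use.names = FALSE))
  )
}

# A chain of correlated pairs v1-v2-v3-v4-v5, plus an unconnected v6.
pairs <- data.frame(
  var1 = c("v1", "v2", "v3", "v4"),
  var2 = c("v2", "v3", "v4", "v5"),
  stringsAsFactors = FALSE
)
res <- aggregate_pairs(pairs, vars = paste0("v", 1:6), B = 3)
```

With B = 3, the module containing v1, v2 and v3 is full when the pair v3 and v4 is reached, so v4 is not added to it; v4 and v5 then form their own module, and v6 is reported as independent.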