Some Sparse Methods for High Dimensional Data

authors

  • Saporta Gilbert
  • Bernard Anne
  • Bougeard Stéphanie
  • Niang Ndèye
  • Preda Cristian

keywords

  • High dimensional
  • Sparse
  • Clusterwise
  • PLS

abstract

High-dimensional data are data in which the number of variables p is far larger than the number of observations n. This occurs in several fields, such as genomics and chemometrics. When p > n, the OLS estimator of a linear regression is not uniquely defined. Since this is a case of forced multicollinearity, one may use regularized methods such as ridge regression, principal component regression or PLS regression: these methods provide rather robust estimates through dimension reduction or through constraints on the regression coefficients. The fact that all the predictors are kept may be considered a positive point in some cases. However, if p >> n, it becomes a drawback, since a combination of thousands of variables cannot be interpreted. Sparse combinations, i.e. combinations with a large number of zero coefficients, are then preferred. The lasso, the elastic net and sparse PLS (sPLS) perform regularization and variable selection simultaneously thanks to non-quadratic penalties such as L1 or SCAD. The group lasso is a generalization suited to the case where the explanatory variables are structured in blocks. Recent works include sparse discriminant analysis and sparse canonical correlation analysis. In PCA, the singular value decomposition shows that if we regress the principal components onto the input variables, the vector of regression coefficients is equal to the factor loadings. It therefore suffices to adapt sparse regression techniques to obtain sparse versions of PCA. Sparse Multiple Correspondence Analysis is derived from the group lasso, with groups of indicator variables. Finally, when one has a large number of observations, unobserved heterogeneity frequently occurs, which means that there is no single model but several local models: one for each cluster defined by a latent variable. Clusterwise methods optimize the partition and the local models simultaneously; they have already been extended to PLS regression.
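The contrast between a quadratic and an L1 penalty when p > n can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the simulated data, the penalty values and the helper names `lasso_cd` and `ridge` are assumptions chosen for illustration, and the lasso is solved here with a plain cyclic coordinate descent.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # current residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]            # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]            # put the updated coordinate back
    return b

def ridge(X, y, lam):
    """Minimise ||y - Xb||^2 + lam * ||b||^2 (closed form)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated p > n problem (hypothetical data): only 3 of 200 predictors matter.
rng = np.random.default_rng(0)
n, p = 30, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

rank = np.linalg.matrix_rank(X.T @ X)  # at most n, so X'X is singular
b_ridge = ridge(X, y, lam=1.0)
b_lasso = lasso_cd(X, y, lam=0.5)

print("rank of X'X:", rank, "for p =", p)
print("non-zero ridge coefficients:", np.count_nonzero(b_ridge))
print("non-zero lasso coefficients:", np.count_nonzero(b_lasso))
```

Ridge shrinks the coefficients but keeps every predictor in the combination, whereas the soft-thresholding step of the L1 penalty sets coefficients exactly to zero.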
We present here CS-PLS (Clusterwise Sparse PLS), a combination of clusterwise PLS and sPLS that is well suited to big data: large n, large p.
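The clusterwise idea of optimizing the partition and the local models together can be sketched as an alternating scheme. The sketch below is not CS-PLS: ridge regression stands in for the sparse PLS local models, and the function name `clusterwise_ridge`, its parameters and the simulated two-regime data are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; returns zeros if the cluster is empty."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def clusterwise_ridge(X, y, n_clusters, lam=1e-2, n_iter=20, seed=0):
    """Alternate between fitting one local model per cluster and
    reassigning each observation to the model that predicts it best."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(n_clusters, size=n)  # random initial partition
    history = []
    for _ in range(n_iter):
        # Fit step: one penalised regression per cluster.
        betas = np.stack([
            ridge_fit(X[labels == k], y[labels == k], lam)
            for k in range(n_clusters)
        ])
        # Assignment step: smallest squared residual wins.
        sq_res = (y[:, None] - X @ betas.T) ** 2  # shape (n, n_clusters)
        labels = sq_res.argmin(axis=1)
        # Penalised criterion; each step can only decrease it.
        history.append(sq_res.min(axis=1).sum() + lam * (betas ** 2).sum())
    return labels, betas, history

# Hypothetical data: two latent groups with different regression coefficients.
rng = np.random.default_rng(1)
n, p = 400, 5
X = rng.standard_normal((n, p))
group = rng.integers(2, size=n)
true_betas = np.array([[2.0, 0.0, -1.0, 0.0, 0.5],
                       [-2.0, 1.0, 0.0, 0.0, -0.5]])
y = np.einsum("ij,ij->i", X, true_betas[group]) + 0.1 * rng.standard_normal(n)

labels, betas, history = clusterwise_ridge(X, y, n_clusters=2)
print("criterion at first and last iteration:", history[0], history[-1])
```

Like k-means, such a scheme only reaches a local optimum that depends on the initial partition, so in practice several random starts would be compared.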
