TY - JOUR
T1 - Sparse kernel canonical correlation analysis for discovery of nonlinear interactions in high-dimensional data
AU - Yoshida, Kosuke
AU - Yoshimoto, Junichiro
AU - Doya, Kenji
N1 - Publisher Copyright: © 2017 The Author(s).
PY - 2017/2/14
Y1 - 2017/2/14
AB - Background: Advance in high-throughput technologies in genomics, transcriptomics, and metabolomics has created demand for bioinformatics tools to integrate high-dimensional data from different sources. Canonical correlation analysis (CCA) is a statistical tool for finding linear associations between different types of information. Previous extensions of CCA used to capture nonlinear associations, such as kernel CCA, did not allow feature selection or capturing of multiple canonical components. Here we propose a novel method, two-stage kernel CCA (TSKCCA) to select appropriate kernels in the framework of multiple kernel learning. Results: TSKCCA first selects relevant kernels based on the HSIC criterion in the multiple kernel learning framework. Weights are then derived by non-negative matrix decomposition with L1 regularization. Using artificial datasets and nutrigenomic datasets, we show that TSKCCA can extract multiple, nonlinear associations among high-dimensional data and multiplicative interactions among variables. Conclusions: TSKCCA can identify nonlinear associations among high-dimensional data more reliably than previous nonlinear CCA methods.
KW - Hilbert-Schmidt independent criterion
KW - Kernel canonical correlation analysis
KW - L1 regularization
UR - http://www.scopus.com/inward/record.url?scp=85012884938&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85012884938&partnerID=8YFLogxK
U2 - 10.1186/s12859-017-1543-x
DO - 10.1186/s12859-017-1543-x
M3 - Article
C2 - 28196464
AN - SCOPUS:85012884938
SN - 1471-2105
VL - 18
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 108
ER -