.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/pcovr/PCovR-WHODataset.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_pcovr_PCovR-WHODataset.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_pcovr_PCovR-WHODataset.py:

The Benefits of Kernel PCovR for the WHO Dataset
================================================

.. GENERATED FROM PYTHON SOURCE LINES 9-24

.. code-block:: Python

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import pearsonr
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import Ridge, RidgeCV
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    from skmatter.datasets import load_who_dataset
    from skmatter.decomposition import KernelPCovR, PCovR
    from skmatter.preprocessing import StandardFlexibleScaler

.. GENERATED FROM PYTHON SOURCE LINES 25-27

Load the Dataset
^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 28-33

.. code-block:: Python

    df = load_who_dataset()["data"]
    print(df)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

               Country  Year  ...  SN.ITK.DEFC.ZS  NY.GDP.PCAP.CD
    0      Afghanistan  2005  ...            36.1      255.055120
    1      Afghanistan  2006  ...            33.3      274.000486
    2      Afghanistan  2007  ...            29.8      375.078128
    3      Afghanistan  2008  ...            26.5      387.849174
    4      Afghanistan  2009  ...            23.3      443.845151
    ...            ...   ...  ...             ...             ...
    2015  South Africa  2015  ...             5.2     6204.929901
    2016  South Africa  2016  ...             5.4     5735.066787
    2017  South Africa  2017  ...             5.5     6734.475153
    2018  South Africa  2018  ...             5.7     7048.522211
    2019  South Africa  2019  ...             6.3     6688.787271

    [2020 rows x 12 columns]

.. GENERATED FROM PYTHON SOURCE LINES 35-50
.. code-block:: Python

    columns = [
        "SP.POP.TOTL",
        "SH.TBS.INCD",
        "SH.IMM.MEAS",
        "SE.XPD.TOTL.GD.ZS",
        "SH.DYN.AIDS.ZS",
        "SH.IMM.IDPT",
        "SH.XPD.CHEX.GD.ZS",
        "SN.ITK.DEFC.ZS",
        "NY.GDP.PCAP.CD",
    ]
    X_raw = np.array(df[columns])

.. GENERATED FROM PYTHON SOURCE LINES 51-52

Below, we take the logarithm of the population and GDP to avoid extreme
distributions.

.. GENERATED FROM PYTHON SOURCE LINES 53-63

.. code-block:: Python

    log_scaled = ["SP.POP.TOTL", "NY.GDP.PCAP.CD"]

    for ls in log_scaled:
        print(X_raw[:, columns.index(ls)].min(), X_raw[:, columns.index(ls)].max())
        if ls in columns:
            X_raw[:, columns.index(ls)] = np.log10(X_raw[:, columns.index(ls)])

    y_raw = np.array(df["SP.DYN.LE00.IN"])
    y_raw = y_raw.reshape(-1, 1)
    X_raw.shape

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    149841.0 7742681934.0
    110.460874721483 123678.70214327476

    (2020, 9)

.. GENERATED FROM PYTHON SOURCE LINES 64-66

Scale and Center the Features and Targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 67-80

.. code-block:: Python

    x_scaler = StandardFlexibleScaler(column_wise=True)
    X = x_scaler.fit_transform(X_raw)

    y_scaler = StandardFlexibleScaler(column_wise=True)
    y = y_scaler.fit_transform(y_raw)

    n_components = 2

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=0
    )

.. GENERATED FROM PYTHON SOURCE LINES 81-88

Train the Different Linear DR Techniques
----------------------------------------

Below, we obtain the regression errors using a variety of linear DR techniques.

Linear Regression
^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 89-94

.. code-block:: Python

    RidgeCV(cv=5, alphas=np.logspace(-8, 2, 20), fit_intercept=False).fit(
        X_train, y_train
    ).score(X_test, y_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.8548848257886271

.. GENERATED FROM PYTHON SOURCE LINES 95-97

PCovR
^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 98-117
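The ``mixing`` parameter used in this example (0.5) controls how PCovR trades
structure preservation against regression performance: ``mixing=1`` recovers
plain PCA, while ``mixing=0`` projects almost entirely onto the regression
prediction. The following is a from-scratch sketch of that trade-off on
synthetic data; it is an illustrative simplification, not skmatter's
implementation, and the function name ``pcovr_scores`` is made up for this
sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)

# center and scale, as PCovR expects
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()


def pcovr_scores(X, y, mixing, n_components=2):
    """Toy sample-space PCovR: eigendecompose a Gram matrix that mixes the
    data structure (X X^T) with the regression target (yhat yhat^T)."""
    # least-squares approximation of y from X
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    G = mixing * (X @ X.T) + (1 - mixing) * np.outer(yhat, yhat)
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(vals)[::-1][:n_components]
    # scale eigenvectors so the scores have PCA-like magnitudes
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))


# at mixing=0 the leading component is (up to sign) just the prediction yhat;
# at mixing=1 the scores coincide with ordinary PCA scores
T = pcovr_scores(X, y, mixing=0.0)
```

Intermediate ``mixing`` values, like the 0.5 used below, yield latent spaces
that remain visually interpretable while keeping the target predictable from
only a couple of components.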
.. code-block:: Python

    pcovr = PCovR(
        n_components=n_components,
        regressor=Ridge(alpha=1e-4, fit_intercept=False),
        mixing=0.5,
        random_state=0,
    ).fit(X_train, y_train)
    T_train_pcovr = pcovr.transform(X_train)
    T_test_pcovr = pcovr.transform(X_test)
    T_pcovr = pcovr.transform(X)

    r_pcovr = Ridge(alpha=1e-4, fit_intercept=False, random_state=0).fit(
        T_train_pcovr, y_train
    )
    yp_pcovr = r_pcovr.predict(T_test_pcovr).reshape(-1, 1)

    plt.scatter(y_scaler.inverse_transform(y_test), y_scaler.inverse_transform(yp_pcovr))
    r_pcovr.score(T_test_pcovr, y_test)

.. image-sg:: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_001.png
    :alt: PCovR WHODataset
    :srcset: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_001.png
    :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/scikit-matter/envs/257/lib/python3.13/site-packages/skmatter/decomposition/_pcov.py:50: UserWarning: This class does not automatically center data, and your data mean is greater than the supplied tolerance.
      warnings.warn(

    0.8267220275787428

.. GENERATED FROM PYTHON SOURCE LINES 118-120

PCA
^^^

.. GENERATED FROM PYTHON SOURCE LINES 121-137

.. code-block:: Python

    pca = PCA(
        n_components=n_components,
        random_state=0,
    ).fit(X_train, y_train)
    T_train_pca = pca.transform(X_train)
    T_test_pca = pca.transform(X_test)
    T_pca = pca.transform(X)

    r_pca = Ridge(alpha=1e-4, fit_intercept=False, random_state=0).fit(T_train_pca, y_train)
    yp_pca = r_pca.predict(T_test_pca).reshape(-1, 1)

    plt.scatter(y_scaler.inverse_transform(y_test), y_scaler.inverse_transform(yp_pca))
    r_pca.score(T_test_pca, y_test)

.. image-sg:: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_002.png
    :alt: PCovR WHODataset
    :srcset: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_002.png
    :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.8041174131375703

.. GENERATED FROM PYTHON SOURCE LINES 139-144
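When comparing two-component models like these, it is worth checking how much
of the total variance the components actually capture; scikit-learn's ``PCA``
exposes this as ``explained_variance_ratio_``. A quick sketch on synthetic
data with two dominant directions (illustrative only; the WHO results above
are not reproduced here, and ``X_demo``/``pca_demo`` are names invented for
this sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data: rank-2 signal plus small isotropic noise in 9 dimensions
basis = rng.standard_normal((2, 9))
X_demo = rng.standard_normal((500, 2)) @ basis + 0.05 * rng.standard_normal((500, 9))

pca_demo = PCA(n_components=2).fit(X_demo)
# fraction of total variance captured by each component, sorted descending
print(pca_demo.explained_variance_ratio_)
print(pca_demo.explained_variance_ratio_.sum())
```

If the two leading ratios sum well below one on real data, a two-component
map necessarily discards structure, which is part of why the supervised
mixing in PCovR can help.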
.. code-block:: Python

    for c, x in zip(columns, X.T):
        print(c, pearsonr(x, T_pca[:, 0])[0], pearsonr(x, T_pca[:, 1])[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    SP.POP.TOTL -0.22694404485361055 -0.3777743593940685
    SH.TBS.INCD -0.6249287177098704 0.6316215151702456
    SH.IMM.MEAS 0.842586228381343 0.13606904827472627
    SE.XPD.TOTL.GD.ZS 0.41457342404840136 0.6100854823971251
    SH.DYN.AIDS.ZS -0.3260933054303097 0.8499296260662148
    SH.IMM.IDPT 0.8422637385674645 0.16339769662915174
    SH.XPD.CHEX.GD.ZS 0.45900120895545243 0.30686303937881865
    SN.ITK.DEFC.ZS -0.8212324937958553 0.055108835843951376
    NY.GDP.PCAP.CD 0.8042167907410392 0.06566227478694868

.. GENERATED FROM PYTHON SOURCE LINES 145-157

Train the Different Kernel DR Techniques
----------------------------------------

Below, we obtain the regression errors using a variety of kernel DR techniques.

Select Kernel Hyperparameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the original publication, we used a cross-validated grid search to determine
the best hyperparameters for the kernel ridge regression. We do not rerun this
expensive search in this example but use the obtained parameters for ``gamma``
and ``alpha``. You may rerun the calculation locally by setting ``recalc=True``.

.. GENERATED FROM PYTHON SOURCE LINES 158-178

.. code-block:: Python

    recalc = False

    if recalc:
        param_grid = {"gamma": np.logspace(-8, 3, 20), "alpha": np.logspace(-8, 3, 20)}

        clf = KernelRidge(kernel="rbf")
        gs = GridSearchCV(estimator=clf, param_grid=param_grid)
        gs.fit(X_train, y_train)

        gamma = gs.best_estimator_.gamma
        alpha = gs.best_estimator_.alpha
    else:
        gamma = 0.08858667904100832
        alpha = 0.0016237767391887243

    kernel_params = {"kernel": "rbf", "gamma": gamma}

.. GENERATED FROM PYTHON SOURCE LINES 179-181

Kernel Regression
^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 182-186

.. code-block:: Python

    KernelRidge(**kernel_params, alpha=alpha).fit(X_train, y_train).score(X_test, y_test)

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    0.9726524136785997

.. GENERATED FROM PYTHON SOURCE LINES 187-189

KPCovR
^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 190-209

.. code-block:: Python

    kpcovr = KernelPCovR(
        n_components=n_components,
        regressor=KernelRidge(alpha=alpha, **kernel_params),
        mixing=0.5,
        **kernel_params,
    ).fit(X_train, y_train)

    T_train_kpcovr = kpcovr.transform(X_train)
    T_test_kpcovr = kpcovr.transform(X_test)
    T_kpcovr = kpcovr.transform(X)

    r_kpcovr = KernelRidge(**kernel_params).fit(T_train_kpcovr, y_train)
    yp_kpcovr = r_kpcovr.predict(T_test_kpcovr)

    plt.scatter(y_scaler.inverse_transform(y_test), y_scaler.inverse_transform(yp_kpcovr))
    r_kpcovr.score(T_test_kpcovr, y_test)

.. image-sg:: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_003.png
    :alt: PCovR WHODataset
    :srcset: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_003.png
    :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.9701003539460163

.. GENERATED FROM PYTHON SOURCE LINES 210-212

KPCA
^^^^

.. GENERATED FROM PYTHON SOURCE LINES 213-228

.. code-block:: Python

    kpca = KernelPCA(n_components=n_components, **kernel_params, random_state=0).fit(
        X_train, y_train
    )

    T_train_kpca = kpca.transform(X_train)
    T_test_kpca = kpca.transform(X_test)
    T_kpca = kpca.transform(X)

    r_kpca = KernelRidge(**kernel_params).fit(T_train_kpca, y_train)
    yp_kpca = r_kpca.predict(T_test_kpca)

    plt.scatter(y_scaler.inverse_transform(y_test), y_scaler.inverse_transform(yp_kpca))
    r_kpca.score(T_test_kpca, y_test)

.. image-sg:: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_004.png
    :alt: PCovR WHODataset
    :srcset: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_004.png
    :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.6661226058827727

.. GENERATED FROM PYTHON SOURCE LINES 229-231

Correlation of the different variables with the KPCovR axes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 232-236
.. code-block:: Python

    for c, x in zip(columns, X.T):
        print(c, pearsonr(x, T_kpcovr[:, 0])[0], pearsonr(x, T_kpcovr[:, 1])[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    SP.POP.TOTL 0.07320109486755187 0.03969226130174684
    SH.TBS.INCD 0.6836177728806814 -0.05384746771432407
    SH.IMM.MEAS -0.6604939713030802 0.047519698518210675
    SE.XPD.TOTL.GD.ZS -0.23009788930020397 -0.3622748865999962
    SH.DYN.AIDS.ZS 0.5157981075022208 -0.1170132700029201
    SH.IMM.IDPT -0.6449500965012953 0.05262226781868083
    SH.XPD.CHEX.GD.ZS -0.38019935560127377 -0.5736426627623917
    SN.ITK.DEFC.ZS 0.7301250686596462 0.04793454286747634
    NY.GDP.PCAP.CD -0.82286600973303 -0.49386365697113266

.. GENERATED FROM PYTHON SOURCE LINES 237-239

Plot Our Results
----------------

.. GENERATED FROM PYTHON SOURCE LINES 240-315

.. code-block:: Python

    fig, axes = plt.subplot_mosaic(
        """
        AFF.B
        A.GGB
        .....
        CHH.D
        C.IID
        .....
        EEEEE
        """,
        figsize=(7.5, 7.5),
        gridspec_kw=dict(
            height_ratios=(0.5, 0.5, 0.1, 0.5, 0.5, 0.1, 0.1),
            width_ratios=(1, 0.1, 0.2, 0.1, 1),
        ),
    )
    axPCA, axPCovR, axKPCA, axKPCovR = axes["A"], axes["B"], axes["C"], axes["D"]
    axPCAy, axPCovRy, axKPCAy, axKPCovRy = axes["F"], axes["G"], axes["H"], axes["I"]


    def add_subplot(ax, axy, T, yp, let=""):
        """Add a latent-space map and the corresponding parity plot to the figure."""
        p = ax.scatter(-T[:, 0], T[:, 1], c=y_raw, s=4)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.annotate(
            xy=(0.025, 0.95), xycoords="axes fraction", text=f"({let})", va="top", ha="left"
        )
        axy.scatter(
            y_scaler.inverse_transform(y_test),
            y_scaler.inverse_transform(yp),
            c="k",
            s=1,
        )
        axy.plot([y_raw.min(), y_raw.max()], [y_raw.min(), y_raw.max()], "r--")
        axy.annotate(
            xy=(0.05, 0.95),
            xycoords="axes fraction",
            text=r"R$^2$=%0.2f" % round(r2_score(y_test, yp), 3),
            va="top",
            ha="left",
            fontsize=8,
        )
        axy.set_xticks([])
        axy.set_yticks([])
        return p


    p = add_subplot(axPCA, axPCAy, T_pca, yp_pca, "a")
    axPCA.set_xlabel("PC$_1$")
    axPCA.set_ylabel("PC$_2$")

    add_subplot(axPCovR, axPCovRy, T_pcovr @ np.diag([-1, 1]), yp_pcovr, "b")
    axPCovR.yaxis.set_label_position("right")
    axPCovR.set_xlabel("PCov$_1$")
    axPCovR.set_ylabel("PCov$_2$", rotation=-90, va="bottom")

    add_subplot(axKPCA, axKPCAy, T_kpca @ np.diag([-1, 1]), yp_kpca, "c")
    axKPCA.set_xlabel("Kernel PC$_1$", fontsize=10)
    axKPCA.set_ylabel("Kernel PC$_2$", fontsize=10)

    add_subplot(axKPCovR, axKPCovRy, T_kpcovr, yp_kpcovr, "d")
    axKPCovR.yaxis.set_label_position("right")
    axKPCovR.set_xlabel("Kernel PCov$_1$", fontsize=10)
    axKPCovR.set_ylabel("Kernel PCov$_2$", rotation=-90, va="bottom", fontsize=10)

    plt.colorbar(
        p, cax=axes["E"], label="Life Expectancy [years]", orientation="horizontal"
    )

    fig.subplots_adjust(wspace=0, hspace=0.4)
    fig.suptitle(
        "Linear and Kernel PCovR for Predicting Life Expectancy", y=0.925, fontsize=10
    )
    plt.show()

.. image-sg:: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_005.png
    :alt: Linear and Kernel PCovR for Predicting Life Expectancy
    :srcset: /examples/pcovr/images/sphx_glr_PCovR-WHODataset_005.png
    :class: sphx-glr-single-img

.. _sphx_glr_download_examples_pcovr_PCovR-WHODataset.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: PCovR-WHODataset.ipynb <PCovR-WHODataset.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: PCovR-WHODataset.py <PCovR-WHODataset.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: PCovR-WHODataset.zip <PCovR-WHODataset.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_