Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Mario Navas

Will defend his thesis

Efficient Computation of PCA with SVD
in a Relational Database System

Abstract

Principal Component Analysis (PCA) is one of the most common dimensionality reduction techniques, with broad applications in data mining, statistics, and signal processing. PCA finds a new set of orthogonal dimensions, each a linear combination of the input dimensions, and projects the data into a new lower-dimensional space while preserving the variability of the original data. Given the mathematical complexity of PCA, it has traditionally been computed outside the database system, forcing the user to export the data set. In this thesis we show it is feasible to solve PCA via Singular Value Decomposition (SVD) entirely inside a DBMS, without any external numerical analysis library. Our solution divides the computation into two phases: one to derive a correlation matrix and a second to solve SVD using the correlation matrix as input. With this approach, our method can efficiently analyze a large data set in a single pass, eliminating the need to export it and allowing the user to exploit the extensive functionality of a DBMS (e.g., querying, security). To solve SVD inside the DBMS, we introduce two basic solutions: one based exclusively on SQL queries and a second based on User-Defined Functions for some key equations. Experimental evaluation shows our method can solve larger problems, and in less time, than state-of-the-art external statistical packages. In summary, our proposal extends a database system with PCA, a well-known and powerful data mining technique.
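
For readers unfamiliar with the two-phase idea, the following is a minimal illustrative sketch in plain Python, not the SQL or UDF implementation described in the thesis. It assumes the correlation matrix can be derived from three summaries gathered in one scan (the row count n, the column sums L, and the cross-product sums Q), and it substitutes a cyclic Jacobi eigenvalue solver for the SVD step; for a symmetric correlation matrix the eigen-decomposition and the SVD yield the same components. All function names here are hypothetical.

```python
import math

def correlation_one_pass(rows):
    """Phase 1 (sketch): one scan accumulating n, L (sums), Q (cross-products)."""
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    # rho_ab = (n*Q_ab - L_a*L_b) / (sqrt(n*Q_aa - L_a^2) * sqrt(n*Q_bb - L_b^2))
    s = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [[(n * Q[a][b] - L[a] * L[b]) / (s[a] * s[b]) for b in range(d)]
            for a in range(d)]

def jacobi_eigen(A, sweeps=50, tol=1e-20):
    """Phase 2 stand-in: cyclic Jacobi rotations on a symmetric matrix."""
    d = len(A)
    A = [row[:] for row in A]                       # work on a copy
    V = [[float(i == j) for j in range(d)] for i in range(d)]
    for _ in range(sweeps):
        off = sum(A[p][q] ** 2 for p in range(d) for q in range(p + 1, d))
        if off < tol:                               # off-diagonal mass is negligible
            break
        for p in range(d):
            for q in range(p + 1, d):
                if abs(A[p][q]) < 1e-15:
                    continue
                # Angle that zeroes A[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                if A[p][p] == A[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):                    # right-multiply by the rotation
                    for k in range(d):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p] = c * mp - s * mq
                        M[k][q] = s * mp + c * mq
                for k in range(d):                  # left-multiply A by its transpose
                    ap, aq = A[p][k], A[q][k]
                    A[p][k] = c * ap - s * aq
                    A[q][k] = s * ap + c * aq
    return [A[i][i] for i in range(d)], V           # eigenvalues, eigenvector columns

def pca_from_correlation(R):
    """Return eigenvalues in descending order with their matching components."""
    vals, V = jacobi_eigen(R)
    order = sorted(range(len(R)), key=lambda j: -vals[j])
    return ([vals[j] for j in order],
            [[V[i][j] for i in range(len(R))] for j in order])

# Small made-up data set, used only to exercise the sketch.
rows = [(2.0, 4.1, 1.0), (3.0, 6.2, 0.5), (5.0, 9.8, 2.0),
        (7.0, 14.1, 1.5), (8.0, 15.9, 3.0)]
R = correlation_one_pass(rows)
eigenvalues, components = pca_from_correlation(R)
```

Because only n, L, and Q are carried between the two phases, the second phase works on a d x d matrix whose size does not depend on the number of rows, which is what makes a single pass over a large table sufficient in this sketch.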

Date: Wednesday, July 29, 2009
Time: 2:00 PM
Place: 550-PGH
Faculty, students, and the general public are invited.
Advisor: Dr. Carlos Ordonez