Efficient and effective pattern detection for high dimensional microarray data

Linda Emujakporue, Tennessee State University

Abstract

Microarray experiments measure the expression levels of many thousands of genes in biological samples. Gene expression data are usually high-dimensional, with each gene or sample representing one dimension. A fundamental task in microarray data analysis is to find the signatures or patterns in gene expression data. Clustering can be used to show the relationships among samples, genes, or both, and to reveal the patterns in them (such as normal vs. diseased cells). Clustering microarray data is challenging because of the redundancy and high dimensionality of the data. In this thesis, efficient and effective approaches that combine Principal Component Analysis (PCA) and k-means clustering are explored. First, PCA is investigated. PCA is shown to be powerful in its ability to represent complex gene expression data sets succinctly, and it is used to reduce the dimensionality of data sets and remove redundant data without loss of important information. Then, a novel clustering algorithm is proposed. k-means clustering is a well-known technique for partitioning data into k clusters. However, it requires a predefined k. In order to find the optimal clustering, a new function f(k) is mathematically defined to measure the quality of k clusters. The best k is the one with the highest value of f(k) among all values of k. Instead of an exhaustive search over all values of k, which requires O(k*T) time, where T is the time complexity of k-means clustering, a ternary search algorithm is proposed. It is shown that if f(k) is a unimodal function, the best k-means clustering can be found in O(log k*T) time. Test results in comparison with NbClust, a well-regarded R package for estimating the best k in a dataset using 26 criteria, also show that the proposed approach is efficient and effective.
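The ternary search over k described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the definition of f(k), so the `score` argument below is a hypothetical placeholder for it, and the function name `ternary_search_best_k` and the bounds `k_min` and `k_max` are likewise assumptions made for illustration. The sketch also assumes f(k) is strictly unimodal (strictly increasing up to its peak, then strictly decreasing); with plateaus the search may miss the maximum.

```python
from typing import Callable, Dict


def ternary_search_best_k(score: Callable[[int], float], k_min: int, k_max: int) -> int:
    """Return the k in [k_min, k_max] that maximizes a strictly unimodal score.

    `score` is a hypothetical stand-in for the thesis's quality function f(k);
    in practice each call would run k-means once, at cost T.
    """
    cache: Dict[int, float] = {}

    def f(k: int) -> float:
        # Cache evaluations so no k is clustered more than once.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    lo, hi = k_min, k_max
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1 = lo + third
        m2 = hi - third
        if f(m1) < f(m2):
            # The peak lies strictly to the right of m1.
            lo = m1 + 1
        else:
            # The peak lies strictly to the left of m2.
            hi = m2 - 1
    # At most three candidates remain; pick the best directly.
    return max(range(lo, hi + 1), key=f)


if __name__ == "__main__":
    # Toy unimodal score that peaks at k = 7 (illustrative only, not real data).
    toy_score = lambda k: -(k - 7) ** 2
    print(ternary_search_best_k(toy_score, 2, 50))
```

Each iteration discards roughly a third of the remaining range, so the number of score evaluations grows logarithmically with the size of the range, compared with one evaluation per candidate k for an exhaustive search. In a real pipeline, `score(k)` would cluster the PCA-reduced data with k-means and return the quality measure for that clustering.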

Subject Area

Systematic|Bioinformatics

Recommended Citation

Linda Emujakporue, "Efficient and effective pattern detection for high dimensional microarray data" (2014). ETD Collection for Tennessee State University. Paper AAI1599457.
https://digitalscholarship.tnstate.edu/dissertations/AAI1599457
