
MUFOLD-CL for Big Data

With the rapid accumulation of large-scale data of many kinds, data mining plays an important role in discovering new knowledge and patterns from big data, which is usually both very "tall" (a large number of data points) and very "wide" (high dimensional). Clustering is one of the most important techniques in large-scale data mining. Because of the size and high dimensionality of the data, most current clustering methods break down, with low accuracy and long running times. To address this issue, we propose a fast projection-based clustering method. Unlike traditional subspace projected clustering algorithms, we project the data points onto a centroid, which is selected by a purifying and expanding strategy. The similarity between two data points can then be calculated from their projections on the centroid. Our method achieves linear time complexity with respect to the number of data points. We compared the method with K-Means on several very large datasets in terms of clustering accuracy and running time. The significantly better accuracy and shorter running time demonstrate that our method can serve as a valuable tool for big data clustering. We also compared its performance with a number of state-of-the-art clustering methods on smaller datasets, where our method was the fastest and achieved comparable accuracy. The method has been implemented in MUFOLD-CL, which is available for download at http://mufold.org/clustering.php.
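The idea of comparing points through their projections on a centroid can be illustrated with a minimal sketch. It is not the MUFOLD-CL implementation: the function names, the use of a scalar projection, and the `tolerance` threshold are all assumptions made for illustration, and the purifying and expanding strategy that selects the centroid is not shown (the centroid is simply given).

```python
import math

def scalar_projection(point, centroid):
    """Length of `point` along the direction of `centroid`."""
    norm = math.sqrt(sum(c * c for c in centroid))
    return sum(p * c for p, c in zip(point, centroid)) / norm

def projection_gap(a, b, centroid):
    """Hypothetical dissimilarity: difference of the two scalar projections."""
    return abs(scalar_projection(a, centroid) - scalar_projection(b, centroid))

def members_near_centroid(points, centroid, tolerance):
    """Single pass over the data: flag points whose projection lies
    within `tolerance` of the centroid's own projection."""
    target = scalar_projection(centroid, centroid)  # equals the norm of the centroid
    return [abs(scalar_projection(p, centroid) - target) <= tolerance
            for p in points]

# Toy data: two points near (5, 5) and two near (-5, -5).
points = [(5.0, 5.1), (4.9, 5.0), (-5.0, -4.9), (-5.1, -5.0)]
centroid = (5.0, 5.0)  # assumed to be given
mask = members_near_centroid(points, centroid, tolerance=1.0)
gap_same = projection_gap(points[0], points[1], centroid)
gap_diff = projection_gap(points[0], points[2], centroid)
```

Each point is touched once per centroid, so the cost of this pass grows linearly with the number of points. A single projection discards information orthogonal to the centroid direction, so this sketch alone would not separate points that differ only in that subspace.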
Files           Specification
MUFOLD_CL       Linux version of MUFOLD-CL for big data
Example Data    Data for examples
Readme          The help information
Reference: Yun Wu, Zhiquan He, Hao Lin, Yufei Zheng, Jingfen Zhang, Dong Xu: A fast projection-based algorithm for clustering big data. Under review.

MUFOLD-CL for Protein Structure (Old Version)

Current protein structure prediction methods often generate a large population of candidates (models) and then select near-native models through clustering. Existing structural model clustering methods are time consuming because they calculate pairwise distances between models. We developed a novel method for fast model clustering without loss of clustering accuracy. Instead of the commonly used pairwise RMSD and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of protein distance matrices, which describe the difference and the similarity among models, respectively. Our analysis indicates that the correlation between Dscore1 and RMSD and the correlation between Dscore2 and TM-score are both high. Our Dscore1-based clustering achieves a calculation time linearly proportional to the number of models while obtaining almost the same accuracy for near-native model selection as existing methods whose calculation time is quadratic in the number of models. By using Dscore2 to select the representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large sets of models (~500k), we can provide a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL. The executables and some examples can be downloaded from the following links.
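The definitions of Dscore1 and Dscore2 are not given on this page, so the sketch below does not reproduce them. It only illustrates the general idea of comparing two models through their internal distance matrices, using the standard distance-matrix RMS difference as a stand-in. The function names and the toy coordinates are assumptions made for illustration.

```python
import math
from itertools import combinations

def distance_matrix_entries(coords):
    """Upper-triangle pairwise distances within one model."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def distance_matrix_rms(model_a, model_b):
    """RMS difference between the distance matrices of two models
    that have the same number of atoms in the same order."""
    da = distance_matrix_entries(model_a)
    db = distance_matrix_entries(model_b)
    if len(da) != len(db):
        raise ValueError("models must have the same number of atoms")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))

# Toy three-atom "models" (hypothetical coordinates).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
# The same shape rotated 90 degrees about z and shifted by (1, 1, 1).
moved = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (-2.8, 4.8, 1.0)]
# A different shape: the chain is straightened.
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

same_shape = distance_matrix_rms(model, moved)
different_shape = distance_matrix_rms(model, stretched)
```

Because a distance matrix is unchanged by rotation and translation, this kind of comparison needs no structural superposition, which is one reason distance-matrix measures are attractive for large model populations.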
Files                         Specification
MUFOLD_CL_ProteinStructure    Linux version. Tools for structural decoys clustering
Decoys                        Decoys for examples
Natives                       Native structures for examples
Readme                        The help information
Reference: Jingfen Zhang, Dong Xu: Fast algorithm for population-based protein structural model analysis. Proteomics. 2013 Jan 1;13(2):221-9