Current protein structure prediction methods often generate a large population of candidate models and then select near-native models through clustering. Existing structural model clustering methods are time-consuming because they calculate pairwise distances between models. We developed a novel method for fast model clustering that does not sacrifice clustering accuracy. Instead of the commonly used pairwise RMSD and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of protein distance matrices, which describe the difference and the similarity among models, respectively. Our analysis indicates that Dscore1 correlates highly with RMSD and that Dscore2 correlates highly with TM-score. Our Dscore1-based clustering achieves a calculation time that is linear in the number of models, while obtaining almost the same accuracy for near-native model selection as existing methods whose calculation time is quadratic in the number of models. By using Dscore2 to select the representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large sets of models (~500k), we can provide a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL. The executable code and some examples can be downloaded from the following links.
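To illustrate why a distance-matrix comparison can avoid the quadratic cost, the following minimal Python sketch may help. It is not the definition of Dscore1 or Dscore2, which this abstract does not give. It assumes a hypothetical difference measure (the root-mean-square difference between intra-model distance matrices) and a hypothetical strategy of comparing every model with the mean distance matrix. All function names are illustrative. Because intra-model distances are unchanged by rotation and translation, no superposition is needed, and each model is compared once with a single reference rather than with every other model.

```python
import math

def distance_vector(coords):
    """Flatten the upper triangle of one model's pairwise distance matrix."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def matrix_difference(u, v):
    """RMS difference between two distance vectors.

    Hypothetical stand-in for a difference score; not the paper's Dscore1.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def differences_to_mean(models):
    """Compare each model with the mean distance vector.

    This takes N comparisons for N models instead of N*(N-1)/2 pairs.
    """
    vectors = [distance_vector(m) for m in models]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [matrix_difference(v, mean) for v in vectors]

def rotate_z(coords, angle):
    """Rotate a set of 3D points about the z axis (used only for the demo)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# Toy four-residue "models" (made-up coordinates, not real structures).
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]
model_b = rotate_z(model_a, 1.0)  # same shape, different orientation
model_c = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

scores = differences_to_mean([model_a, model_b, model_c])
```

In this toy example, `model_a` and `model_b` receive the same score because they differ only by a rotation, while the extended `model_c` lies farther from the mean. A real implementation would likely operate on Cα coordinates read from model files and would then bin or cluster the resulting scores.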
