Building a High Performance Query System for Crystallography Open Database

Xiaohong Zhao, Tennessee State University


The Crystallography Open Database (COD) is an open-access database of crystal structures. The COD dataset is large, and new structures are added to it continuously. COD stores its data in Crystallographic Information Files (CIFs), one structure per file. There are two ways to query the COD: the COD Web interface and the COD MySQL server. However, the COD has some limitations: the Web interface offers only a limited set of search operations over predefined fields; many important structure parameters that are present in the CIFs are not contained in the COD MySQL database; and the COD provides only sequential services for querying and data updates, which is inefficient for a large database. The goal of this thesis is to build a high-performance COD (H-COD) database with an efficient query system on a distributed computer cluster. A step-by-step development approach is followed. First, the original COD dataset is cleaned and unnecessary attributes are deleted. Second, using the MySQL database management system, the tables of H-COD are designed and the crystallographic structure data are inserted into them. Third, the architecture of H-COD is built and the data are distributed across a computer cluster. MPI is chosen as the parallel programming model for performing parallel queries, and a Web-based user interface is developed with the Django Web framework. Finally, the effectiveness and efficiency of H-COD are tested and evaluated. The test results show that H-COD is effective and functional for all designed query operations. H-COD is also efficient and cost-optimal, i.e., with k processors a parallel query in H-COD can run up to k times faster than the equivalent sequential query.
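
The parallel query pattern described above (partition the data across nodes, run the same query on each partition, then merge the partial results) can be illustrated with a minimal sketch. This is not the thesis implementation: a thread pool stands in for the MPI processes, an in-memory list stands in for the per-node MySQL tables, and the record layout, field names, and function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (cod_id, cell_volume). The values are made up for
# illustration and are not taken from the COD.
RECORDS = [(1000000 + i, 100.0 + 7.5 * i) for i in range(20)]


def partition(records, k):
    """Split the records into k shards, round-robin, one shard per worker."""
    return [records[i::k] for i in range(k)]


def query_shard(shard, vmin, vmax):
    """Local query: IDs in this shard whose cell volume lies in [vmin, vmax]."""
    return [cod_id for cod_id, volume in shard if vmin <= volume <= vmax]


def sequential_query(records, vmin, vmax):
    """Baseline: scan the whole dataset in a single worker."""
    return sorted(query_shard(records, vmin, vmax))


def parallel_query(records, k, vmin, vmax):
    """Run the same range query on k shards and merge the partial results.

    Assumption: a thread pool replaces the MPI ranks of a real cluster.
    """
    shards = partition(records, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        partials = pool.map(query_shard, shards, [vmin] * k, [vmax] * k)
        return sorted(cod_id for part in partials for cod_id in part)


if __name__ == "__main__":
    print(parallel_query(RECORDS, 4, 120.0, 160.0))
```

Because each worker scans roughly 1/k of the records, the ideal running time of the scan falls by about a factor of k, which is the sense in which the abstract calls the parallel query cost-optimal. Real speedups also depend on communication and merge overhead, which this sketch does not model.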

Subject Area

Computer science

Recommended Citation

Xiaohong Zhao, "Building a High Performance Query System for Crystallography Open Database" (2016). ETD Collection for Tennessee State University. Paper AAI10119083.