Feature analysis for classification of trace fluorescent labeled protein crystallization images

Sigdel, Madhav; Dinc, Imren; Sigdel, Madhu S.; Dinc, Semih; Pusey, Marc L.; Aygun, Ramazan S.

doi:10.1186/s13040-017-0133-9

Research
Open access
Published: 27 April 2017

BioData Mining volume 10, Article number: 14 (2017)

Abstract

Background

A large number of features are extracted from protein crystallization trial images to improve the accuracy of classifiers that predict the presence of crystals or the phase of the crystallization process. The excessive number of features, together with the computationally intensive image processing methods needed to extract them, makes automated classification tools inconvenient to use on stand-alone computing systems because of the time required to complete the classification tasks. This study investigates combinations of image feature sets, feature reduction, and classification techniques for crystallization images that benefit from trace fluorescence labeling.

Results

Features are categorized into intensity, graph, histogram, texture, shape adaptive, and region features (using binarized images generated by Otsu’s, green percentile, and morphological thresholding). The effects of normalization, feature reduction with principal component analysis (PCA), and feature selection using a random forest classifier are also analyzed. The time required to extract each feature category is computed, and an estimated extraction time is provided for feature category combinations. We conducted around 8624 experiments (different combinations of feature categories, binarization methods, feature reduction/selection, normalization, and crystal categories). The best experimental results are obtained using combinations of intensity features, region features using Otsu’s thresholding, region features using green percentile G₉₀ thresholding, region features using green percentile G₉₉ thresholding, graph features, and histogram features. Using this feature set combination, 96% accuracy (without misclassifying crystals as non-crystals) was achieved for the first level of classification, which determines the presence of crystals. Since missing a crystal is undesirable, our algorithm is adjusted to achieve high sensitivity. In the second-level classification, 74.2% accuracy was achieved for (5-class) crystal sub-category classification. The best classification rates were achieved using the random forest classifier.
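As an illustration of the green percentile thresholding named above, the following is a minimal sketch only. It assumes that G_p denotes the p-th percentile of the green channel and that pixels brighter than this value are treated as foreground; this abstract does not define the method, so the nearest-rank percentile rule, the function names, and the nested-list image layout are all hypothetical choices made for the example.

```python
import math


def green_percentile_threshold(green_values, p):
    """Return the p-th percentile (nearest-rank rule) of green intensities.

    Assumption: G_p is the green value below or at which p% of pixels fall.
    """
    if not green_values:
        raise ValueError("image has no pixels")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(green_values)
    # Nearest-rank percentile: smallest value with at least p% of pixels <= it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def binarize_green_percentile(image, p):
    """Binarize an RGB image given as rows of (r, g, b) tuples.

    Pixels whose green component exceeds the G_p threshold become 1
    (foreground); all other pixels become 0 (background).
    """
    flat_green = [pixel[1] for row in image for pixel in row]
    tau = green_percentile_threshold(flat_green, p)
    return [[1 if pixel[1] > tau else 0 for pixel in row] for row in image]


# Toy 1 x 10 image whose green values are 0, 10, ..., 90.
toy_image = [[(0, g, 0) for g in range(0, 100, 10)]]
mask_90 = binarize_green_percentile(toy_image, 90)
mask_50 = binarize_green_percentile(toy_image, 50)
```

Under these assumptions a high percentile such as G₉₀ or G₉₉ keeps only the brightest green pixels, which is where fluorescently labeled crystalline material would be expected to appear.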

Contributions

The feature extraction and classification could be completed in about 2 s per image on a stand-alone computing system, which is suitable for real-time analysis. These results enable research groups to select features according to their hardware setups for real-time analysis.

Introduction

Protein crystallization is a highly empirical process that depends on numerous factors such as the pH and temperature of the environment, protein concentration, the type of precipitant, the ionic strength of the solution, gravity, and the crystallization method [1]. A combination of these factors suited to the protein being crystallized is critical for the formation of crystals, and predicting these parameters is quite challenging since there is no prior information about the protein solubility [2, 3]. Therefore, thousands of experimental trials may be required for successful crystallization. Today, high-throughput robotic systems are routinely used to increase the chance of obtaining crystals. Because of the large number of trials these systems produce, manual review of crystallization trials becomes impractical in terms of time and resources. Therefore, automated image scoring systems have been developed to collect and classify the crystallization trial images. The fundamental aim is to discard the unsuccessful trials, identify the successful trials, and possibly identify those trials that could be optimized.

Challenges of protein crystallization classification

Imaging techniques are used to capture the state change or the possibility of forming crystals [4]. A reliable system for classifying and analyzing crystallization trials can help crystallographers by reducing the number of tedious manual reviews of unsuccessful outcomes or by reporting the phase of the crystallization process. Such a system requires extracting features from images. These features are used to train a classifier, and the resulting classifier model is then used to classify new trial images. However, building a classifier model with high accuracy is challenging for the following reasons.

1. Many Phases of Crystallization Process. The instruction sheets with crystallization screens from Hampton Research describe 9 possible protein crystallization trial outcomes or phases¹ [5] (Clear drop, Phase separation, Granular precipitate, Microcrystals, Posettes/spherulites, Needles, 2D Plates, Small 3D crystals, Large 3D crystals). Figure 1 shows sample protein crystallization trial images obtained using trace fluorescence labeling [6], where each image corresponds to a specific phase of crystallization. In the analysis of screening images, it is important to predict or detect the current phase of the experiment. Phases that yield crystalline outcomes or likely-leads are more valuable than other categories. Misclassifying an image from a higher category (e.g., crystal category) into a lower category (e.g., non-crystal category) is a serious problem, as it results in a lead condition being missed. Misclassifying a lower-category result into a higher category is not as serious and can be considered a cost of capturing all possible leads.

Fig. 1 Sample protein crystallization trial images: (a-c) non-crystals, (d-f) likely-leads, and (g-i) crystals. Reprinted with permission from [28]. Copyright 2013 American Chemical Society

2. Unbalanced Distribution of Data. The distribution of data across categories (or phases) is unbalanced. Higher (crystalline) categories occur less frequently than lower categories. Classification models can be adversely affected by this imbalance: they may classify in favor of more frequent but less important categories.

3. Complexity of Image Analysis. Non-uniform shapes and varying orientations of crystals add complexity to image analysis. The intra-class diversity of a single crystal sub-category is high, so it is difficult to build a high-accuracy classifier that models all variations.

4. Multiple Types of Crystals in a Single Image. A single image can contain objects (crystals) of different morphologies, such as dendrites and 3D crystals. In such cases, the expected class for the image is the highest class among all crystal objects present.

5. Low and Varying Image Quality. Since crystals float in a 3D well, not all crystals may be captured in focus. To observe the phases of crystallization, images are captured a number of times during the process, and the lighting conditions may vary each time the images are collected. Varying illumination and focus affect both the pre-processing of images and the features used for classification.

6. Ambiguity in Labeling Trial Images. Protein crystallization is an evolving process. In some scenarios, there is a semantic transition between categories, meaning that an image cannot be clearly assigned to one category. Similarly, the ambiguity and subjectivity of the viewer or expert can affect the labeling process or expert scoring.
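Two of the challenges above lend themselves to a short illustration: assigning an image the highest class among the objects it contains, and counting the costly errors in which a higher-category image is predicted as a lower one. The sketch below is a minimal example, not the authors' implementation. It assumes that the nine Hampton Research outcomes are ranked from lowest to highest in the order in which the text lists them; the function names and the `lead_threshold` parameter are hypothetical.

```python
# Outcomes in the order listed in the text; assumed here to run from the
# lowest category (index 0) to the highest (index 8).
PHASES = [
    "Clear drop", "Phase separation", "Granular precipitate",
    "Microcrystals", "Posettes/spherulites", "Needles", "2D Plates",
    "Small 3D crystals", "Large 3D crystals",
]
RANK = {name: index for index, name in enumerate(PHASES)}


def image_class(object_classes):
    """Return the highest-ranked class among the objects found in one image."""
    if not object_classes:
        raise ValueError("no objects to aggregate")
    return max(object_classes, key=RANK.__getitem__)


def missed_leads(true_labels, predicted_labels, lead_threshold):
    """Count images at or above lead_threshold that were predicted below it.

    These are the serious errors described in the text: a lead condition
    is missed because a higher category was scored as a lower one.
    """
    cutoff = RANK[lead_threshold]
    return sum(
        1
        for truth, guess in zip(true_labels, predicted_labels)
        if RANK[truth] >= cutoff and RANK[guess] < cutoff
    )


label = image_class(["Needles", "Small 3D crystals"])
missed = missed_leads(
    ["Needles", "Clear drop", "2D Plates"],
    ["Clear drop", "Needles", "2D Plates"],
    lead_threshold="Microcrystals",
)
```

Here the second image, a clear drop predicted as needles, is not counted: it is the less serious kind of error that the text treats as a cost of capturing all possible leads. Morphologies outside the nine listed outcomes, such as dendrites, would need to be added to the ranking before they could be aggregated.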

Related work

In general, work on protein crystallization trial image analysis is compared with respect to classification accuracy. The accuracy depends on the number of categories, the features, and the ability of the classifiers to model the data. Moreover, hardware resources, training time, and real-time analysis of new images are important factors that affect the usability of these methods. Table 1 summarizes related work with respect to these factors.

Table 1 Summary of related work
