
m1A-Ensem: accurate identification of 1-methyladenosine sites through ensemble models

Abstract

Background

1-methyladenosine (m1A) is a modified form of adenosine that carries a methyl group at the N1 position and plays a prominent role in RNA stability and human metabolism.

Objective

Traditional approaches, such as mass spectrometry and site-directed mutagenesis, have proven to be time-consuming and complicated.

Methodology

The present research focused on the identification of m1A sites within RNA sequences using novel feature development mechanisms. The obtained features were used to train the ensemble models, including blending, boosting, and bagging. Independent testing and k-fold cross validation were then performed on the trained ensemble models.

Results

The proposed model outperformed the preexisting predictors, achieving higher scores on all major accuracy metrics.

Conclusion

For research purposes, a user-friendly webserver of the proposed model can be accessed through https://taseersuleman-m1a-ensem1.streamlit.app/.


Introduction

1-methyladenosine (m1A) sites are reported to be present in transfer RNA (tRNA), messenger RNA (mRNA), and ribosomal RNA (rRNA). In tRNA, these sites occur in the TΨC loop at position 58, as shown in Fig. 1. The identification of m1A sites is significant because of their prominent role in various human diseases such as mitochondrial respiratory chain defects, neurodevelopmental regression, X-linked intractable epilepsy, and obesity [1,2,3]. Moreover, this post-transcriptional modification is actively involved in protein translation, reverse transcription, and repression processes in tumors. Predicting m1A sites is therefore critical for fully comprehending their potential functions. Site-directed mutagenesis and mass spectrometry have been proposed as methods for detecting m1A sites, although both are complex and time-consuming [4]. The availability of sequence-based datasets has increased the possibility of applying computational intelligence methods for the prediction of such modification sites.

Fig. 1 Position 58 in the tRNA loop contains the 1-methyladenosine site

Chen et al. [5] initially developed a predictor, RAMPred, for the identification of m1A sites using Homo sapiens, Mus musculus, and Saccharomyces cerevisiae samples. The obtained RNA samples were encoded using the nucleotide chemical property (NCP) scheme, and the resulting features were used to train a support vector machine (SVM) based model. The results revealed 99.13% accuracy (ACC), 99.89% specificity (Sp), 98.38% sensitivity (Sn), and a 0.98 Matthews correlation coefficient (MCC). The researchers also developed an online webserver for RAMPred. In another study, Chen et al. [6] developed a predictor, iRNA-3typeA, for the identification of three types of RNA methylation sites, including 6-methyladenosine (m6A), m1A, and adenosine-to-inosine (A-to-I). The same Homo sapiens and Mus musculus data samples used previously in RAMPred were employed. The results revealed an accuracy score of 99.13% for Homo sapiens and 98.73% for Mus musculus. Samples 41 nucleotides in length were used, and a cross-validation test was carried out for performance evaluation. Liu et al. [7] suggested a prediction model, ISGm1A, that extracts 75 genomic features from the RNA sequences; five machine learning models were trained and validated through independent testing and cross validation. Sun et al. [8] developed a deep learning framework, DeepMRMP, based on a bidirectional gated recurrent unit (BGRU) for the identification of multiple RNA post-transcriptional modification (PTM) sites in Homo sapiens, Mus musculus, and Saccharomyces cerevisiae species. One-hot encoding was used to encode the nucleotides within a sequence, i.e., A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], U = [0,0,0,1]. The model revealed 70.5% ACC, 0.85 Sn, 0.95 Sp, and 0.83 MCC.
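To make the one-hot scheme quoted above concrete, the short Python sketch below maps each nucleotide to its one-hot vector. The helper name is ours and is not part of the cited DeepMRMP implementation.

```python
import numpy as np

# One-hot scheme referenced above:
# A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], U = [0,0,0,1].
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode an RNA sequence into a (len(seq), 4) one-hot matrix."""
    return np.array([ONE_HOT[base] for base in seq.upper()])

print(one_hot_encode("AUGC"))
```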

Previous research studies dealt with the identification of m1A sites through traditional machine learning algorithms. However, such models are prone to imbalanced-data issues, overfitting and underfitting, and limited context understanding. The current study proposes a novel framework for the prediction of m1A sites using ensemble models. These models were categorized into blending, bagging, and boosting, which provide more rigorous training on the dataset. It is worth mentioning that RAMPred, iRNA-3typeA, ISGm1A, and DeepMRMP used the same dataset for training and validation. The dataset is composed of RNA sequences belonging to four species: Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Schizosaccharomyces pombe. The extraction of meaningful attributes from the sequences was carried out by considering the position and composition of nucleotide bases. Statistical moments were calculated, which helped reduce the dimensionality of the features produced by several of the attribute-extraction metrics. The performance of these ensemble models was evaluated through k-fold cross validation and independent set testing. Accuracy metrics such as ACC, Sp, Sn, and MCC were used to evaluate the ensemble models quantitatively. The results revealed that the proposed model outperformed the preexisting m1A site predictors on all accuracy metrics. This research study was conducted in different phases, including benchmark dataset assortment, feature extraction and sample formulation, model development, training, and testing. Ultimately, a publicly accessible webserver was also made available to facilitate m1A site detection. The methodology framework is depicted in Fig. 2.

Fig. 2 Current research methodology

Materials and methods

Dataset collection

The dataset was acquired from RMBase v2.0 [9] and contains RNA samples from four species, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Schizosaccharomyces pombe, designated as HS_17880, SC_3406, MM_4232, and SP_958. The dataset details are given in Table 1. After applying CD-HIT at an 80% similarity cutoff, 11,978 positive samples and 12,716 negative samples were obtained. The cutoff was set at 80% because of the large number of samples and the possibility of homology existing within them. A window size of 41 nucleotides was chosen for each RNA sample because verified samples of this length were available and this length yielded the best overall performance. An m1A site-expressing RNA sample is described in Eq. (1).

$$B\left(A\right)={B}_{-\xi }{B}_{-\left(\xi -1\right)}\cdots {B}_{-2}{B}_{-1}\,A\,{B}_{+1}{B}_{+2}\cdots {B}_{+\left(\xi -1\right)}{B}_{+\xi }$$
(1)

where "\(A\)" represents the modified adenine of the RNA sequence carrying the methylated m1A site and \(\xi\) denotes the number of nucleotides flanking the central adenine on each side (\(\xi = 20\) for the 41 nt windows used here).
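As an illustration of Eq. (1), the sketch below extracts 41 nt windows centered on each adenosine from a longer RNA string. The function name and the flank parameter are ours, used only to demonstrate the sampling scheme.

```python
def extract_windows(rna: str, flank: int = 20) -> list:
    """Return (position, window) pairs for every adenosine that has `flank`
    bases on both sides, i.e. a 2*flank + 1 = 41 nt sample with the
    candidate A at the centre, following Eq. (1)."""
    windows = []
    for i, base in enumerate(rna):
        if base == "A" and i >= flank and i + flank < len(rna):
            windows.append((i + 1, rna[i - flank:i + flank + 1]))
    return windows

# Example: every window is 41 nt long with "A" at index 20 (1-based position 21).
samples = extract_windows("U" * 25 + "A" + "G" * 25)
print(samples[0][1][20])  # -> "A"
```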

Table 1 Details of RNA samples used in this study

The arrangement of nucleotide bases within the acquired sequences can be visualized using a sequence logo. To achieve this, the online tool "Two Sample Logo" was utilized. Figure 3 displays the sequence logo, which represents the distribution of cytosine (C), guanine (G), adenine (A), and uracil (U) within the dataset.

Fig. 3 Two Sample Logo of the data samples representing nucleotide distributions

The sample logo illustrates the concentration of "U" and "A" nucleotides throughout the sequences, with the central position 21 containing the candidate "A". Moreover, the nucleotide "G" is distributed fairly symmetrically along the whole sample. It can also be observed that "C" is mainly located between positions 19 and 23 within the nucleotide sequence.

Feature extraction and development phase

Feature extraction is the most important phase of the computational procedure. During this stage, features are extracted to emphasize the dataset's unique characteristics [10]. Due to recent advances in information and data sciences, biotechnology has made major strides forward. Yet, the most difficult aspect is the development of computationally sophisticated models that transform raw biological input into counted, quantified vectors, since all inputs to machine learning algorithms are vectors. Moreover, the loss of any sequence or its associated properties must be prevented. The current research adopted a novel feature extraction method that uses various matrices and vectors to attain useful attributes from the sequences. These specialized vectors and matrices were developed in-house to extract both apparent and concealed features within the sequences, which helps in building more robust computational models for identifying m1A sites in an optimized way. To prevent the complete loss of sequence-pattern information, Chou developed the pseudo-amino acid composition for proteins (PseAAC) [11]; the pseudo-K-tuple nucleotide composition (PseKNC) was later formulated as a result of the PseAAC success [12, 13]. An RNA sequence, \(X\), can be illustrated as shown in Eq. (2).

$$X={X}_{1},{X}_{2},{X}_{3},\dots ,{X}_{i},\dots ,{X}_{n}$$
(2)

where

$${X}_{i}\in \{C\left(cytosine\right), A\left(adenine\right), G\left(guanine\right), U\left(uracil\right)\}$$

represents a nitrogenous base at an arbitrary position within an RNA sample. The genomic data used in this study were transformed into a feature vector, \(f{\prime}\), as shown in Eq. (3).

$$f{\prime}= {\left[{f}_{1}{f}_{2}{f}_{3}{f}_{4}\dots {f}_{ u}\dots {f}_{ \Omega }\right]}^{\intercal }$$
(3)

Here, \({f}_{u}\) denotes an arbitrary numerical coefficient characterizing a single feature, and the transpose yields the discrete coefficients as a column vector.

Statistical moments calculation

A fixed-length feature vector was computed from the genomic data using statistical moments [14]. Statistical moments are essential tools in statistics and probability theory that provide valuable information about the distribution of data. They describe the shape, central tendency, spread, and other characteristics of a dataset, which makes them useful in a wide range of applications, including data analysis, modeling, and decision-making. Moments of various distributions have been studied extensively by analysts and mathematicians [15]. By computing the raw, central, and Hahn moments, a compact feature set was generated from the otherwise colossal input vector; the moments therefore served as a dimensionality-reduction step. The feature set was expanded to incorporate the scale and location information captured by these moments to help differentiate between functionally distinct sequences. Scientific investigations show that genomic and proteomic sequence-based characteristics change with the content and relative location of their bases [16]. Hence, the feature vector is best generated using mathematical and computational models that are sensitive to the relative location of component bases within genomic sequences. The raw, central, and Hahn moments transform the features into compact coefficients that accurately reflect the data's mean and spread. When deciphering a sequence, moments that are sensitive to scale and position, such as the raw and Hahn moments, are preferable. A two-dimensional matrix, Ƕʹ, was built from the sequences, with each entry, Ƕmn, representing the \({n}_{th}\) nucleotide base in the \({m}_{th}\) sequence, as expressed in Eq. (4).

$$Ƕ{\prime}= \begin{bmatrix}Ƕ_{11}& Ƕ_{12}& \cdots & Ƕ_{1n}\\ Ƕ_{21}& Ƕ_{22}& \cdots & Ƕ_{2n}\\ \vdots & \vdots & \ddots & \vdots \\ Ƕ_{m1}& Ƕ_{m2}& \cdots & Ƕ_{mn}\end{bmatrix}$$
(4)

Raw moments are used to derive location-variant characteristics from the extracted features [17]. The raw moments are described in Eq. (5), where the order of a moment is given by the sum of its indices. The coefficients Ɲ00, Ɲ01, Ɲ10, Ɲ11, Ɲ12, Ɲ21, Ɲ30, and Ɲ03 were determined up to the third-degree polynomial [18, 19].

$${N}_{jk}= {\sum }_{c =1}^{m}{\sum }_{d =1}^{m}{c}^{j}{d}^{k}{\beta }_{cd}$$
(5)

The central moments are independent of the nucleotides' location; instead, they are associated with the composition and shape of the distribution [20]. For the current study, the central moments were computed as expressed in Eq. (6).

$${n}_{ij}= {\sum }_{b =1}^{n}{\sum }_{q =1}^{n}{\left(b- \bar{x}\right)}^{i}{\left(q- \bar{y}\right)}^{j}{\beta }_{bq}$$
(6)

Orthogonal moments are often preferred because they represent data with the least amount of redundant information. Owing to the reversible nature of these moments, the predictor still receives the effect of the whole sequence within the reduced feature vector, even though the original sequences have been drastically shortened to a fixed length. The Hahn polynomials can be written as follows:

$${h}_{n}^{u,v}\left(r,N\right)= {\left(N+v-1\right)}_{n}{\left(N-1\right)}_{n}\times {\sum }_{k=0}^{n}{\left(-1\right)}^{k}\frac{{\left(-n\right)}_{k}{\left(-r\right)}_{k}{\left(2N+u+v-n-1\right)}_{k}}{{\left(N+v -1\right)}_{k}{\left(N-1\right)}_{k}}\frac{1}{k!}$$
(7)

where \((u,v)\) are adjustable parameters that control the polynomial shape. Given a sequence arranged as a two-dimensional \(M \times M\) matrix, the Hahn moments can be computed as given in Eq. (8).

$${H}_{ij}= {\sum }_{q=0}^{N-1}{\sum }_{p=0}^{N-1}{\beta }_{pq}\,{h}_{i}^{\widetilde{u},v}\left(q,N\right)\,{h}_{j}^{\widetilde{u},v}\left(p,N\right),\qquad i,j=0,1,\ldots ,N-1$$
(8)
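For readers who want a concrete starting point, the sketch below computes the raw moments of Eq. (5) and the central moments of Eq. (6) for a two-dimensional matrix built from a sequence; the Hahn moments of Eqs. (7) and (8) are omitted because they additionally require evaluating Hahn polynomials. Variable names are ours, and the snippet is illustrative rather than the authors' implementation.

```python
import numpy as np

def raw_and_central_moments(beta: np.ndarray, max_order: int = 3):
    """Compute raw moments N_jk (Eq. 5) and central moments n_ij (Eq. 6)
    up to the given order for a 2-D matrix built from a sequence."""
    m, n = beta.shape
    c = np.arange(1, m + 1).reshape(-1, 1)   # row indices
    d = np.arange(1, n + 1).reshape(1, -1)   # column indices
    total = beta.sum()
    x_bar = (c * beta).sum() / total         # centroid coordinates
    y_bar = (d * beta).sum() / total
    raw, central = {}, {}
    for j in range(max_order + 1):
        for k in range(max_order + 1):
            if j + k <= max_order:
                raw[(j, k)] = float((c ** j * d ** k * beta).sum())
                central[(j, k)] = float(((c - x_bar) ** j * (d - y_bar) ** k * beta).sum())
    return raw, central

# Toy example on a 3 x 3 matrix of integer-coded nucleotides.
raw, central = raw_and_central_moments(np.array([[1., 2., 3.], [4., 1., 2.], [3., 4., 1.]]))
print(raw[(0, 0)], central[(1, 1)])
```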

Position Relative Incidence Matrix (PRIM)

The position relative incidence matrix (PRIM) was used to represent the relative positioning of nucleotide bases within an RNA sample [21]. The matrix \({{\varvec{E}}}_{{\varvec{P}}{\varvec{R}}{\varvec{I}}{\varvec{M}}}\) in Eq. (9) is a \(4\times 4\) matrix that captures, for any single nucleotide \({{\text{V}}}_{{\varvec{m}}}\) at position \("{\varvec{m}}"\), its relative position with respect to the other nucleotides within a sequence. The matrix generated 16 unique coefficients.

$${E}_{PRIM}= \begin{bmatrix}{V}_{A\to A}& {V}_{A\to G}& {V}_{A\to U}& {V}_{A\to C}\\ {V}_{G\to A}& {V}_{G\to G}& {V}_{G\to U}& {V}_{G\to C}\\ {V}_{U\to A}& {V}_{U\to G}& {V}_{U\to U}& {V}_{U\to C}\\ {V}_{C\to A}& {V}_{C\to G}& {V}_{C\to U}& {V}_{C\to C}\end{bmatrix}$$
(9)

where \({{\text{V}}}_{i\to j}\) represents the relative positioning of an arbitrary nucleotide base with respect to any other base within a sequence. The occurrence of nucleotide base pairs (i.e., AA, AG, AU, …, CG, CU, CC) is also significant in the feature extraction process. To account for the relative occurrence of these base pairs, a \(16\times 16\) matrix, \(\check{U}_{PRIM}\), was formed as shown in Eq. (10), which yields 256 coefficients.

$${\check{U}}_{PRIM}= \left[\begin{array}{ccccccc}{\check{U}}_{AA\to AA}& {\check{U}}_{AA\to AG}& {\check{U}}_{AA\to AU}& \cdots & {\check{U}}_{AA\to j}& \cdots & {\check{U}}_{AA\to CC}\\ {\check{U}}_{AG\to AA}& {\check{U}}_{AG\to AG}& {\check{U}}_{AG\to AU}& \cdots & {\check{U}}_{AG\to j}& \cdots & {\check{U}}_{AG\to CC}\\ {\check{U}}_{AU\to AA}& {\check{U}}_{AU\to AG}& {\check{U}}_{AU\to AU}& \cdots & {\check{U}}_{AU\to j}& \cdots & {\check{U}}_{AU\to CC}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {\check{U}}_{GU\to AA}& {\check{U}}_{GA\to AG}& {\check{U}}_{GU\to AU}& \cdots & {\check{U}}_{GA\to j}& \cdots & {\check{U}}_{GA\to CC}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {\check{U}}_{N\to AA}& {\check{U}}_{N\to AG}& {\check{U}}_{N\to AU}& \cdots & {\check{U}}_{N\to j}& \cdots & {\check{U}}_{N\to CC}\end{array}\right]$$
(10)

Similarly, another matrix, ȽPRIM, shown in Eq. (11), was formed for the tri-nucleotide base combinations (i.e., AAA, AAG, AAU, …, CCG, CCU, CCC). A total of 4096 coefficients were yielded by this matrix. The central, Hahn, and raw moments were computed for \({E}_{PRIM}\), \(\check{U}_{PRIM}\), and ȽPRIM, which yielded coefficients up to order 3.

$$Ƚ_{PRIM}= \begin{bmatrix}Ƚ_{AAA\to AAA}& Ƚ_{AAA\to AAG}& \cdots & Ƚ_{AAA\to CCC}\\ Ƚ_{AAG\to AAA}& Ƚ_{AAG\to AAG}& \cdots & Ƚ_{AAG\to CCC}\\ \vdots & \vdots & \ddots & \vdots \\ Ƚ_{CCC\to AAA}& Ƚ_{CCC\to AAG}& \cdots & Ƚ_{CCC\to CCC}\end{bmatrix}$$
(11)
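Because the text does not spell out the exact accumulation rule for the PRIM entries, the sketch below adopts one plausible reading, summing the relative offsets from each occurrence of base i to every later occurrence of base j, purely to illustrate how a 4 × 4 PRIM-style matrix can be populated; it is not the authors' code.

```python
import numpy as np

BASES = "AGUC"  # same row/column order as Eq. (9)

def prim_matrix(seq: str) -> np.ndarray:
    """Illustrative 4 x 4 PRIM-style matrix: entry (i, j) accumulates the
    relative offsets from every occurrence of base i to every later
    occurrence of base j. The accumulation rule is an assumed reading."""
    idx = {b: k for k, b in enumerate(BASES)}
    E = np.zeros((4, 4))
    for p, bi in enumerate(seq):
        for q in range(p + 1, len(seq)):
            E[idx[bi], idx[seq[q]]] += q - p
    return E

print(prim_matrix("AUGCAUGC"))
```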

Reverse Position Relative Incidence Matrix (RPRIM)

The primary objective of determining feature vectors is to collect as much relevant information as possible to develop an accurate prediction model. Reversing the sequence order yields a reverse position relative incidence matrix (RPRIM), which extracts additional information contained within the sequences [22]. As with the PRIM matrices, RPRIM was calculated for the mononucleotide, dinucleotide, and trinucleotide combinations. For this purpose, ƦRPRIM was computed according to Eq. (12).

(12)

Frequency vector determination

The positional and compositional information of a sequence is crucial in developing a feature set [23, 24]. The composition of the sequence can be determined by counting the frequency of each nucleotide. A frequency vector (Ᵹ) stores the count for each nucleotide or nucleotide pair in the sequence, and the method for calculating this vector is described in Eq. (13).

(13)

where each element of Ᵹ is the count of the \({i}_{th}\) nucleotide in a sequence.

Generation of Accumulative Absolute Position Incidence Vector (AAPIV)

The accumulative absolute position incidence vector (AAPIV) provides information on the occurrence of each individual nucleotide base in a sequence [25]. This method collects and accumulates data related to the occurrence of nucleotide bases, including single and paired nucleotide bases [26, 27]. To achieve this, three AAPIV vectors of different granularity were generated, named \({S}_{AAPIV4}\), \({S}_{AAPIV16}\), and \({S}_{AAPIV64}\) and given in Eqs. (14), (15), and (16): \({S}_{AAPIV4}\) covers the four single nucleotides, \({S}_{AAPIV16}\) the sixteen dinucleotide pairs, and \({S}_{AAPIV64}\) the sixty-four trinucleotide combinations. These vectors provide a useful tool for analyzing the composition of nucleotide sequences and can be used in a variety of biological applications.

$${S}_{AAPIV4}=\left\{{\delta }_{1},{\delta }_{2},{\delta }_{3},{\delta }_{4}\right\}$$
(14)
$${S}_{AAPIV16}=\left\{{\delta }_{1},{\delta }_{2},{\delta }_{3},\dots ,{\delta }_{16}\right\}$$
(15)
$${S}_{AAPIV64}=\left\{{\delta }_{1},{\delta }_{2},{\delta }_{3},\dots ,{\delta }_{64}\right\}$$
(16)

where each element, \({\delta }_{i}\), is calculated as provided in Eq. (17).

$${\delta }_{i}={\sum }_{k=1}^{n}{{\text{p}}}_{k}$$
(17)

Reverse Accumulative Absolute Position Incidence Vector (RAAPIV) Generation

To analyze the reversed sequences, a reverse accumulative absolute position incidence vector (RAAPIV) was devised in this research. It involves reversing the order of the nucleotide sequence in order to gain a different perspective on the underlying data. Three types of nucleotide combinations were examined using the RAAPIV: single-nucleotide, di-nucleotide, and tri-nucleotide combinations. The vector length differs for each combination, with a length of 4 for single nucleotides, 16 for di-nucleotides, and 64 for tri-nucleotides. Expressions (18), (19), and (20) refer to the single-nucleotide, dinucleotide, and trinucleotide combinations, respectively. Overall, this technique provides a way to gain new insights into genetic sequences by analyzing them from a different perspective.

$${J}_{RAAPIV4}=\left\{{j}_{1,}{j}_{2,}{j}_{3,}{j}_{4}\right\}$$
(18)
$${J}_{RAAPIV16}=\left\{{j}_{1,}{j}_{2,}{j}_{3,}\dots ,{j}_{16}\right\}$$
(19)
$${J}_{RAAPIV64}=\left\{{j}_{1,}{j}_{2,}{j}_{3,}\dots ,{j}_{64}\right\}$$
(20)
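The sketch below illustrates one plausible implementation of the frequency vector, the mononucleotide AAPIV (reading Eq. (17) as the sum of positions at which a base occurs), and the RAAPIV obtained by applying the same accumulation to the reversed sequence. The function names and the exact accumulation rule are our assumptions for illustration, not the authors' code.

```python
import numpy as np

BASES = "ACGU"

def frequency_vector(seq: str) -> np.ndarray:
    """Frequency vector: count of each nucleotide (Eq. 13)."""
    return np.array([seq.count(b) for b in BASES])

def aapiv(seq: str) -> np.ndarray:
    """AAPIV at mononucleotide granularity: for each base, the sum of the
    (1-based) positions at which it occurs (our reading of Eq. 17)."""
    return np.array([sum(i + 1 for i, b in enumerate(seq) if b == base) for base in BASES])

def raapiv(seq: str) -> np.ndarray:
    """RAAPIV: the same accumulation applied to the reversed sequence (Eqs. 18-20)."""
    return aapiv(seq[::-1])

seq = "AUGCAUGGA"
print(frequency_vector(seq), aapiv(seq), raapiv(seq))
```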

Feature vector formulation

The outcome of the feature extraction operation was a single feature vector per sample, which was then used as the prediction model input and contains 522 distinct values collected from PRIM, RPRIM, FV, AAPIV, and RAAPIV. For binary classification, positive samples were labelled as "1" and negative samples as "0" [28, 29]. Table 2 details the number of features obtained from each vector or matrix individually.

Table 2 Number of features obtained from each vector and matrix

Ensemble models development and training

Ensemble methods have gained popularity in the field of machine learning due to their enhanced prediction capabilities compared with conventional single-model approaches [30, 31]. These methods combine the strengths of multiple models to achieve better overall performance, and they can be broadly classified into parallel and sequential methods. In addressing real-world challenges, ensemble models help in building trust through model aggregation, prediction across different patterns using diverse classifiers, and feature-based analysis. Parallel ensemble methods, such as bootstrap aggregation (bagging), involve training multiple models concurrently on different subsets of the data. Sequential ensemble methods, on the other hand, involve training models sequentially, with each subsequent model learning from the errors of the previous one. Ensemble-based classification has been reported in various research studies. Akbar et al. [20] devised a genetic algorithm-based ensemble method for the identification of anticancer peptides, which achieved optimized accuracy scores. In another research study, the authors devised an ensemble-based model for the identification of antitubercular peptides, with reported accuracy scores above 90% [32]. Ahmad et al. [33] proposed iAFPs-EnC-GA, an ensemble learning based model for the identification of antifungal peptides. In the present investigation, three distinct ensemble approaches were applied: blending, bagging, and boosting.

Blending ensemble

Blending is an ensemble technique that combines the outputs of multiple classification or regression models using a meta-classifier or meta-regressor [34, 35]. In this approach, the base-level models are first trained, and their outputs are then used as features for the meta-model. The meta-model leverages the knowledge of the base models to make more accurate and robust predictions. The current investigation employed four base models: an artificial neural network (ANN), a k-nearest neighbor (KNN) classifier, a support vector machine (SVM), and a decision tree (DT). A gradient boosting classifier was chosen as the meta-classifier to combine the outputs of these base models. Hyperparameter optimization is an essential step in machine learning, as it ensures that each model performs at its best. Table 3 presents the details of the hyperparameter optimization for all the classifiers used in the blending ensemble.
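A minimal scikit-learn sketch of such a blending scheme is shown below. The synthetic data, the train/hold-out split, and the hyperparameter values are placeholders; the tuned values actually used in the study are those listed in Table 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for the 522-dimensional feature vectors.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=1)

# Base learners (ANN, KNN, SVM, DT); hyperparameter values are illustrative only.
bases = [MLPClassifier(max_iter=500, random_state=1),
         KNeighborsClassifier(n_neighbors=5),
         SVC(probability=True, random_state=1),
         DecisionTreeClassifier(max_depth=10, random_state=1)]

# Train base models, then use their hold-out probabilities as meta-features.
meta_train = np.column_stack([m.fit(X_train, y_train).predict_proba(X_hold)[:, 1] for m in bases])
meta_model = GradientBoostingClassifier(random_state=1).fit(meta_train, y_hold)

# At prediction time, new samples pass through the base models first.
meta_new = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in bases])
print(meta_model.score(meta_new, y_hold))
```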

Table 3 Parameters tuning of the blending ensemble model

Bagging ensemble

The bagging ensemble methods in this research were deployed such that the training samples were divided into smaller subsamples for the base models using subsampling with replacement and row sampling. This strategy ensures that each base model is trained on a different subset of the data, promoting diversity among the individual models and reducing the overall variance of the ensemble [36].

The test data were evaluated using the trained base models, and the final prediction was obtained through a voting mechanism, which typically involves majority voting for classification tasks or averaging for regression tasks. Four bagging models, namely the bagging classifier, random forest, extra trees, and decision tree classifier, were developed and trained as part of the investigation. For improved results, all the bagging classifiers were subjected to hyperparameter optimization. Hyperparameters such as the number of trees (n_estimators), the depth of each tree (max_depth), the maximum number of features (max_features), and a few other important parameters such as min_samples_split, bootstrap, and min_samples_leaf were considered. Table 4 contains the hyperparameter optimization information for the aforementioned bagging models.
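The following sketch instantiates the four bagging-style models in scikit-learn; the hyperparameter values shown are illustrative defaults rather than the tuned values reported in Table 4.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# The four bagging-style models named above; parameter values are placeholders,
# the tuned values used in the study are listed in Table 4.
models = {
    "bagging": BaggingClassifier(n_estimators=100, bootstrap=True, random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=200, max_depth=None,
                                            max_features="sqrt", min_samples_split=2,
                                            min_samples_leaf=1, bootstrap=True, random_state=1),
    "extra_trees": ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=1),
    "decision_tree": DecisionTreeClassifier(max_depth=20, random_state=1),
}

# for name, model in models.items():
#     model.fit(X_train, y_train)          # X_train/y_train: the 70% training split
#     print(name, model.score(X_test, y_test))
```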

Table 4 Parameters tuning of the bagging ensemble models

Boosting ensemble

The boosting ensemble approach is designed to optimize the model based on the output of the preceding model in the sequence. It operates sequentially, with each model focusing on reducing the differentiable loss by learning from the errors of the previous model. This process helps boost the overall performance of the ensemble by combining the strengths of multiple weak learners. In the current investigation, several boosting ensemble training approaches were employed, including gradient boosting, histogram-based gradient boosting (HGB), AdaBoost, and extreme gradient boosting (XGB). To optimize the performance of the boosting ensemble models, various hyperparameters were fine-tuned, as shown in Table 5. Figure 4 depicts the concept diagram of ensemble model implementation for the current research study, which includes blending, boosting, and bagging.
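A corresponding sketch for the boosting models is given below; gradient boosting, histogram-based gradient boosting, and AdaBoost come from scikit-learn, while XGBoost is the separate xgboost package. Again, the hyperparameter values are placeholders; the tuned values are listed in Table 5.

```python
from sklearn.ensemble import (GradientBoostingClassifier, HistGradientBoostingClassifier,
                              AdaBoostClassifier)
# XGBoost is a separate package (pip install xgboost).
from xgboost import XGBClassifier

# The four boosting models named above; these hyperparameter values are only
# placeholders, the tuned values used in the study are listed in Table 5.
boosters = {
    "gradient_boost": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                 max_depth=3, random_state=1),
    "hist_gradient_boost": HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1,
                                                          random_state=1),
    "adaboost": AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=1),
    "xgboost": XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6,
                             eval_metric="logloss", random_state=1),
}
# Each model is then fitted on the training split and scored on the held-out data,
# exactly as in the bagging sketch above.
```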

Table 5 Hyper-parameters optimization of the boosting ensemble models
Fig. 4 Ensemble model development, training, and testing for the current research study using RNA samples from RMBase: (A) Blending ensemble, (B) Bagging ensemble, (C) Boosting ensemble

Results and discussion

The trained models were validated using independent set testing and tenfold cross validation. The independent test was carried out using the standard "train-test" split method. Tenfold cross validation is a more rigorous test that divides the whole dataset into ten subsamples, where one subsample is used for testing while the other nine are used for training. Different accuracy metrics, including ACC, Sp, Sn, and MCC, were used to score the performance of all ensemble models.

Metrics for evaluation

In this research, four metrics, \({S}_{n}\), \({S}_{p}\), \(Acc\), and \(MCC\), were used to evaluate the prediction models [37, 38]. The effectiveness of a classification model can be measured in terms of its \(Acc\), which is the ratio of the model's correct predictions to the total number of predictions, i.e., the fraction of the dataset that was properly predicted. Specificity \(({S}_{p})\) evaluates a binary classification model when the negative class is of greater importance; it measures the proportion of all negative instances that the model correctly identifies as true negatives (TN). Sensitivity \(({S}_{n})\) evaluates the model when the positive class is of greater importance; it measures the proportion of all positive instances that the model correctly identifies as true positives (TP). The Matthews correlation coefficient \((MCC)\) evaluates a binary classification model, particularly when the classes are imbalanced, by taking into account the numbers of true and false positives and negatives to give a balanced measure of the model's performance. The equations for these metrics are given in Eq. (21).

$$\left\{\begin{array}{ll}{S}_{n}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}} & 0\le {S}_{n}\le 1\\ {S}_{p}=\frac{{\text{TN}}}{{\text{TN}}+{\text{FP}}} & 0\le {S}_{p}\le 1\\ Acc=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{FP}}+{\text{FN}}+{\text{TN}}} & 0\le Acc\le 1\\ MCC=\frac{{\text{TP}}\times {\text{TN}}-{\text{FP}}\times {\text{FN}}}{\sqrt{({\text{TP}}+{\text{FP}})({\text{TP}}+{\text{FN}})({\text{TN}}+{\text{FP}})({\text{TN}}+{\text{FN}})}} & -1\le MCC\le 1\end{array}\right.$$
(21)

TP denotes correctly identified m1A sites, whereas TN denotes correctly identified non-m1A sites. FN represents the number of actual m1A sites that were misidentified as non-m1A sites, and FP stands for the number of non-m1A sites that were misidentified as m1A sites. However, it is important to note that these measurements only apply to single-class (binary) systems [39]. The false positive and false negative counts play crucial roles in the performance evaluation of the system: an increase in false positives leads to spurious m1A site detections within a given RNA sample, while an increase in false negatives causes genuine m1A sites to be missed.
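The four metrics of Eq. (21) can be computed directly from the confusion-matrix counts, as in the sketch below; the counts in the example call are made up for illustration and are not the study's results.

```python
import numpy as np

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC as defined in Eq. (21)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Toy confusion-matrix counts (not the study's results):
print(classification_metrics(tp=3500, tn=3700, fp=114, fn=93))
```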

Data preprocessing

The obtained feature set was preprocessed using the standard scaling utility of scikit-learn's preprocessing module [40]. Missing values were removed and the features were standardized before being input to the machine learning models.
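A minimal sketch of this preprocessing step with scikit-learn's StandardScaler is shown below; the random matrix merely stands in for the real 522-dimensional feature set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the 522-dimensional feature matrix.
X = np.random.rand(100, 522)

# Zero-mean / unit-variance scaling with scikit-learn's StandardScaler [40].
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0)[:3], X_scaled.std(axis=0)[:3])
```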

Independent set testing

Independent set testing was carried out to validate all the ensemble models, including blending, bagging, and boosting. The independent set was created using the standard "train-test split" method with 70% of the data for training and 30% for testing [41, 42]. There were 8385 positive and 8901 negative training samples, and 3593 positive and 3814 negative test samples. It is important to mention that the training and test samples were kept separate from each other. Table 6 contains the results revealed by all the ensemble models deployed in the current research, and Fig. 5 depicts the area under the ROC curve (AUROC) of the ensemble models in independent testing.
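The split can be reproduced with scikit-learn's train_test_split, as sketched below. The placeholder arrays stand in for the real feature matrix and labels, and the stratification and random seed are our assumptions, since the text does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the 522-dimensional feature matrix and the binary labels
# (1 = m1A site, 0 = non-m1A site) described earlier.
X = np.random.rand(24694, 522)
y = np.random.randint(0, 2, size=24694)

# Standard 70% / 30% "train-test split"; stratification keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```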

Table 6 Independent testing result
Fig. 5 ROC curves of independent testing: (A) Boosting ensemble, (B) Blending ensemble, (C) Bagging ensemble

10-Fold cross validation

The cross-validation approach tests all the samples by splitting the dataset into "k" disjoint folds [43, 44]. This more stringent test demonstrates the robustness of a model. In each round, k-1 folds (partitions) were used to train the model, while testing was performed on the left-over fold [45]. The test was repeated ten times because k = 10 folds were used in this study. The cross-validation results are listed in Table 7.
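A sketch of this protocol with scikit-learn is shown below, using the HGB classifier as an example model; the commented lines indicate how it would be applied to the actual feature matrix and labels.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import HistGradientBoostingClassifier

# Ten disjoint folds: in each round nine folds train the model and one fold tests it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = HistGradientBoostingClassifier(random_state=42)

# scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")  # X, y as above
# print(scores.mean(), scores.std())
```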

Table 7 10-Fold cross validation results

Several statistical tests were conducted to verify the effectiveness of the ensemble models implemented in this study. The primary goal of these tests was to compare the performance of the various learning algorithms in achieving accurate classification outcomes. One of the tests conducted was a two-proportion test, commonly referred to as the Z test, on the ensemble models. This Z test was utilized to assess whether a significant distinction existed between two sets of samples; to establish such a distinction, the p-value needed to be below 0.05, indicating rejection of the null hypothesis. Furthermore, a resampled paired t-test was employed, using a predetermined set of trials, to compare the accuracy of the algorithms. McNemar's test, another statistical test, was applied to evaluate the significance of the difference between two proportions in a 2 × 2 contingency table. The resulting p-values from these tests are listed in Table 8.
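The sketch below shows how the two-proportion Z test and McNemar's test can be run with statsmodels; all counts are invented for illustration and do not correspond to the values in Table 8.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.contingency_tables import mcnemar

# Two-proportion Z test: do two models differ in the number of correctly
# classified test samples? Counts below are illustrative, not the study's values.
correct = np.array([7200, 7050])   # correct predictions of model A and model B
n_obs = np.array([7407, 7407])     # size of the independent test set
z_stat, p_value = proportions_ztest(correct, n_obs)
print("Z test p-value:", p_value)

# McNemar's test on a 2 x 2 contingency table of paired predictions:
# rows = model A correct/wrong, columns = model B correct/wrong.
table = np.array([[6900, 300],
                  [150, 57]])
print("McNemar p-value:", mcnemar(table, exact=False, correction=True).pvalue)
```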

Table 8 Statistical test results of blending, boosting and bagging ensemble models

The violin plot is a graphical representation that combines elements of a box plot and a kernel density plot to display the distribution of numerical data for one or more groups [46]. It uses density curves to illustrate the probability density of the data at different values, giving a clear visualization of the data distribution, including its central tendency, dispersion, and shape. Key elements of a violin plot include (1) a central white dot representing the median of the data, which indicates the middle value when the data is sorted in ascending order; (2) a black bar in the middle of the violin, showing the interquartile range (IQR), which represents the spread of the middle 50% of the data; and (3) dark black lines extending from the black bar to the lower and higher neighboring values, indicating the range of the data within 1.5 times the IQR from the lower and upper quartiles. Figure 6 displays the violin plots of the accuracy values obtained in each fold for the best ensemble models in the blending, bagging, and boosting categories.

Fig. 6 Violin plots of the 10-fold cross-validation accuracy (Acc) results for (A) Blending ensemble, (B) Bagging ensemble, and (C) Boosting ensemble

The application of supervised machine learning models can prove beneficial in various categorization tasks. Nonetheless, relying solely on numerical predictions might not be enough; gaining a comprehensive understanding of the actual decision boundary that delineates the different groups is also crucial. Consequently, the classification algorithms employed in this research were further examined using decision surfaces. A decision surface map is a visual representation in which a trained machine learning model makes predictions over a coarse grid covering the input feature space. This method allows for a better understanding of the model's decision-making process by illustrating the regions in which the model assigns a particular class to input data points. Figure 7 displays the decision surface plots of the classification algorithms used in this research. By examining these plots, one can gain insight into how the algorithms differentiate between the various classes and the effectiveness of their decision-making process. This information can be valuable for refining the models, improving their accuracy, and ensuring more reliable outcomes in categorization tasks.
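A minimal sketch of how such a decision surface map can be produced is given below; it uses two synthetic features and an HGB classifier purely for illustration, whereas the actual plots in Fig. 7 were generated from the study's feature space and trained models.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Two synthetic features stand in for a 2-D projection of the real feature space.
X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, random_state=3)
model = HistGradientBoostingClassifier(random_state=3).fit(X, y)

# Evaluate the trained model on a coarse grid and colour each region by its prediction.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = model.predict(np.column_stack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.title("Decision surface (illustrative)")
plt.show()
```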

Fig. 7 Boundary visualization of the ensemble models used in this study: (A) Input data, (B) Blending, (C) Random Forest, (D) Extra Tree, (E) Decision Tree, (F) Bagging, (G) Gradient Boost, (H) Histogram Gradient Boost, (I) AdaBoost, (J) XGBoost

Comparison with preexisting predictors

The proposed model was built on the best performing HGB ensemble model and compared with preexisting predictors to assess its efficacy on the independent datasets. The predictors were RAMPred, DeepMRMP, iRNA-3typeA, and ISGm1A. It was observed that the proposed model, m1A-Ensem, outperformed them, exhibiting 0.99 ACC, 0.98 Sp, 0.97 Sn, and 0.98 MCC. The comparative results are given in Table 9. The use of vectors and matrices helped in extracting obscured features within the sequences, and the hyperparameter optimization of the ensemble models helped in attaining promising accuracy scores. The identification of m1A sites is vital as this RNA modification has been implicated in various diseases such as mitochondrial respiratory chain defects, neurodevelopmental regression, X-linked intractable epilepsy, and obesity. Moreover, m1A sites contribute to gene regulation processes such as splicing, RNA stability, and other regulatory mechanisms, and the modification is also involved in RNA folding and structural stability. Detecting these sites accurately is a critical step towards understanding the mechanisms behind these diseases and developing effective biomarkers for drug discovery. To address this issue, a comprehensive strategy was developed that involves feature development and representation, merging multiple computational models, and testing the model using a variety of methodologies. This approach resulted in a predictive model that outperforms existing models in identifying m1A sites. Extensive trials have shown that the proposed model has a high degree of precision, resilience, and scalability. Its accuracy in identifying modified m1A sites has been demonstrated through various testing methodologies, indicating its potential usefulness in research. Overall, the development of this predictive model represents a significant advancement in the field of RNA modification research, providing a valuable tool for researchers and clinicians in their efforts to better understand and treat diseases associated with m1A sites.

Table 9 Comparison with preexisting predictors

Limitations and future work

A limitation of the current work is the availability of RNA samples from only a few species. The number of available samples also limits the possibilities for training computational models. Moreover, the discovery of new m1A site-related samples will require the development of new models and their training on the latest data, which will inevitably affect the results. The scope of the study is also limited to the development of ensemble models for the identification of m1A sites; the prediction of m1A sites through deep learning models using the available data samples can be attempted in the future.

Web server availability

A web server offers a quick and simple way to perform computational analysis, and the availability of such online resources aids researchers in future investigations. The m1A-Ensem webserver, a free online implementation of the proposed model, was created with this objective in mind and is accessible at https://taseersuleman-m1a-ensem1.streamlit.app/. It has four tabs: "Home", "Predictor", "Dataset", and "Citations". The "Home" tab contains the description of the m1A prediction model; Fig. 8 shows a screenshot of the webserver. The "Predictor" tab contains a sample sequence and an input area, where a user can input a sequence of any length; Fig. 9 shows the "Predictor" tab with the "Example" sequence button and the input area. After clicking the "Submit" button, a result is generated for each adenosine (A) site, indicating whether it is an m1A or a non-m1A site. Figure 10 shows a sequence with the actual position of each site and its status (m1A or non-m1A site). Similarly, the "Dataset" tab contains the dataset samples used for training and testing the models, as depicted in Fig. 11.

Fig. 8 Screenshot of the m1A-Ensem prediction webserver

Fig. 9 Webserver "Predictor" page with the "Example" sequence

Fig. 10 Webserver identifying m1A and non-m1A sites within an RNA sample

Fig. 11 "Dataset" tab representing the positive and negative samples

Conclusion

This study focused on detecting one of the most common post-transcriptional modifications, 1-methyladenosine (m1A), in RNA sequences using ensemble methods. Identifying m1A sites is crucial as this modification is associated with various human diseases, including mitochondrial respiratory chain defects, neurodevelopmental regression, X-linked intractable epilepsy, and obesity. A novel feature extraction mechanism was developed, taking into account both the positional and compositional attributes of nucleotides within RNA sequences. By calculating statistical moments, feature dimensionality reduction was achieved, streamlining the analysis. The resulting feature set was used to train several ensemble models based on blending, bagging, and boosting techniques. The trained models underwent evaluation through cross-validation and independent testing. Performance was assessed using well-known accuracy metrics such as accuracy, sensitivity, specificity, and the Matthews correlation coefficient. The proposed model, m1A-Ensem, was constructed from the best-performing ensemble model, and a comparative analysis was conducted against existing predictors to gauge its effectiveness. The results demonstrated that m1A-Ensem outperformed the other predictors on all accuracy metrics. Consequently, it can be concluded that the proposed model successfully enhanced the ability to identify modified m1A sites by employing the techniques described above. In summary, this research developed a novel approach to detect m1A sites in RNA sequences, which has implications for understanding and potentially treating various human diseases. By incorporating ensemble methods and a unique feature extraction mechanism, the m1A-Ensem model demonstrated superior performance compared with existing predictors, highlighting its potential for further applications in this field.

Availability of data and materials

The data and code of the current research study are available at https://github.com/taseersuleman/m1A-ensem-model.

References

  1. Metodiev MD, Thompson K, Alston CL, Morris AAM, He L, Assouline Z, et al. Recessive mutations in TRMT10C cause defects in Mitochondrial RNA processing and multiple respiratory chain deficiencies. Am J Hum Genet. 2016;98(5):993–1000.


  2. Falk MJ, Gai X, Shigematsu M, Vilardo E, Takase R, McCormick E, et al. A novel HSD17B10 mutation impairing the activities of the mitochondrial Rnase P complex causes X-linked intractable epilepsy and neurodevelopmental regression. RNA Biol. 2016;13(5):477–85.


  3. Oie S, Matsuzaki K, Yokoyama W, Tokunaga S, Waku T, Han SI, et al. Hepatic rRNA transcription regulates high-fat-diet-induced obesity. Cell Rep. 2014;7(3):807–20.


  4. Madec E, Stensballe A, Kjellstro S, Obuchowski M, Jensen ON, Cladie L, et al. Mass spectrometry and site-directed mutagenesis identify several Autophosphorylated residues required for the activity of PrkC, a Ser / Thr Kinase from Bacillus subtilis. J Mol Biol. 2003;2836(03):459–72.


  5. Chen W, Feng P, Tang H, Ding H, Lin H. RAMPred: Identifying the N1-methyladenosine sites in eukaryotic transcriptomes. Sci Rep. 2016;6(August):1–8. https://doi.org/10.1038/srep31080.


  6. Chen W, Feng P, Yang H, Ding H, Lin H, Chou KC. iRNA-3type A: identifying three types of modification at RNA’s Adenosine sites. Mol Ther - Nucleic Acids. 2018;11:468–74.


  7. Liu L, Lei X, Meng J, Wei Z. ISGm1A: integration of sequence features and genomic features to improve the prediction of human m1A RNA methylation sites. IEEE Access. 2020;8:81971–7.


  8. Sun P, Chen Y, Liu B, Gao Y, Han Y, He F, et al. DeepMRMP: A new predictor for multiple types of RNA modification sites using deep learning. Math Biosci Eng. 2019;16(6):6231–41.


  9. Xuan J, Sun W, Lin P, Zhou K, Liu S, Zheng L, et al. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2018;46(D1):D327-D334. https://doi.org/10.1093/nar/gkx934.

  10. Che D, Liu Q, Rasheed K, Tao X. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Adv Exp Med Biol. 2011;696:191–9.


  11. Malebary SJ, Alzahrani E, Khan YD. A comprehensive tool for accurate identification of methyl-Glutamine sites. J Mol Graph Model. 2022;110:108074.


  12. Naseer S, Hussain W, Khan YD, Rasool N. Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations. Anal Biochem. 2021;615:114069.


  13. Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identify Phosphoserine sites in proteins using deep learning on general pseudo amino acid compositions via modified 5-steps rule. IEEE/ACM Trans Comput Biol Bioinforma. 2020;19(3):1703–14.

  14. Butt AH, Khan YD. CanLect-Pred: A cancer therapeutics tool for prediction of target cancerlectins using experiential annotated proteomic sequences. IEEE Access. 2020;8:9520–31.


  15. Shahid M, Ilyas M, Hussain W, Khan YD. ORI-Deep: improving the accuracy for predicting origin of replication sites by using a blend of features and long short-term memory network. Brief Bioinform. 2022;23(2):bbac001.


  16. Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep. 2021;11(1):12281.


  17. Hussain W, Rasool N, Khan YD. Insights into Machine Learning-based approaches for Virtual Screening in Drug Discovery: Existing strategies and streamlining through FP-CADD. Curr Drug Discov Technol. 2021;18(4):463-72.

  18. Mahmood MK, Ehsan A, Khan YD, Chou K-C. iHyd-LysSite (EPSV): identifying hydroxylysine sites in protein using statistical formulation by extracting enhanced position and sequence variant feature technique. Curr Genomics. 2020;21(7):536–45.


  19. Barukab O, Khan YD, Khan SA, Chou K-C. DNAPred_Prot: identification of DNA-binding proteins using composition- and position-based features. Appl Bionics Biomech. 2022;2022:1–17.


  20. Akbar S, Hayat M, Iqbal M, Jan MA. iACP-GAEnsC: Evolutionary genetic algorithm based ensemble classification of anticancer peptides by utilizing hybrid feature space. Artif Intell Med. 2017;79:62–70.


  21. Suleman MT, Alkhalifah T, Alturise F, Khan YD. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ. 2022;10:e14104.


  22. Alghamdi W, Attique M, Alzahrani E, Ullah MZ, Khan YD. LBCEPred: a machine learning model to predict linear B-cell epitopes. Brief Bioinform. 2022;23(3):bbac035.


  23. Hussain W, Rasool N, Khan YD. A sequence-based predictor of Zika virus proteins developed by integration of PseAAC and statistical moments. Comb Chem High Throughput Screen. 2020;23(8):797–804.


  24. Awais M, Hussain W, Rasool N, Khan YD. iTSP-PseAAC: identifying tumor suppressor proteins by using fully connected neural network and PseAAC. Curr Bioinform. 2021;16(5):700–9.


  25. Suleman MT, Khan YD. m1A-pred: prediction of modified 1-methyladenosine sites in RNA sequences through artificial intelligence. Comb Chem High Throughput Screen. 2022;25:2473.


  26. Shah AA, Malik HAM, Mohammad A, Khan YD, Alourani A. Machine learning techniques for identification of carcinogenic mutations, which cause breast adenocarcinoma. Sci Rep. 2022;12(1):11738.


  27. Hung TNK, Le NQK, Le NH, Van Tuan L, Nguyen TP, Thi C, et al. An AI-based prediction model for drug-drug interactions in osteoporosis and Paget’s diseases from SMILES. Mol Inform. 2022;41(6):2100264.


  28. Le NQK, Nguyen TTD, Ou YY. Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties. J Mol Graph Model. 2017;73:166–78.

  29. Naseer S, Ali RF, Khan YD, Dominic PDD. iGluK-Deep: computational identification of lysine glutarylation sites using deep neural networks with general pseudo amino acid compositions. J Biomol Struct Dyn. 2021;40(22):11691-704.

  30. Malebary SJ, Khan YD. Identification of antimicrobial peptides using Chou’s 5 step rule. Comput Mater Contin. 2021;67(3):2863–81.


  31. Khan SA, Khan YD, Ahmad S, Allehaibi KH. N-MyristoylG-PseAAC: Sequence-based prediction of N-Myristoyl Glycine sites in proteins by integration of PseAAC and statistical moments. Lett Org Chem. 2018;16(3):226–34.


  32. Akbar S, Ahmad A, Hayat M, Rehman AU, Khan S, Ali F. iAtbP-Hyb-EnC: Prediction of antitubercular peptides via heterogeneous feature representation and genetic algorithm based ensemble learning model. Comput Biol Med. 2021;137:104778.


  33. Ahmad A, Akbar S, Tahir M, Hayat M, Ali F. iAFPs-EnC-GA: Identifying antifungal peptides using sequential and evolutionary descriptors based multi-information fusion and ensemble learning approach. Chemom Intell Lab Syst. 2022;222:104516.


  34. Butt AH, Alkhalifah T, Alturise F, Khan YD. A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns. Sci Rep. 2022;12(1):15183.


  35. Khan YD, Khan NS, Naseer S, Butt AH. iSUMOK-PseAAC: Prediction of lysine sumoylation sites using statistical moments and Chou’s PseAAC. PeerJ. 2021;9:e11581.


  36. Malebary SJ, Khan R, Khan YD. ProtoPred: advancing oncological research through identification of proto-oncogene proteins. IEEE Access. 2021;9:68788–97.


  37. Hassan A, Alkhalifah T, Alturise F, Khan YD. RCCC_Pred: a novel method for sequence-based identification of renal clear cell carcinoma genes through DNA mutations and a blend of features. Diagnostics. 2022;12(12):3036.


  38. Shah AA, Alturise F, Alkhalifah T, Khan YD. Evaluation of deep learning techniques for identification of sarcoma-causing carcinogenic mutations. Digit Heal. 2022;8:205520762211337.


  39. Thrun MC, Gehlert T, Ultsch A. Analyzing the fine structure of distributions. Plos One. 2020;15(10):e0238835.


  40. sklearn.preprocessing.StandardScaler. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. Cited 2020 Dec 17

  41. Arif M, Ahmed S, Ge F, Kabir M, Khan YD, Yu DJ, et al. StackACPred: Prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach. Chemom Intell Lab Syst. 2022;220:104458.


  42. Baig TI, Khan YD, Alam TM, Biswal B, Aljuaid H, Gillani DQ. Ilipo-pseaac: Identification of lipoylation sites using statistical moments and general pseaac. Comput Mater Contin. 2022;71(1):215–30.


  43. Barukab O, Khan YD, Khan SA, Chou K-C. iSulfoTyr-PseAAC: identify tyrosine sulfation sites by incorporating statistical moments via Chou’s 5-steps rule and pseudo components. Curr Genomics. 2019;20(4):306–20.


  44. Rasool N, Husssain W, Khan YD. Revelation of enzyme activity of mutant pyrazinamidases from Mycobacterium tuberculosis upon binding with various metals using quantum mechanical approach. Comput Biol Chem. 2019;83:107108.


  45. Akbar S, Hayat M, Tahir M, Khan S, Alarfaj FK. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med. 2022;131:102349.


  46. Alghamdi W, Alzahrani E, Ullah MZ, Khan YD. 4mC-RF: Improving the prediction of 4mC sites using composition and position relative features and statistical moment. Anal Biochem. 2021;633:114385.



Acknowledgements

Researchers would like to thank the Deanship of Scientific Research, Qassim University for funding publication of this project.

Funding

This research received no external funding.

Author information


Contributions

The research investigation and original draft were prepared by Muhammad Taseer Suleman. Fahad Alturise worked on the research methodology, investigation, and review of the final draft. Tamim Alkhalifah validated the results and reviewed the final draft. Yaser Daanial Khan supervised the research and reviewed the final draft.

Corresponding author

Correspondence to Fahad Alturise.

Ethics declarations

Ethics approval and consent to participate

Ethical approval and consent were not required for the current research study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.



Cite this article

Suleman, M.T., Alturise, F., Alkhalifah, T. et al. m1A-Ensem: accurate identification of 1-methyladenosine sites through ensemble models. BioData Mining 17, 4 (2024). https://doi.org/10.1186/s13040-023-00353-x
