The Particle Swarm Optimization algorithm
Kennedy J. and Eberhart R. (1995) [41] developed the Particle Swarm Optimization (PSO) as an algorithm that attempts to solve optimization problems by mimicking the behavior of a flock of birds. PSO has been used in the following fields: antennas, biomedical, city design/civil engineering, communication networks, combinatorial optimization, control, intrusion detection/cybersecurity, distribution networks, electronics and electromagnetics, engines and motors, entertainment, diagnosis of faults, the financial industry, fuzzy logic, computer graphics/visualization, metallurgy, neural networks, prediction and forecasting, power plants, robotics, scheduling, security and military, sensor networks, and signal processing [42,43,44,45,46]. Since PSO has been used in so many disparate fields, it appears robust and flexible, which gives credence to the idea that it could be used in this bioinformatics use case and many others [47]. PSO falls into the optimization taxonomy of swarm intelligence [48]. PSO works by creating a set of particles, or actors, that explore a topology and look for the global minimum of that topology [48]. At each iteration, the swarm stores the best result found by each particle as well as the best result found by the swarm as a whole. The particles explore the space with both a position and a velocity, and they change their velocity based on three parameters: the current velocity, the distance to the personal best, and the distance to the global best [48]. Position changes are made based on the calculated velocity during each iteration. The velocity update is given in Eq. (1) and the position update in Eq. (2) [49]:
$${V}_{n+1}=w*{V}_{n}+ {c}_{1}* {R}_{1}*\left( {P}_{n}- {X}_{n}\right)+ {c}_{2}* {R}_{2}*\left( {G}_{n}- {X}_{n}\right)$$
(1)
Then the position is updated as follows:
$${X}_{n+1}= {X}_{n}+ {V}_{n+1}$$
(2)
where:
- \({V}_{n}\) is the velocity at iteration \(n\).
- \({c}_{1}\) and \({c}_{2}\) are two real numbers that weight the attraction toward the particle's personal best and the swarm's global best, respectively [49].
- \({R}_{1}\) and \({R}_{2}\) are randomized values used to increase the explored terrain [49].
- \(w\) is the inertia weight parameter; it determines how much of the previous velocity carries over [41].
- \({G}_{n}\) represents the best position found by the swarm up to iteration \(n\).
- \({P}_{n}\) represents the best position found by an individual particle up to iteration \(n\).
- \({X}_{n}\) is the position of an individual particle at iteration \(n\).
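As a concrete sketch of the update rules, the following minimal Python/numpy implementation applies Eqs. (1) and (2) to a whole swarm at once. The coefficients \(c_1=0.3\) and \(c_2=2.5\) are the defaults reported later in the "Parameters estimation" section; \(w=0.5\) is an arbitrary illustrative inertia value, and `pso_step` is a hypothetical helper name, not the actual implementation.

```python
import numpy as np

def pso_step(X, V, P, G, w=0.5, c1=0.3, c2=2.5, rng=None):
    """One PSO update following Eqs. (1) and (2).

    X, V : (n_particles, n_dims) current positions and velocities
    P    : (n_particles, n_dims) personal-best positions
    G    : (n_dims,) global-best position of the swarm
    """
    if rng is None:
        rng = np.random.default_rng()
    R1 = rng.random(X.shape)  # fresh uniform draws widen the explored terrain
    R2 = rng.random(X.shape)
    V_next = w * V + c1 * R1 * (P - X) + c2 * R2 * (G - X)  # Eq. (1)
    X_next = X + V_next                                     # Eq. (2)
    return X_next, V_next

# toy swarm: 4 particles in 3 dimensions
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (4, 3))
V = np.zeros((4, 3))
P = X.copy()   # personal bests start at the initial positions
G = X[0]       # suppose particle 0 currently holds the global best
X1, V1 = pso_step(X, V, P, G, rng=rng)
```

Note that a particle sitting exactly at both its personal best and the global best, with zero velocity, does not move, which matches the intuition that the best particle only relocates once another particle improves on it.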
Why PSO
This project's rationale is that PSO could be an efficient method for optimizing Hi-C data because each particle retains its own best solution, a local minimum. This property allows sub-structures to be analyzed for optimality independently of the entire structure.
In Fig. 1, particle one is at the global best minimum found so far. However, particle two has a better structure in its top half, and it is potentially independent of the bottom half. Because particle one has a better solution so far, particle two will traverse towards the structure in particle one in the iteration \(n+1\). While particle two is traversing, it will go along a path that maintains its superior 3D model sections. Thus, it has a higher chance of finding the absolute minimum distance value. The more particles there are, the greater the time complexity of PSO and the higher the chance of finding the absolute minimum. The inherent breaking up of the problem could lend itself to powerful 3D structure creation results. More abstractly relative to Hi-C data but in the traditional PSO sense, the same problem as above might look as follows (Fig. 2) when presented in a topological map.
From Fig. 2, in the \(n\)th iteration, particle one found a local minimum within this step. Since this is the lowest point found by any particle, particle two will search towards particle one with a random amount added to its velocity [48, 50, 51]. The random chance keeps particle two from going straight to the optimal solution [48]. In this case, particle two found the absolute minimum, and from here on, all the particles will begin to migrate towards particle two. We will test this hypothesis by analyzing its output with the evaluation metrics defined in the “Methods” section.
In summary, we believe the particle-based structure of PSO may lend itself well to the problem of converting Hi-C IF data into 3D models. We will test this hypothesis and compare our results to the existing modeling methods.
PSO for 3D structure reconstruction from Hi-C data
Here we describe how we implemented the PSO algorithm as a distance-based approach for 3D genome reconstruction from Hi-C data; we call this algorithm ParticleChromo3D. In this context, the input IF data is converted to its distance equivalent using the conversion factor, \(\alpha\), for 3D structure reconstruction. These distances are sometimes called the "wish" distances [30], as they are a computational approximation of the 3D spatial distance between the underlying loci, or bins, in the chromosome. Because they are inferred from the input IF data, they are used as the ground truth for evaluating our algorithm's performance: the more closely our algorithm can reproduce these distances, the better it is. First, we initialize each particle's 3D (x, y, z) coordinates for each genomic bin, or region, randomly in the range [-1, 1]. Next, we set the stop parameters for our algorithm: the maximum number of iterations allowed and the error threshold. For each bin, we then calculate a velocity and update the position. We used the sum of squared error as the loss function to compute chromosome structures from a contact map for the evaluation performed in this work. We describe the impact of using different loss functions in the “Discussion” section.
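As a sketch of the conversion step, assuming the relation \(IF=1/{D}^{\alpha }\) used to generate the synthetic data (see the "Conversion factor test" section), the wish distances and the sum-of-squared-error loss might be computed as follows; `wish_distances` and `sse_loss` are illustrative helper names, not the actual ParticleChromo3D API.

```python
import numpy as np

def wish_distances(IF, alpha=1.0):
    """Convert an interaction-frequency matrix to "wish" distances by
    inverting IF = 1 / D**alpha, i.e. D = (1 / IF)**(1 / alpha).
    Entries with IF == 0 (no observed contact) are left at 0 here; how
    missing contacts are handled is an implementation choice."""
    IF = np.asarray(IF, dtype=float)
    D = np.zeros_like(IF)
    nz = IF > 0
    D[nz] = (1.0 / IF[nz]) ** (1.0 / alpha)
    return D

def sse_loss(d_model, D_wish):
    """Sum of squared errors between model distances and wish distances."""
    return float(np.sum((np.asarray(d_model) - np.asarray(D_wish)) ** 2))
```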
Finally, following the description of the PSO algorithm above, we used PSO to iteratively improve our loss function until it converges on either an absolute or a local minimum. The complete ParticleChromo3D algorithm is presented in Fig. 3. Some parameters are needed to use the PSO algorithm for 3D structure reconstruction. This work provides the parameter values that produced our algorithm's optimal results; users can also provide their own settings to fit their data where necessary. The results of the series of tests and validation performed to determine the default parameters are described in the "Parameters Estimation" section of the Methods.
Particle representation
A particle is a candidate solution. A list of (x, y, z) coordinates represents each particle in the solution; the candidate solution's length is the number of regions in the input Hi-C data. Each point of a particle is the individual (x, y, z) coordinate of one bead. A swarm consists of N candidate solutions, where N, also called the swarm size, is provided by the user as program input. We explain how to determine the swarm size in the "Parameters Estimation" section below.
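A minimal sketch of this representation, using the reported default swarm size of 15 and the 610 bins of the synthetic dataset, with coordinates initialised uniformly in [-1, 1] as described above (`init_swarm` is an illustrative helper, not the actual ParticleChromo3D API):

```python
import numpy as np

def init_swarm(n_bins, swarm_size=15, rng=None):
    """A swarm of candidate structures: each particle is a list of
    (x, y, z) coordinates, one triple per genomic bin, drawn uniformly
    from [-1, 1]. Swarm size 15 is the reported default."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(-1.0, 1.0, size=(swarm_size, n_bins, 3))

# the 610-bin synthetic yeast dataset with the default swarm size
swarm = init_swarm(n_bins=610, swarm_size=15, rng=np.random.default_rng(1))
```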
Metrics used for evaluation
To evaluate the structure’s consistency with the input Hi-C matrix, we used the metrics below. All these metrics are represented in terms of distance. The evaluations were performed on the pairwise Euclidean distances between each chromosome locus corresponding to the structure generated by ParticleChromo3D and the pairwise distance of the underlying chromosome generated from the input Hi-C data.
Distance Pearson Correlation Coefficient (DPCC)
The Distance Pearson correlation coefficient is as follows [10],
$$PCC=\frac{\sum (\left({d}_{i}-\overline{d }\right)*\left({D}_{i}-\overline{D }\right))}{\sqrt{\sum {\left({d}_{i}-\overline{d }\right)}^{2}* \sum {\left({D}_{i}-\overline{D }\right)}^{2}}}$$
where:
- \({D}_{i}\) and \({d}_{i}\) are instances of a distance value between two bins.
- \(\overline{D}\) and \(\overline{d}\) are the means of the distances within each data set.
- DPCC measures the strength of the linear relationship between the two sets of distances. Values range from -1 to +1.
- A higher value is better.
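As an illustration, DPCC can be computed from a structure's coordinates as follows; `pairwise_distances` and `dpcc` are hypothetical helper names for this sketch.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of bins in one structure."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dpcc(d, D):
    """Distance Pearson correlation over the upper triangle of the two
    distance matrices, so each bin pair is counted once."""
    iu = np.triu_indices(d.shape[0], k=1)
    return float(np.corrcoef(d[iu], D[iu])[0, 1])

# toy check: a structure correlates perfectly with a rescaled copy of itself
coords = np.array([[0.0, 0, 0], [1, 0, 0], [0, 2, 0], [3, 1, 2]])
d_model = pairwise_distances(coords)
```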
Distance Spearman Correlation Coefficient (DSCC)
The Distance Spearman’s correlation coefficient is defined below [10],
$$SCC= \frac{\sum \left(x{}_{i}-\overline{x }\right)*({y}_{i}-\overline{y })}{\sqrt{\sum {\left({x}_{i}-\overline{x }\right)}^{2}}*\sqrt{\sum {\left({y}_{i}-\overline{y }\right)}^{2}}}$$
where:
- \({x}_{i}\) and \({y}_{i}\) are the ranks of the distances \({D}_{i}\) and \({d}_{i}\) described in the DPCC equation above.
- \(\overline{x}\) and \(\overline{y}\) are the sample mean ranks of x and y, respectively.
- Values range from -1 to +1. A higher value is better.
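A minimal rank-based sketch of DSCC, assuming no tied distances (production code would use average ranks for ties); `dscc` is a hypothetical helper name.

```python
import numpy as np

def dscc(d, D):
    """Distance Spearman correlation: Pearson correlation of the ranks
    of the two flattened distance vectors. The double-argsort ranking
    assumes no tied distances."""
    rx = np.argsort(np.argsort(np.asarray(d))).astype(float)
    ry = np.argsort(np.argsort(np.asarray(D))).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Because only ranks matter, any monotone (not necessarily linear) relationship between the two distance sets yields a DSCC of 1.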
Distance Root Mean Squared Error (DRMSE)
The Distance Root mean squared error follows the equation below [10],
$$RMSE= \sqrt{\frac{1}{n}*\sum {({d}_{i}-{D}_{i})}^{2}}$$
where \({d}_{i}\) and \({D}_{i}\) are as defined in the DPCC equation above and \(n\) is the number of distance pairs. A lower value is better.
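A direct transcription of the DRMSE formula (`drmse` is an illustrative helper name):

```python
import numpy as np

def drmse(d, D):
    """Root mean squared error between corresponding model and wish
    distances, taken here over flattened distance vectors."""
    d, D = np.asarray(d, dtype=float), np.asarray(D, dtype=float)
    return float(np.sqrt(np.mean((d - D) ** 2)))
```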
TM-Score
TM-Score is defined as follows [52, 53],
$$TM-score= MAXIMUM\left[\frac{1}{{L}_{Target}}*{\sum }_{i}^{{L}_{ali}}\frac{1}{1+{(\frac{{\mathrm{d}}_{i}}{{d}_{0}*{L}_{Target}})}^{2}}\right]$$
where:
- \({L}_{Target}\) is the length of the chromosome.
- \({d}_{i}\) is an instance of a distance value between two bins.
- \({L}_{ali}\) represents the count of all aligned residues.
- \({d}_{0}\) is a normalizing parameter.
The TM-score is a metric to measure the structural similarity of two proteins or models [52, 53]. A TM-score value can be between (0,1], where 1 indicates two identical structures [52]. A score of 0.17 indicates pure randomness, and a score above 0.5 indicates the two structures have mostly the same folds [53]. Hence, the higher the score, the better.
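A direct transcription of the printed TM-score formula for a single alignment (a sketch: the outer MAXIMUM over alignments is omitted, \(L_{ali}\) is taken as the length of the distance vector, and d0 = 0.5 is an assumed illustrative value, not a published default):

```python
import numpy as np

def tm_score(d, L_target, d0=0.5):
    """TM-score as printed above for one alignment: the mean, over
    aligned pairs, of 1 / (1 + (d_i / (d0 * L_target))**2), scaled by
    1 / L_target."""
    d = np.asarray(d, dtype=float)
    return float(np.sum(1.0 / (1.0 + (d / (d0 * L_target)) ** 2)) / L_target)
```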
Mean Squared Error (MSE)
The distance MSE (dMSE) is defined below [54]:
$$dMSE= \frac{1}{n}*\sum {({d}_{i}-{D}_{i})}^{2}$$
where \({d}_{i}\) and \({D}_{i}\) are as defined in the DPCC equation above and \(n\) is the number of distance pairs. A lower value is better.
Huber loss
Huber loss is defined below [55]:
$$Huber\ Loss=\left\{\begin{array}{ll}\frac{1}{2}*{d}^{2}, & \left|d\right|\le \alpha \\ \alpha *\left(\left|d\right|-\frac{1}{2}*\alpha \right), & \left|d\right|>\alpha \end{array}\right.$$
where:
- \(d\) is the difference between corresponding distance values from the two data sources.
- \(\alpha\) is a positive real number that sets the transition point between the quadratic and linear branches of the loss. We varied \(\alpha\) with values of 0.1, 0.5, and 0.9.
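The piecewise definition translates directly to code (a sketch; alpha = 0.5 is one of the three transition points tested above):

```python
import numpy as np

def huber(d, alpha=0.5):
    """Huber loss from the piecewise definition above: quadratic for
    |d| <= alpha, linear beyond. The two branches meet at |d| = alpha,
    so the loss is continuous."""
    d = np.abs(np.asarray(d, dtype=float))
    return np.where(d <= alpha, 0.5 * d ** 2, alpha * (d - 0.5 * alpha))
```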
Data
Our study used the yeast synthetic, or simulated, dataset from Adhikari et al., 2016 [27] to perform parameter tuning and validation. The simulated dataset was created from a yeast structure for chromosome 4 at 50 kb resolution [56]. The number of genome loci in the synthetic dataset is 610. For analysis on a real dataset, we used the GM12878 cell line Hi-C dataset, GEO accession number GSE63525 [57]. The normalized contact matrix was downloaded from the GSDB database with GSDB ID OO7429SF [58].
Parameters estimation
We used the yeast synthetic dataset to decide on ParticleChromo3D's best parameters. We used this dataset to investigate the mechanism for choosing the best alpha conversion factor for input Hi-C data, to determine the optimal swarm size, the best threshold value for the algorithm, the inertia value (\(w\)), and the best coefficients for our PSO velocity (\({c}_{1}\) and \({c}_{2}\)). We evaluated our reconstructed structures by comparing them with the synthetic dataset's true distance structure provided by Adhikari et al., 2016 [27], using the DPCC, DSCC, DRMSE, and TM-score metrics. Based on the evaluation results, the default values for the ParticleChromo3D parameters are set as presented below:
Conversion factor test (\(\mathrm{\alpha }\))
The synthetic interaction frequency dataset was generated from a yeast structure for chromosome 4 at 50 kb [38] with an \(\alpha\) value of 1 using the formula \(IF=1/{D}^{\alpha }\). The relevance of this test data is therefore to check whether our algorithm can recover the alpha value used to produce the synthetic dataset. For both DPCC and DSCC, our algorithm performed best at a conversion factor (alpha) of 1.0 (Additional file 1: Figure S1). By default, our algorithm searches for the best alpha value in the range [0.1, 1.5]. A side-by-side comparison of the true simulated (yeast) structure and the structure reconstructed by ParticleChromo3D shows that they are highly similar (Additional file 1: Figure S2).
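The alpha search can be sketched as a grid search over [0.1, 1.5]: convert the IF matrix to wish distances at each candidate alpha, reconstruct a structure, and keep the alpha that scores best. Here `reconstruct` and `score` are hypothetical stand-ins for the PSO solver and an accuracy metric such as DPCC.

```python
import numpy as np

def best_alpha(IF, reconstruct, score, alphas=np.arange(0.1, 1.6, 0.1)):
    """Grid search for the conversion factor: convert IF to wish
    distances with each alpha via D = (1/IF)**(1/alpha), rebuild a
    structure, and keep the alpha whose structure scores best
    (higher score = better)."""
    best_a, best_s = None, -np.inf
    for a in alphas:
        safe = np.where(IF > 0, IF, np.inf)   # 1/inf -> 0 for absent contacts
        D = (1.0 / safe) ** (1.0 / a)         # wish distances for this alpha
        s = score(reconstruct(D), D)
        if s > best_s:
            best_a, best_s = a, s
    return best_a, best_s

# toy check: an IF matrix generated with alpha = 1 should be recovered
D_true = np.array([[0.0, 2, 3], [2, 0, 4], [3, 4, 0]])
IF = np.where(D_true > 0, 1.0 / np.where(D_true > 0, D_true, np.inf), 0.0)
a_hat, _ = best_alpha(IF, reconstruct=lambda D: D_true,
                      score=lambda m, D: -np.sqrt(np.mean((m - D) ** 2)))
```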
Swarm size
The swarm size defines the number of particles in the PSO algorithm. We evaluated the performance of ParticleChromo3D with changes in swarm size (Additional file 1: Figure S3A, S3B, S3C). We also evaluated the effect of an increase in swarm size on computational time (Additional file 1: Figure S3D). Our results show that computational time increases with swarm size. Given the computational cost and the algorithm's performance at various swarm sizes, we set a swarm size of 15 as the default value for this parameter. According to our experiments, a swarm size of 10 is most suitable if the user's priority is saving computational time, and a swarm size of 20 is suitable when the user prefers algorithm performance over speed. Hence, the default swarm size of 15 gives us the best of both worlds. The structures generated by ParticleChromo3D also show that the results at swarm sizes 15 (Additional file 1: Figure S4C) and 20 (Additional file 1: Figure S4D) are most similar to the simulated data's true structure represented in Additional file 1: Figure S2A.
Threshold: optimal parameter to determine structure stability
The threshold parameter is designed to serve as an early stopping criterion if the algorithm converges before the maximum number of iterations is reached. Hence, we evaluated the effect of varying threshold levels using the evaluation metrics (Additional file 1: Figure S5). The output structures generated at each threshold also allow a visual examination of its effect (Additional file 1: Figure S6). We observed that the lower the threshold, the more accurate the structure (Additional file 1: Figure S5) and the more similar it is to the true simulated data structure in Additional file 1: Figure S2A (Additional file 1: Figure S6F). It is worth noting that this has a running time implication: reducing the threshold led to a longer running time. Since the trade-off was between a superior result with a longer running time and a reasonably good result with a short running time, we chose the former for ParticleChromo3D. The default threshold for our algorithm, used as the convergence control parameter, is 0.000001.
Confidence coefficient (\({c}_{1}\) and \({c}_{2}\))
The \({c}_{1}\) and \({c}_{2}\) parameters represent the local (particle) and global (swarm) confidence coefficients, respectively. Kennedy and Eberhart, 1995 [41] proposed \({c}_{1}={c}_{2}=2\). We tested how changes to these values affected our algorithm's accuracy, varying the local confidence coefficient (\({c}_{1}\)) from 0.3 to 0.9 and the global confidence coefficient (\({c}_{2}\)) from 0.1 to 2.8 (Additional file 1: Figures S9 and S10). From our results, we found that a local confidence coefficient (\({c}_{1}\)) of 0.3 with a global confidence coefficient (\({c}_{2}\)) of 2.5 performed best (Additional file 1: Figure S7). Hence, these values were set as ParticleChromo3D's confidence coefficient values. The accuracy results for all local confidence coefficients (\({c}_{1}\)) at varying global confidence values are compiled in Additional file 1: Figure S8.
Random numbers (\({R}_{1}\) and \({R}_{2}\))
\({R}_{1}\) and \({R}_{2}\) are uniform random numbers between 0 and 1 [59].