An important element of our strategy for constructing genetic linkage maps is problem decomposition: breaking large problems into smaller ones. This is particularly important if we want to use an exact solver for the TSP. An exact solver might perform reasonably for 2000 markers, but then require an unreasonable amount of time to solve a TSP with 10,000 markers. A natural form of problem decomposition is to separate groups of markers that lie on different chromosomes. For example, constructing the minimal spanning tree may not determine the ordering of all of the genetic markers, but examining it may reveal with high probability that certain groups of markers belong to different linkage groups. In other cases, we may run a heuristic TSP solver, which is less sensitive to problem size, to generate an initial high-quality solution; from this initial solution, we may determine that different groups of markers either are, or might be, on different linkage groups. By separating the markers into different groups, we can solve each group as a separate instance of a TSP, and then reassemble the solutions for each group into one overall solution. Thus, the proposed solver uses heuristic methods to generate initial solutions to the TSP instances, and then uses an exact solver on groups of markers that are clearly on the same linkage group.
TSP solvers
TSPmap uses two TSP solvers. Lin-Kernighan-Helsgaun (LKH) is a heuristic local search algorithm based on the Lin-Kernighan algorithm [19, 20]. LKH is able to run in significantly less time than an exact solver. The second is Concorde, an exact solver based on the branch-and-bound method, a technique used to prune the search space and avoid exploring areas of the state space that are guaranteed not to improve on an already existing result [18]. Because Concorde is an exact method, it is guaranteed to find the optimal TSP solution. This results in much longer run times than LKH, especially as the problem size (number of markers) grows. In TSPmap, LKH is used in the early stages of the mapping process, namely to identify and separate the linkage groups. Once these groups are identified, the final order of markers on each linkage group is determined using the exact solver Concorde (Fig. 1).
TSPmap algorithm
Computation of pair-wise recombination frequency (rf)
TSPmap uses a matrix of recombination frequencies (rf) between markers to identify linkage groups and order markers. In mapping populations where genotypes are almost entirely homozygous, such as those composed of recombinant inbred lines (RILs) or generated through double haploidization, TSPmap includes a pipeline for quickly generating rf matrices. First, genotype data in a matrix format are filtered by removing duplicate and heterozygous markers, as well as markers above a user-defined threshold of similarity across all individuals. Since smaller rf values indicate tighter linkage between markers, the smaller recombination frequency values are the most relevant to the construction of the linkage map. The larger values represent unlinked markers and thus are of limited utility. Because the performance of the TSP solver is slowed by the inclusion of these values, the user may input a cutoff threshold (default 0.4) above which all rf values are inflated to 0.5, preventing the solver from spending computation time optimizing these non-informative values. However, setting the threshold too low could begin to affect the clustering and the final ordering of markers within the linkage group. This is especially true for noisy datasets, where relatively high rf values may still be informative.
Missing data can cause recombination frequencies to be underestimated or overestimated, depending on whether the missing call represents a difference or similarity between the two markers. To account for this uncertainty, recombination frequency values for markers with missing data are adjusted to lie at the midpoint of the possible rf value range. This prevents the TSP solver from erroneously linking pairs of markers because of missing calls in the data set.
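The rf computation described above can be sketched as follows. This is a minimal illustration, not TSPmap's actual (C-based) implementation: it assumes 0/1-coded homozygous genotypes with `np.nan` for missing calls, places pairs with missing data at the midpoint of their possible rf range, and inflates values above the cutoff to 0.5.

```python
import numpy as np

def pairwise_rf(genos, cutoff=0.4):
    """Estimate pairwise recombination frequencies for a homozygous
    (e.g. RIL) population. `genos` is individuals x markers, coded
    0/1 with np.nan for missing calls. Sketch only, not TSPmap's code."""
    n_ind, n_markers = genos.shape
    rf = np.zeros((n_markers, n_markers))
    for i in range(n_markers):
        for j in range(i + 1, n_markers):
            both = ~np.isnan(genos[:, i]) & ~np.isnan(genos[:, j])
            n_miss = n_ind - both.sum()
            diff = np.sum(genos[both, i] != genos[both, j])
            # A missing call could be either a match or a mismatch, so
            # place the estimate at the midpoint of the possible range.
            r = (diff + 0.5 * n_miss) / n_ind
            if r > cutoff:
                r = 0.5  # non-informative: treat the pair as unlinked
            rf[i, j] = rf[j, i] = r
    return rf
```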
TSPmap can also take as input rf matrices generated by R/QTL [21] for different types of mapping populations, including those with large numbers of heterozygous markers, such as F2 intercrosses and backcrosses. These R/QTL formatted rf matrices can then be used for downstream steps in the TSPmap algorithm. However, it should be noted that creating rf matrices in R/QTL can be time intensive with large numbers of markers, and may slow the overall workflow of generating linkage maps with TSPmap.
Clustering to form linkage groups
In the typical TSP formulation, the solution is a Hamiltonian cycle. That is, the solution is a tour in which each vertex is visited exactly once and ends at the beginning vertex, thus forming a complete cycle. In the case of a genetic linkage map, we have no need to consider the linkage between the last marker in the linkage group and the first. In fact, allowing the TSP to run in this standard configuration will negatively impact the result, since the algorithm will include the value of this last edge when attempting to minimize the total tour cost, and the rf value between the beginning node and ending node is expected to be large since they lie at opposite ends of the linkage group. To eliminate this effect we convert the TSP into a Hamiltonian path problem (HPP). To accomplish this, we introduce into the TSP weight matrix a dummy vertex that has a zero-weight connection to all other vertices. The inclusion of this vertex allows the TSP solver to connect the last vertex in the tour with the first without incurring the large-weight penalty associated with the rf value between two distant vertices, in essence allowing the solver to construct a non-cyclic path [22].
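The dummy-vertex reduction can be sketched as follows. This is a minimal illustration of the standard TSP-to-HPP transformation described above, not TSPmap's own code: a zero-weight vertex is appended, and deleting it from the optimal tour leaves an optimal non-cyclic path.

```python
import numpy as np

def add_dummy_vertex(weights):
    """Append a dummy vertex with zero-weight edges to every other
    vertex, converting a TSP weight matrix into a Hamiltonian path
    instance. Removing the dummy from the resulting optimal tour
    yields an optimal path over the original vertices."""
    n = weights.shape[0]
    hpp = np.zeros((n + 1, n + 1))
    hpp[:n, :n] = weights  # dummy row/column stay all-zero
    return hpp
```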
Using the recombination frequency matrix, the first step is to connect the unordered dataset into a minimum spanning tree and then break the spanning tree into large clusters that may contain markers from more than one linkage group. This is because we only want to break the spanning tree at a location that has a very high probability of being a transition between chromosomes. To further aid decomposition, the user inputs an initial estimate of the number of linkage groups (chromosomes), k, into the algorithm as a parameter. Along with the number of markers, m, this is used to define the minimum size of a cluster (s) according to the formula
By calculating s in this way the algorithm is meant to penalize small clusters, but allow for the possibility of large variation in physical size and marker density among chromosomes. However, we acknowledge that the assigned minimum chromosome size may not hold true in species with extreme variation in chromosome size, e.g. birds and some other vertebrates.
The algorithm begins by constructing the minimum spanning tree and breaking it apart at the largest 1.5*k recombination frequency values. If this does not produce k significant clusters, the next-largest rf value is added to the list of cut points between clusters. This is repeated until at least k significant clusters have been produced. In this way, more cuts are made than simply breaking into the number of known chromosomes, because some chromosomes might not neatly form a single linkage group, while other pairs of linkage groups might lack a clean break between them. Therefore, the algorithm is designed to make more cuts, and later test which mergers are best.
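The first cutting step above can be sketched as follows. This is an assumed, simplified rendering rather than TSPmap's implementation: Prim's algorithm builds the minimum spanning tree from the rf matrix, the ceil(1.5*k) heaviest tree edges are deleted, and the surviving connected components become the candidate clusters (the iterative refinement to reach k significant clusters is omitted).

```python
import math

def initial_clusters(rf, k):
    """Build an MST over the dense rf matrix and cut its ceil(1.5*k)
    largest edges; return a component label for each marker.
    Sketch of the decomposition step, not TSPmap's code."""
    n = len(rf)
    # --- Prim's algorithm, grown from vertex 0 ---
    in_tree = [False] * n
    in_tree[0] = True
    best = [(rf[0][j], 0) for j in range(n)]  # (cost, parent) per vertex
    edges = []
    for _ in range(n - 1):
        v = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best[j][0])
        edges.append((best[v][0], best[v][1], v))
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j] and rf[v][j] < best[j][0]:
                best[j] = (rf[v][j], v)
    # --- cut the heaviest MST edges (likely chromosome boundaries) ---
    edges.sort(reverse=True)
    kept = edges[math.ceil(1.5 * k):]
    # --- label connected components with union-find ---
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, a, b in kept:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]
```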
Next, each cluster is run through the approximate TSP solver, LKH, and checked for rf values that exceed a user-specified threshold. When such a value is detected, the cluster is broken apart to ensure that the final clusters do not contain markers from more than one linkage group (Fig. 2). This allows the user to control the sensitivity of the linkage group formation by setting what recombination frequency values may exist within a linkage group.
Once the markers have been broken into well-separated clusters, the next step is to merge clusters that belong together as a single linkage group. Clusters are processed in order of increasing size, based on the logic that a small cluster is more likely to be merged with a larger cluster. Each cluster is processed through three stages to determine with which cluster, if any, it should be merged.
The first stage involves examination of the rf matrix to see if the cluster can be directly merged with another without any further analysis. The second stage involves combining clusters pairwise and processing them with LKH to see if the Hamiltonian path clearly indicates whether the clusters should be merged, based on a merger having a maximum rf less than half that of the next smallest merger (Fig. 3). Finally, if a cluster does not pass the criterion for being merged in stage 2, the third stage attempts to verify which cluster it belongs to using a reciprocal matching technique against the most likely candidate cluster. If a cluster does not meet any of these criteria, it is considered to be a complete linkage group.
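The stage-2 decision rule can be sketched as a small helper. This is a hypothetical illustration of the criterion only (the LKH runs that produce each candidate merger's maximum junction rf are assumed to have happened upstream), not TSPmap's code.

```python
def clear_best_merger(junction_rfs):
    """Given, for each candidate partner cluster, the maximum rf
    observed along the LKH path of the pairwise merger, accept the
    best merger only when its value is less than half that of the
    next smallest candidate; otherwise defer to stage 3 (None)."""
    ranked = sorted(junction_rfs.items(), key=lambda kv: kv[1])
    if len(ranked) >= 2 and ranked[0][1] < 0.5 * ranked[1][1]:
        return ranked[0][0]
    return None
```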
Ordering of markers
The final clusters (linkage groups) are processed using Concorde to produce the optimal linkage map for the given data. Because Concorde is an exact solver, it is able to find the exact TSP solution for the order of markers within each linkage group. Finding an exact solution does not guarantee that the order of the markers is absolutely correct; instead, the exact solution is the best solution possible given the available recombination frequencies. Experimental datasets can contain a few markers that are highly erroneous across the entire population. Once TSPmap has determined the marker order, these highly erroneous markers can be identified and dropped using tools available in R/QTL (i.e. the droponemarker function) [21] prior to subsequent analyses such as QTL mapping.
Implementation in R
This procedure was implemented in R, version 3.2.5 [23]. The algorithms that identify duplicate markers and compute the recombination frequency matrix were implemented in C and are called via R’s built-in C interface. The TSPmap R package is freely available and includes a user tutorial vignette and example datasets.
Simulations, comparative mapping algorithms and measures of performance
Datasets representing a mapping population of 300 recombinant inbred lines were simulated using the R/QTL package [21]. Datasets were generated in a factorial design in which each dataset comprised a total number of markers (m) of 1000 or 4000, evenly distributed across five linkage groups, or 10,000 distributed across 10 linkage groups; a genotype error rate (η) of 0.0, 0.01, or 0.05; and a missing genotype rate (γ) of 0.0, 0.05, or 0.10. Five replicates of each configuration were created.
For each simulated marker dataset we generated linkage maps with TSPmap. An example script for implementation of TSPmap with these data can be found in Additional file 1: File S1. For comparison, datasets were also run on JoinMap 4.0 [12] and MSTmap [13]. For JoinMap, the maximum likelihood algorithm was used, as the regression method gave unreasonably long runtimes for the 4000-marker datasets, and the results for 1000-marker datasets were determined to be higher quality using the maximum likelihood algorithm. Default parameters were used, except values were increased to 5000 for chain length, 10,000 for number of chains without improvement before stopping, and 1000 for chain length per Monte Carlo EM cycle. Additionally, we created linkage maps with these simulated data using MSTmap, which uses a minimum spanning tree algorithm [13]. The exact settings used to create linkage maps with simulated datasets in MSTmap can be found in Additional file 2: File S2, which is an input file for a simulated dataset into MSTmap.
To quantitatively measure the quality of a solution, we compared the final number of linkage groups (chromosomes), c, to the true number of chromosomes in the simulated data. Additionally, we calculated the number of erroneous pairs, E, in each solution: the number of marker pairs that appear in reversed order in their estimated positions relative to their true positions in the simulated data [13].
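Counting erroneous pairs amounts to counting order inversions between the estimated and true marker orders. A minimal sketch follows; the min over both orientations is our added assumption (a linkage group's left-to-right orientation is arbitrary), and the quadratic pair scan is for clarity rather than speed.

```python
def erroneous_pairs(estimated, truth):
    """Return E, the number of marker pairs whose relative order in
    `estimated` is reversed with respect to `truth`. The orientation
    of the estimated map giving the smaller count is reported."""
    pos = {m: i for i, m in enumerate(truth)}  # true rank of each marker
    def inversions(order):
        ranks = [pos[m] for m in order]
        return sum(1 for i in range(len(ranks))
                   for j in range(i + 1, len(ranks))
                   if ranks[i] > ranks[j])
    return min(inversions(estimated), inversions(estimated[::-1]))
```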