Sparse self-attention aggregation networks for neural sequence slice interpolation
BioData Mining volume 14, Article number: 10 (2021)
Abstract
Background
Microscopic imaging is a crucial technology for visualizing neural and tissue structures. Large-area defects inevitably occur during the imaging process of electron microscope (EM) serial slices, which degrade registration and semantic segmentation and affect the accuracy of 3D reconstruction. The continuity of biological tissue among serial EM images makes it possible to recover missing tissues utilizing inter-slice interpolation. However, large deformation, noise, and blur among EM images make the task challenging. Existing flow-based and kernel-based methods can only perform frame interpolation on images with little noise and low blur. They also cannot effectively deal with large deformations on EM images.
Results
In this paper, we propose a sparse self-attention aggregation network to synthesize pixels following the continuity of biological tissue. First, we develop an attention-aware layer for consecutive EM image interpolation that implicitly adopts global perceptual deformation. Second, we present an adaptive style-balance loss that takes the style differences of serial EM images, such as blur and noise, into consideration. Guided by the attention-aware module, adaptively synthesizing each pixel aggregated from the global domain further improves the performance of pixel synthesis. Quantitative and qualitative experiments show that the proposed method is superior to state-of-the-art approaches.
Conclusions
The proposed method can be considered as an effective strategy to model the relationship between each pixel and other pixels from the global domain. This approach improves the algorithm’s robustness to noise and large deformation, and can accurately predict the effective information of the missing region, which will greatly promote the data analysis of neurobiological research.
Background
Inter-slice interpolation is important in electron microscope (EM) image analysis. The destruction of biological tissues during sample preparation and EM imaging can cause large-area defects in serial EM images. Recent methods based on context information are effective for small-area defects but cannot handle large ones. The continuity of the biological tissue across serial slices contributes to predicting the missing information, whereas using only the spatial information of a single EM image fails. To date, there are only a few reported works in the field of sequence slice interpolation [1–3], and EM image restoration methods are not effective when dealing with large-area defects. However, interpolation methods can accurately predict non-defective intermediate frames, which can replace original intermediate frames with large-area defects. Besides, non-defective intermediate images contribute to improving registration under sudden and significant structural changes, improving semantic segmentation accuracy [4], and ensuring 3D reconstruction continuity [5].
For optical images, with the development of deep learning, frame interpolation has gone through five stages: simple CNN-based methods [6], deep voxel flow-based methods [7], kernel-based methods [8, 9], motion-based methods [10, 11], and depth-based methods [12]. Long et al. [6] first adopted a generic CNN-based network to synthesize the intermediate frame directly. However, the results suffer from severe blurriness since a generic CNN cannot capture the multimodal distribution of optical images and videos. Then, Liu et al. [7] proposed deep voxel flow to warp input frames based on trilinear sampling. Although the intermediate frames generated from voxel flow exhibit little blurriness, flow estimation remains challenging for large motion.
Instead of adopting optical flow to handle significant motion, Niklaus et al. [8, 9] proposed spatially-adaptive interpolation kernels to synthesize pixels from a large neighborhood. However, these kernel-based methods only build dependencies from local areas and typically require heavy computation as the kernel size increases. Then, Bao et al. [11] integrated kernel-based and flow-based approaches into an end-to-end network to benefit from both sides. Recently, Bao et al. [12] further introduced depth estimation to the previous work, which explicitly deals with occlusion. Existing flow-based methods utilize kernel estimation to improve the precision and robustness of single-pixel synthesis. However, pixels synthesized by kernel estimation only consider local neighborhood information. In general, existing interpolation methods deal with occlusion and significant motion by adopting depth maps [13–15], optical flow, and local interpolation kernels. However, on EM images with large deformation, drift, and abundant noise, estimating an accurate optical flow field [16–18] suitable for sequential EM images remains a challenge. Furthermore, the ultimate goal is to synthesize high-quality intermediate frames without defects, and optical flow estimation is used only as an intermediate step. Kernel-based methods [8, 9] behave well owing to combining flow estimation and pixel synthesis into a single step. Kernel estimation synthesizes the intermediate-frame pixels through a spatial kernel based on traditional convolution. However, it cannot establish dependencies over the global domain, and when the spatial kernel is expanded to the input image size, the computation and memory complexity is no lower than that of the original self-attention mechanism.
Focusing on finding more accurate deformation fields on EM images, early researchers also proposed a series of traditional slice interpolation methods, including shape-based methods [19], morphology-based methods [20], and registration-based methods [21]. However, these conventional methods rely on the essential assumption that structural changes are sufficiently small, which makes them unsuitable for sparsely sampled slices. Recently, with the development of deep learning [22–24], several CNN-based slice interpolation methods have appeared. The authors of [1] proposed a simple convolutional autoencoder for binary image interpolation. Then, Nguyen et al. [3] leveraged slice interpolation to improve registration, where interpolation was only an auxiliary part of registration. Wu et al. [2] proposed an intermediate slice synthesis model for boosting medical image segmentation accuracy. That slice synthesis model was based on the kernel estimation method [9] and cannot handle noise and blur differences, large deformation, or drift among EM images.
Recently, the self-attention mechanism has become an integral part of many models, establishing a global dependency for each position. Self-attention, also called intra-attention, was originally proposed to calculate the response at a position in a sequence; it was first plugged into machine translation [25], achieving state-of-the-art results. Parmar et al. [26] proposed an Image Transformer model adopting self-attention for image generation. Wang et al. [27] proposed non-local operations to model spatial-temporal dependencies in various computer vision tasks, e.g., video classification, object detection, and instance segmentation. Recently, some researchers [28–30] applied a similar mechanism to semantic segmentation and achieved good performance.
Despite this progress, self-attention has not been applied to neural sequence slice interpolation. Inspired by the works above, we propose a simple and efficient multi-level sparse strategy that decomposes the original affinity matrix of the self-attention mechanism into the product of two sparse affinity sub-matrices, and we apply an interlacing mechanism to group pixels with long spatial interval distances together for long-range attention. If the size of a sparse affinity sub-matrix is larger than a threshold, the sub-matrix is decomposed again in the same way. Notably, the concurrent works Sparse Transformer [31] and Interlaced Sparse Self-Attention [30] adopt similar factorization schemes to improve the efficiency of self-attention on sequential tasks and semantic segmentation, while we focus on consecutive EM image interpolation. In contrast, we implicitly detect global deformation and integrate pixels from global dependency by utilizing the self-attention information in the attention-aware layer. Furthermore, we utilize the multi-level sparse strategy to further improve the computational efficiency of the self-attention mechanism. Moreover, we replace traditional kernel estimation with the proposed attention-aware layer to synthesize pixels from global dependency.
To address the problems above, we introduce a simple and efficient solution named the attention-aware layer (AAL). The AAL perceives all positions in the input frames and then synthesizes each pixel of the intermediate frame according to the attention maps. In more detail, the AAL learns to focus on global deformation without additional supervision, implicitly considering all positions in the input frames to generate each pixel of the middle frame. In this way, optical flow field extraction and kernel estimation can be reasonably removed while maintaining intermediate frame accuracy. Besides, a two-level sparse self-attention mechanism in the AAL decreases the computation and memory complexity substantially. Considering the style differences in the degree of noise and blur between serial EM images, we also propose an adaptive style-balance loss, which strengthens the supervision of the input frames and ensures a natural transition across three consecutive frames. As a result, our proposed approach performs better than other methods on EM images.
In this paper, we propose a novel AAL that can replace the kernel estimation layer and establish a global dependence for each pixel. Additionally, we explore the effects of different loss functions on the EM image interpolation task; in particular, we propose an adaptive style balance loss. The main contributions can be roughly grouped into three directions.

We present an attention-aware layer to capture dense long-range dependencies for each pixel with lower memory and computation consumption. The proposed module improves performance compared to kernel-based and flow-based methods. Moreover, our approach combines flow estimation and kernel estimation into a single step.

We propose the style balance loss to handle differences in style among three consecutive input EM images. The proposed loss not only guides the style of the generated intermediate frame to be closer to the ground truth but also utilizes the styles of the front and rear frames to strengthen the constraints on the intermediate-frame style. We show that using the front and rear frame styles for supervision can better generate intermediate frames with natural transitions.

Based on the CREMI dataset, provided by the MICCAI 2016 Challenge as serial section transmission electron microscopy (ssTEM) images, we generate a new dataset, named cremi_triplet, for the task of EM image interpolation. Besides, we also generate a dataset named mouse_triplet for interpolation based on automatic tape-collecting ultramicrotome (ATUM) mouse brain data. Experimental results demonstrate the effectiveness of the attention-aware layer on EM image interpolation, which is superior to kernel-based and flow-based methods.
Materials and methods
In this paper, we propose a sparse self-attention aggregation network (SSAN) for EM image interpolation. An overview of the proposed attention-aware interpolation algorithm is shown in Fig. 1; it is primarily based on a siamese residual dense network, an attention-aware layer, and a hybrid network. Given two input frames I_{t−1} and I_{t+1}, the goal is to synthesize an intermediate frame \(\hat {\mathbf {I}}_{t}\). We first encode the feature maps, denoted by F_{t−1→t+1} and F_{t+1→t−1}, through the siamese residual dense network. Then, the proposed attention-aware layer synthesizes the warped frames warp_{0} and warp_{1} based on F_{t−1→t+1} and F_{t+1→t−1}. After obtaining the warped frames, the proposed hybrid network generates the interpolated frame \(\hat {\mathbf {I}}_{t}\) by element-wise linear fusion.
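The final blending step can be sketched as follows. This is a minimal numpy illustration of element-wise linear fusion of the two warped frames; the per-pixel weight map `mask` is a hypothetical stand-in for whatever blending weights the hybrid network actually predicts.

```python
import numpy as np

def fuse_warped(warp0, warp1, mask):
    """Element-wise linear fusion of two warped frames.

    `mask` is a hypothetical per-pixel weight in [0, 1]; in the paper the
    hybrid network learns the blending, here it is simply an input.
    """
    return mask * warp0 + (1.0 - mask) * warp1

w0 = np.full((4, 4), 2.0)
w1 = np.full((4, 4), 4.0)
print(fuse_warped(w0, w1, np.full((4, 4), 0.5)))  # every pixel is 3.0
```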
Dataset and preprocessing
Since there is no public dataset for the EM image interpolation task, we generate the ground truth for two major types of EM images: ssTEM images and ATUM images. To be specific, we use the ssTEM images from the CREMI dataset provided by the MICCAI 2016 Challenge on https://cremi.org/, and the ATUM images generated from our homegrown mouse brain dataset. The CREMI dataset consists of three datasets, each consisting of two 5 µm^3 volumes (training and testing, each 1250 × 1250 × 125 pixels) of serial section EM of the adult fly brain. Each volume has neuron and synapse labelings and annotations for pre- and postsynaptic partners. Taking CREMI’s padded version A dataset as an example, we first convert the hdf5-format A dataset into png format to obtain 200 images with a resolution of 3072×3072. After that, we utilize a template matching algorithm to align three consecutive images. Then we traverse from left to right and top to bottom with a stride of 512, crop the three aligned consecutive images into 512×512 patches, and save them as a sample. Finally, samples with defects, weak continuity, or substantial differences in blurring are all deleted. To reduce the difference in brightness and contrast between the three consecutive images in each sample, we perform histogram specification on both datasets. The processed CREMI dataset and mouse brain dataset are named cremi_triplet and mouse_triplet, respectively. Each dataset adopts a triplet as a training sample, where each triplet contains three consecutive EM images with a resolution of 512×512 pixels. There are 3,652, 2,631, 1,333, and 2,674 triplets in cremi_triplet A, cremi_triplet B, cremi_triplet C, and mouse_triplet, respectively. Each dataset is divided into training, validation, and test sets in a ratio of 3:1:1.
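The cropping pass above can be sketched in a few lines. This is a minimal sketch, assuming the three slices are already aligned; the function name and the in-memory representation are illustrative, not the paper's actual pipeline (which also performs alignment, filtering, and histogram specification).

```python
import numpy as np

def crop_triplets(imgs, patch=512, stride=512):
    """Crop three aligned consecutive EM slices into patch-size triplets.

    imgs: list of three aligned 2-D arrays with identical shapes.
    Traverses left-to-right, top-to-bottom with the given stride and
    returns a list of [slice0, slice1, slice2] patch triplets.
    """
    h, w = imgs[0].shape
    samples = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            samples.append([im[y:y + patch, x:x + patch] for im in imgs])
    return samples

# A 3072x3072 slice with stride 512 yields a 6x6 grid of 512x512 patches.
imgs = [np.zeros((3072, 3072), dtype=np.uint8) for _ in range(3)]
print(len(crop_triplets(imgs)))  # 36
```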
Siamese residual dense network
For the feature extractor, the pooling in a U-Net can damage context information, making intermediate frame synthesis difficult. We utilize the residual dense network [32] as the basic feature extractor to preserve structured information when generating the hierarchical features of the input frames. As shown in Fig. 1, the residual dense network (RDN) mainly consists of three parts: a shallow feature extraction net (SFENet), residual dense blocks (RDBs), and dense feature fusion (DFF). Besides, the frame interpolation task requires two consecutive frames as input to generate intermediate frames. Here, a siamese structure is adopted, as illustrated in Fig. 1, which preserves the temporal information between consecutive frames while generating hierarchical features and contributes to decreasing the computational consumption.
Attention-aware layer
After extracting hierarchical features through the siamese residual dense network, the proposed AAL, based on multi-level sparse self-attention, replaces kernel estimation. The key of the proposed multi-level sparse self-attention lies in the multi-level decomposition of the original dense affinity matrix A, each time decomposing the dense affinity matrix A into the product of two sparse block affinity matrices A^{L} and A^{S}. By combining multi-level decomposition, long-range attention, and short-range attention, the pixel at each position can be synthesized from the information of all input positions. We demonstrate how to estimate the long-range attention matrix A^{L} or the short-range attention matrix A^{S} and perform multi-level decomposition in Fig. 2.
Self-attention
The self-attention [33] scheme is described as below,

\[\mathbf {A}=softmax\left (\frac {(\mathbf {W_{f}}\mathbf {X})^{T}(\mathbf {W_{g}}\mathbf {X})}{\sqrt {d}}\right),\qquad \mathbf {Z}=\mathbf {W_{v}}\left ((\mathbf {W_{h}}\mathbf {X})\,\mathbf {A}^{T}\right)\]

In the above formulation, \(\mathbf {X}\in {\mathbb {R}^{C \times N}}\) is the input feature map, \(\mathbf {A}\in {\mathbb {R}^{N \times N}}\) is the dense affinity matrix, and \(\mathbf {Z}\in {\mathbb {R}^{C \times N}}\) is the output feature map. \(\mathbf {W_{{g}}, W_{{f}}, W_{{h}}}\in {\mathbb {R}^{\Bar {C}\times C}}\) and \(\mathbf {W_{{v}}}\in {\mathbb {R}^{C \times \Bar {C}}}\) are learned weight matrices, implemented as 1×1 convolutions. The reduced channel number \(\Bar {C}\) is set to C/k, where k=1,2,4,8. The scaling factor d alleviates the small-gradient problem of the softmax function according to [25], and \(d=\frac {C}{2}\).
In addition, the output of the attention layer is multiplied by a scale parameter and added back to the input feature map. Therefore, the final output is

\[\mathbf {Y}=\gamma \mathbf {Z}+\mathbf {X}\]

where γ is a learnable scalar initialized to 0. Introducing the learnable γ allows the network to first rely on the cues in the local neighborhood, since this is easier, and then gradually learn to assign more weight to the non-local evidence. As shown in Fig. 3, we find that in the training phase, the critical parameter γ slowly increases from its initial value of zero with a small slope, then the rate of increase gradually becomes larger, and finally the curve becomes stable.
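The scheme above can be sketched in numpy. This is a minimal sketch under stated assumptions: the exact arrangement of the four projections follows the standard SAGAN-style formulation consistent with the dimensions given in the text, and the 1×1 convolutions are flattened into plain matrix multiplies.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wf, Wg, Wh, Wv, gamma=0.0):
    """Dense self-attention with the learnable residual scalar gamma.

    X: (C, N) feature map; Wf, Wg, Wh: (C_bar, C); Wv: (C, C_bar).
    A is the dense N x N affinity matrix; the output is gamma * Z + X.
    """
    C = X.shape[0]
    d = C / 2.0                                            # scaling factor d = C/2
    A = softmax((Wf @ X).T @ (Wg @ X) / np.sqrt(d), axis=-1)
    Z = Wv @ ((Wh @ X) @ A.T)                              # aggregate all positions
    return gamma * Z + X

rng = np.random.default_rng(0)
C, Cb, N = 8, 4, 16
X = rng.normal(size=(C, N))
Ws = [rng.normal(size=(Cb, C)) for _ in range(3)] + [rng.normal(size=(C, Cb))]
out = self_attention(X, *Ws, gamma=0.0)
print(np.allclose(out, X))  # True: with gamma = 0 the layer starts as identity
```

Note how the gamma = 0 initialization makes the layer an exact identity at the start of training, matching the curve described in Fig. 3.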
Long-range attention
Long-range attention applies self-attention to subsets of positions with long spatial interval distances. As shown in Fig. 2, a permutation is first adopted on the input feature map X to compute X^{L}=Permute(X). Then, X^{L} is divided into \(\mathcal {P}\) parts, each containing \(\mathcal {Q}\) adjacent positions (\(N=\mathcal {P}\times \mathcal {Q}\)). Here,

\[\mathbf {A}_{p}^{L}=softmax\left (\frac {(\mathbf {W_{f}}\mathbf {X}_{p}^{L})^{T}(\mathbf {W_{g}}\mathbf {X}_{p}^{L})}{\sqrt {d}}\right),\qquad \mathbf {Z}_{p}^{L}=\mathbf {W_{v}}\left ((\mathbf {W_{h}}\mathbf {X}_{p}^{L})\,(\mathbf {A}_{p}^{L})^{T}\right)\]

where \(p=1,...,\mathcal {P}\), each \(\mathbf {X}_{p}^{L} \in \mathbb {R}^{\mathcal {C} \times \mathcal {Q}}\) is a subset of \(\mathbf {X}^{L}\), \(\mathbf {A}_{p}^{L} \in \mathbb {R}^{\mathcal {Q} \times \mathcal {Q}}\) is the sparse affinity matrix based on all the positions from \(\mathbf {X}_{p}^{L}\), and \(\mathbf {Z}_{p}^{L} \in \mathbb {R}^{\mathcal {C} \times \mathcal {Q}}\) is the updated output feature map based on \(\mathbf {X}_{p}^{L}\). All other parameters, including W_{f},W_{g},W_{v},W_{h},d, are the same as in the “Self-attention” section. Finally, all the \(\mathbf {Z}_{p}^{L}\) are merged to acquire the output feature map Z^{L}. From the equations above, the actual affinity matrix of long-range attention is

\[\mathbf {A}^{L}=\left [\begin {array}{cccc}\mathbf {A}_{1}^{L} & \mathbf {0} & \cdots & \mathbf {0}\\ \mathbf {0} & \mathbf {A}_{2}^{L} & \cdots & \mathbf {0}\\ \vdots & \vdots & \ddots & \vdots \\ \mathbf {0} & \mathbf {0} & \cdots & \mathbf {A}_{\mathcal {P}}^{L}\end {array}\right ]\]

which shows that only the small affinity blocks on the diagonal are non-zero.
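The interlacing permutation described above can be sketched with a reshape and transpose. This is a minimal sketch; the stride-based sampling is one common implementation of "grouping positions with long spatial interval distances", and the function name is illustrative.

```python
import numpy as np

def interlace_parts(X, P):
    """Split a (C, N) feature map into P long-range groups (N = P * Q).

    Group p collects positions p, p+P, p+2P, ..., i.e. pixels with long
    spatial interval distances, matching the permutation described above.
    """
    C, N = X.shape
    Q = N // P
    XL = X.reshape(C, Q, P).transpose(0, 2, 1)  # (C, P, Q): stride-P sampling
    return [XL[:, p, :] for p in range(P)]

X = np.arange(8, dtype=float).reshape(1, 8)  # positions 0..7 encoded as values
parts = interlace_parts(X, P=2)
print(parts[0])  # [[0. 2. 4. 6.]] - positions spaced P apart
```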
Short-range attention
Short-range attention applies self-attention to subsets of positions with short spatial interval distances. The decomposition principle is similar to that of the long-range attention mechanism.
Multi-level decomposition
The combination of long-range attention and short-range attention can effectively model global dependence. However, computing the small affinity matrices \(\mathbf {A}_{p}^{L}\) of long-range attention is still not very efficient, so we continue to decompose the sub-feature map \(\mathbf {X}_{p}^{L}\). Here, we only perform a two-level decomposition. As illustrated in Fig. 2, we first adopt a permutation on the input feature map X to compute X^{L} and divide X^{L} into \(\mathcal {P}\) parts. Second, we apply a permutation on the input sub-feature map \(\mathbf {X}_{p}^{L}\) to compute \(\mathbf {X'}_{p}^{L}=Permute\left (\mathbf {X}_{p}^{L}\right) = \left [{\mathbf {X'}_{p1}^{L}},{\mathbf {X'}_{p2}^{L}},...,{\mathbf {X'}_{p\mathcal {P'}}^{L}}\right ],N'=\mathcal {P' \times Q'}\). The parameters here are analogous to those of long-range attention. Then, we repeat the previous long-range attention and short-range attention steps in sequence to calculate \(\mathbf {Z}_{p}^{L_{L}}\) and \(\mathbf {Z}_{p}^{L_{S}}\). After acquiring the updated output feature map based on the input sub-feature map \(\mathbf {X}_{p}^{L}\), we merge all the \(\mathbf {Z}_{p}^{L_{S}}\) to acquire the output feature map Z^{L}. Finally, the output feature map Z^{S} is obtained by performing short-range attention on Z^{L} directly.
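The claim that chaining the two sparse passes recovers global dependence can be checked on a toy grid. This is a small sketch, assuming stride-based long-range groups and contiguous short-range groups; it verifies reachability of information flow, not the attention weights themselves.

```python
import numpy as np

# Check that one long-range pass followed by one short-range pass lets every
# position aggregate information from every other position (N = P * Q).
def group_mask(N, groups):
    """Binary matrix with M[i, j] = 1 iff i and j share a group."""
    M = np.zeros((N, N), dtype=int)
    for g in groups:
        for i in g:
            for j in g:
                M[i, j] = 1
    return M

N, P = 16, 4
Q = N // P
long_groups = [[p + P * q for q in range(Q)] for p in range(P)]    # stride-P
short_groups = [[p * Q + q for q in range(Q)] for p in range(P)]   # contiguous

# Short-range after long-range: position i hears from j if some intermediate
# k shares i's short group and j's long group.
reach = group_mask(N, short_groups) @ group_mask(N, long_groups)
print((reach > 0).all())  # True: the composition is globally connected
```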
Complexity of the attention-aware layer
Given an input feature map of size H×W×C, we analyze the computation cost of self-attention [33], interlaced sparse self-attention [30], and our proposed method.
The complexity of dense self-attention is quadratic in the number of positions, since the dense affinity matrix has size HW×HW, while interlaced sparse self-attention and our proposed method replace it with products of sparse block affinity matrices. We divide the height dimension into \(\mathcal {P}_{h}\) parts and the width dimension into \(\mathcal {P}_{w}\) parts in long-range attention, and into \(\mathcal {Q}_{h}\) and \(\mathcal {Q}_{w}\) parts in short-range attention, at the first level. At the second level, we divide the height \(\mathcal {Q}_{h}\) and width \(\mathcal {Q}_{w}\) again as in the first level. Here, \(H=\mathcal {P}_{h}\mathcal {Q}_{h}, W=\mathcal {P}_{w}\mathcal {Q}_{w}, \mathcal {Q}_{h}=\mathcal {P'}_{h}\mathcal {Q'}_{h}, \mathcal {Q}_{w}=\mathcal {P'}_{w}\mathcal {Q'}_{w}\). The complexity of interlaced sparse self-attention [30] is minimized to \(\mathcal {O}(4HWC^{2}/k+3(HW)^{\frac {3}{2}}C/k)\) when \(\mathcal {P}_{h}\mathcal {P}_{w}=(HW)^{\frac {1}{2}}\), and the complexity of our method is minimized to \(\mathcal {O}(12HWC^{2}/k+6(HW)^{\frac {4}{3}}C/k)\) when \(\mathcal {P}_{h}\mathcal {P}_{w}=(HW)^{\frac {1}{3}}\). It can be seen that our method has significantly lower computational complexity than interlaced sparse self-attention when processing high-resolution images.
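The minimized complexity expressions above can be compared numerically. A short sketch for a 64-channel 512×512 feature map (the resolution used in the complexity experiments), with k = 1 assumed:

```python
# Minimized complexity expressions quoted above, evaluated as operation counts.
def interlaced_flops(H, W, C, k=1):
    HW = H * W
    return 4 * HW * C**2 / k + 3 * HW**1.5 * C / k

def two_level_flops(H, W, C, k=1):
    HW = H * W
    return 12 * HW * C**2 / k + 6 * HW**(4 / 3) * C / k

H = W = 512
C = 64  # feature map size from the complexity experiments (k = 1 assumed)
print(two_level_flops(H, W, C) < interlaced_flops(H, W, C))  # True
```

Despite the larger constants (12 and 6 versus 4 and 3), the lower exponent on HW makes the two-level scheme cheaper at this resolution, and the gap widens as the image grows.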
Loss function
EM images are quite different from natural images. They are characterized by abundant noise and varying degrees of blur, which makes general loss functions unsuitable for consecutive EM image interpolation. The loss function for training our network is a combination of the style balance loss \(\mathcal {L}_{bs}\), the feature reconstruction loss \(\mathcal {L}_{f}\), and the pixel-wise loss \(\mathcal {L}_{1}\). In all our experiments, ϕ is the 16-layer VGG network pretrained on ImageNet [34]. Specifically, we define the total loss as

\[\mathcal {L}_{total}=\alpha _{1}\mathcal {L}_{bs}+\alpha _{2}\mathcal {L}_{f}+\alpha _{3}\mathcal {L}_{1}\]

where the scalars α_{1},α_{2},α_{3} are trade-off weights, with α_{1}=10^{6} and α_{2}=α_{3}=1.
The proposed style balance loss aims at strengthening the supervision of the style of the middle frame by adopting the styles of the front and back frames. The style reconstruction loss [35, 36] only ensures style consistency between the generated intermediate frame and the ground truth, ignoring the style differences among consecutive EM images. Affected by the complex imaging environment of scanning electron microscopy, there are certain differences in the styles of three consecutive EM images, such as blur, noise, brightness, and contrast. Considering that only frame t−1 and frame t+1 are input in the testing phase, we hope that the intermediate frame t generated by the model can take the styles of frame t−1 and frame t+1 into account and produce an intermediate frame t^{′} with natural style transitions. For this reason, we introduce the style balance loss into the training phase to achieve a better balance between style transition and the style of the ground truth. Here, we define the Gram matrix to be the C_{j}×C_{j} matrix, where ϕ_{j}(x) is the output of the jth activation layer of VGG-16 for the input x; the elements of the Gram matrix are given by

\[G_{j}(x)_{c,c^{\prime }}=\frac {1}{C_{j}H_{j}W_{j}}\sum _{h=1}^{H_{j}}\sum _{w=1}^{W_{j}}\phi _{j}(x)_{h,w,c}\,\phi _{j}(x)_{h,w,c^{\prime }}\]
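The Gram matrix computation above reduces to a single matrix product. A minimal numpy sketch, using a toy array in place of the VGG activation; the normalization by C_j H_j W_j follows Johnson et al. [35]:

```python
import numpy as np

def gram_matrix(features):
    """Normalized Gram matrix of a (C, H, W) activation, as defined above."""
    C, H, W = features.shape
    F = features.reshape(C, H * W)      # flatten spatial dims
    return (F @ F.T) / (C * H * W)     # (C, C), normalized by C*H*W

phi = np.ones((2, 2, 2))   # toy "activation" in place of VGG features
print(gram_matrix(phi))    # 0.5 everywhere: (1*1 summed over 4 positions) / 8
```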
The style reconstruction loss is defined as

\[\mathcal {L}_{style}(\hat {y},y)=\sum _{j}\left \| G_{j}(\hat {y})-G_{j}(y)\right \| _{F}^{2}\]
The style balance loss is defined as

\[\mathcal {L}_{bs}=\beta _{0}\left (\mathcal {L}_{style_{1'0}}-\mathcal {L}_{style_{10}}\right)+\beta _{1}\mathcal {L}_{style_{1'1}}+\beta _{2}\left (\mathcal {L}_{style_{1'2}}-\mathcal {L}_{style_{12}}\right)\]

where the scalars β_{0},β_{1},β_{2} are trade-off weights, empirically set to 0.1, 1, and 0.1 in turn. While making the style of the generated intermediate frame close to the ground truth, the loss also maintains the supervision of frame 0 and frame 2 on the style of the generated intermediate frame 1^{′}. \(\mathcal {L}_{style_{1'j}}\) denotes the style reconstruction loss between the intermediate frame 1^{′} and input frame j, and \(\mathcal {L}_{style_{1j}}\) denotes the style reconstruction loss between the ground truth 1 and input frame j. If frame 1 has a very different style from frame 2, the value of \(\mathcal {L}_{style_{12}}\) is very large. When the style of frame 1^{′} tends to frame 1 and \(sign\left (\mathcal {L}_{style_{1'2}}-\mathcal {L}_{style_{12}}\right)=1\), the third term in the balance loss is large and positive, and the network makes the style of frame 1^{′} approach frame 2 with a large slope; when the style of frame 1^{′} tends to frame 1 and \(sign\left (\mathcal {L}_{style_{1'2}}-\mathcal {L}_{style_{12}}\right)=-1\), the third term is large in absolute value and negative, and the network makes the style of frame 1^{′} move away from frame 2 with a large slope. If frame 1 has a similar style to frame 2, the value of \(\mathcal {L}_{style_{1'2}}\) is very small; when the style of frame 1^{′} tends to frame 1, the third term in the balance loss is close to zero and can be ignored. The same analysis applies to frame 0.
The feature reconstruction loss aims at achieving more realistic results. We adopt the feature reconstruction loss [35] to encourage the synthesized image \(\hat {\mathbf {I}}_{t}\) and the ground truth \(\mathbf {I}_{t}^{gt}\) to have similar feature representations, defined as:

\[\mathcal {L}_{f}=\sum _{j}\frac {1}{C_{j}H_{j}W_{j}}\left \| \phi _{j}(\hat {\mathbf {I}}_{t})-\phi _{j}(\mathbf {I}_{t}^{gt})\right \| _{2}^{2}\]

where C_{j}×H_{j}×W_{j} is the shape of the output feature map ϕ_{j}(x).
The pixel-wise loss aims at reducing the pixel-wise divergence between the intermediate frame and the ground truth. Here, we select the Charbonnier loss [37] to resist outliers. The Charbonnier loss can be presented as:

\[\mathcal {L}_{1}=\sum _{x}\rho \left (\hat {\mathbf {I}}_{t}(x)-\mathbf {I}_{t}^{gt}(x)\right)\]

where \(\hat {\mathbf {I}}_{t}(x)\) is the synthesized frame, \(\mathbf {I}_{t}^{gt}(x)\) is the ground-truth frame, \(\rho (x)=\sqrt {x^{2}+{\epsilon }^{2}}\) is the Charbonnier penalty function, and the constant ε is 10^{−6}.
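The Charbonnier penalty is a one-liner in numpy. A minimal sketch; averaging over pixels (rather than summing) is an assumption for scale convenience, since the text only fixes ρ and ε:

```python
import numpy as np

def charbonnier_loss(pred, gt, eps=1e-6):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), averaged over pixels.

    The mean reduction is an assumption; the paper fixes only rho and
    eps = 1e-6. For |error| >> eps this behaves like a smooth L1 loss.
    """
    return np.sqrt((pred - gt) ** 2 + eps ** 2).mean()

a = np.zeros((4, 4))
b = np.full((4, 4), 3.0)
print(round(charbonnier_loss(a, b), 6))  # 3.0: reduces to mean |error|
```

The ε term keeps the gradient finite at zero error, which is what makes the penalty robust to outliers compared with a plain L2 loss.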
Parameters in SSAN
The proposed models are optimized using Adam [38] with β_{1}=0.9 and β_{2}=0.999. We set the batch size to 3 with synchronized batch normalization. The initial learning rate of the proposed network is set to 10^{−3}. We train the entire model for 30 epochs, then reduce the learning rate by a factor of 0.1 and fine-tune the entire model for another 20 epochs. Training requires approximately three days to converge on one Tesla K80 GPU. The whole SSAN framework is implemented in PyTorch.
For data augmentation, we randomly flip the cropped patches horizontally or vertically and randomly swap their temporal order, for all datasets. All input images are randomly cropped to 512×512.
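The augmentation above can be sketched as follows. This is a minimal sketch; the 0.5 probabilities are assumed, and the temporal swap exchanges only the outer frames since the middle frame is the prediction target.

```python
import random
import numpy as np

def augment_triplet(triplet, rng=None):
    """Random horizontal/vertical flips and temporal-order swap.

    A sketch of the augmentation described above; flip/swap probabilities
    of 0.5 are an assumption.
    """
    rng = rng or random.Random()
    a, b, c = triplet
    if rng.random() < 0.5:  # horizontal flip, applied to all three frames
        a, b, c = (np.flip(x, axis=1) for x in (a, b, c))
    if rng.random() < 0.5:  # vertical flip, applied to all three frames
        a, b, c = (np.flip(x, axis=0) for x in (a, b, c))
    if rng.random() < 0.5:  # swap temporal order of the outer frames
        a, c = c, a
    return [a, b, c]

triplet = [np.arange(16).reshape(4, 4) for _ in range(3)]
out = augment_triplet(triplet, rng=random.Random(0))
print([x.shape for x in out])  # shapes are preserved
```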
Results
In this section, we first conduct an ablation study to analyze the contributions of the proposed loss function, feature extractor, and attention-aware layer. Then, we analyze the advantages of the proposed approach. Finally, we compare the proposed model with state-of-the-art algorithms on different EM datasets. The average interpolation error (IE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM), graphics memory (Memory), floating-point operations (FLOPs), module parameters (Params), run time (RunTime), Dice score (Dice) [39], and F1 score (F1) are computed for comparison. Lower IE, Memory, FLOPs, and RunTime indicate better performance.
Loss function analysis
As shown in Fig. 4, we find that the three sub-terms of the balance loss all gradually approach zero. The first and third terms decrease from small positive values and fluctuate around zero, while the second term decreases from a large positive value; the first and third terms remain smaller than the second. The trend of the sub-terms reflected in the figure matches the behavior intended by the proposed style balance loss.
The proposed method incorporates three types of loss functions: the pixel-wise loss \(\mathcal {L}_{1}\), the feature reconstruction loss \(\mathcal {L}_{f}\), and the style balance loss \(\mathcal {L}_{bs}\). To indicate their respective effects, three different loss functions are adopted to train the proposed network. The first applies only the \(\mathcal {L}_{1}\) loss; we denote this network as “ L_{1}”. The second applies a linear combination of the \(\mathcal {L}_{1}\) and \(\mathcal {L}_{f}\) losses, denoted “ L_{f}”. The third applies a linear combination of the \(\mathcal {L}_{1}\), \(\mathcal {L}_{f}\), and \(\mathcal {L}_{bs}\) losses, denoted “ L_{s}”. As shown in Fig. 5, “ L_{s}” leads to the best visual quality and rich texture information. Results generated by “ L_{s}” are visually pleasing with more high-frequency details. Despite slight deviations from the ground-truth positions, results generated by “ L_{s}” are consistent with biological tissue continuity, and the style of the images is almost the same as the ground truth. As a result, the proposed network adopts this scheme as the loss function.
Model analysis
In this subsection, we analyze the contributions of the two key components in the proposed model: the siamese residual dense network (SRDN) and the attention-aware layer (AAL).
Siamese residual dense network
To validate the effectiveness of the SRDN, we compare it with other well-known feature extractors, including the U-Net, the siamese U-Net (SUNet), and the residual dense network (RDN), on the cremi_triplet datasets and the mouse_triplet dataset. As shown in Table 1, the proposed SRDN feature extractor outperforms previous state-of-the-art feature extractors, achieving almost the best performance on PSNR, SSIM, and IE. Specifically, we demonstrate that the siamese structure, especially the siamese structure of RDN, leads to a substantial improvement on cremi_triplet A and mouse_triplet in terms of PSNR and IE. We also find that RDN without the siamese structure performs worse than U-Net, but with the siamese structure, the performance of RDN improves significantly, surpassing both U-Net and SUNet. We also notice that the siamese structure does not help when applied to U-Net on the cremi_triplet C dataset. Compared with cremi_triplet A and cremi_triplet B, the deformation between three consecutive slices in cremi_triplet C is more complicated and changes drastically, which puts higher requirements on the network depth. The depth of the U-Net backbone is relatively shallow, and the siamese structure introduces bidirectional temporal information. However, the pooling operations in U-Net lead to more information loss on the bidirectional deformation information than on the single-direction information, which accounts for the worse results on cremi_triplet C when the siamese structure is applied to U-Net.
Attention-aware layer
We demonstrate the superiority of the proposed AAL from two aspects: computational complexity and model effectiveness. For the complexity of the attention-aware module, all numbers are tested on a single P40 GPU with CUDA 10.2, and the input feature map resolution is 1×64×512×512. As shown in Table 2, the proposed AAL uses only 23.8% of the GPU memory and 12.8% of the FLOPs of the interlaced sparse self-attention (SSA). Besides, the running time of our method is 275 ms, which is 52 ms faster than SSA. The results sufficiently demonstrate that the computation and memory complexity of the proposed method are substantially lower than those of other self-attention methods. To validate the effectiveness of the proposed attention-aware layer, the feature extractor adopts a siamese residual dense network. After the feature extractor, we append the classic kernel estimation layer (KEL), the state-of-the-art interlaced sparse self-attention layer, and the proposed attention-aware layer, respectively. For the implementation of self-attention, we directly utilize the open-source code [33]. As shown in Table 3, the proposed AAL shows an improvement on the cremi_triplet and mouse_triplet datasets against both KEL and SSA. Especially on mouse_triplet, AAL outperforms SSA with a 0.18 dB gain in terms of PSNR; meanwhile, the interpolation error (IE) is 0.6 lower than SSA.
Analysis of the proposed approach
As shown in Fig. 6, the qualitative visualization results on cremi_triplet B demonstrate the superiority of the proposed method. The intermediate EM images generated by our method are nearly identical to the ground truth in terms of image style, biological tissue continuity, and content texture. The proposed attention perception layer synthesizes each pixel of the intermediate frame from the global domain, so the approach is robust against large deformations, drifts, and noise. Even when many pixels in the input frames and the ground truth are discontinuous, the approach still produces satisfactory results.
Comparisons with state-of-the-art methods
We conducted quantitative and qualitative experiments comparing the proposed approach with the baselines. For the quantitative experiments, we set the loss function of both the proposed method and the baselines to the style-balance loss introduced in this paper and ran all methods under the same experimental environment. Table 4 reports quantitative performance on cremi_triplet A, cremi_triplet B, cremi_triplet C, and the mouse_triplet dataset. The proposed approach performs favorably against all compared methods on all datasets, most notably on the mouse_triplet dataset with a 2.48 dB PSNR gain over SepConv-L_s [9]. For the qualitative experiments, all methods are run under the same experimental environment with different loss functions, to intuitively demonstrate the effect of each loss function on EM images. We evaluate the proposed SSAN against the following CNN-based frame interpolation methods: SepConv-L_1 [9], SepConv-L_f [9], SepConv-L_s [9], DAIN-L_1 [12], and DAIN-L_s [12], in terms of PSNR, SSIM, and IE. As shown in Fig. 7, DAIN-L_1 [12] and DAIN-L_s [12] cannot handle the large deformation well and thus produce ghosting and broken results; the enlarged image also shows missing edges. This confirms that flow-based methods perform poorly on EM images. The SepConv-L_1 [9] and SepConv-L_f [9] methods generate blurred results on membrane structures and mitochondria. The result generated by SepConv-L_s [9] also lacks critical edge information, and there are black noise and white areas inconsistent with the continuity of biological tissue, especially around mitochondria. In contrast, the proposed method handles large deformation well and generates clearer results with complete contours.
Segmentation performance comparisons with state-of-the-art methods
Table 5 shows the segmentation accuracy attained by each method. In all cases, our proposed SSAN algorithm performs better than the other two methods in both Dice score and F1 score. This is because it uses not only the SRDN, which avoids the information loss caused by pooling operations and retains the temporal information, but also the sparse self-attention, which synthesizes each pixel with long-range dependencies taken into account. Thus, the inter-slice images generated by SSAN provide clear, accurate membrane boundaries with fewer artifacts despite large deformation, noise, and blur. Figure 8 visualizes the membrane segmentation results on intermediate images generated by the different methods, all trained with the same L_s loss. The intermediate frame generated by the proposed method has fewer artifacts, and its membrane boundaries are more complete. The flow-based method is unstable on EM images and even produces severe white spots.
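For reference, the Dice score [39] used in Table 5 has a simple closed form on binary masks. The sketch below is a minimal illustration on flat 0/1 masks; the example masks and the small epsilon for empty-mask stability are illustrative, and the paper's exact evaluation pipeline is not reproduced here:

```python
# Hedged sketch of the Dice score on binary segmentation masks:
# Dice = 2|P ∩ G| / (|P| + |G|). Example inputs are illustrative.

def dice_score(pred, gt, eps=1e-8):
    """Compute Dice on flat binary masks (sequences of 0/1)."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)

pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, gt), 3))  # 2*2/(3+3) ≈ 0.667
```

On binary pixel masks, Dice coincides with the pixel-wise F1 score, which is why the two metrics in Table 5 tend to move together.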
Discussion
In this work, we consider the sparse self-attention mechanism and discuss how to introduce this self-attention into consecutive EM image interpolation tasks. On EM images with large deformation, drift, and abundant noise, each pixel of the intermediate frame is aggregated from all positions in the input frames using a self-attention mechanism. Specifically, we highlight three aspects: the feature extraction module, the attention perception mechanism, and the style-balance strategy. We found that U-Net's pooling can damage content information, whereas the residual dense blocks commonly adopted in super-resolution preserve the integrity of content information. The siamese structure in the feature extractor enables the network to extract the temporal information among the input frames. We empirically observe that a two-level sparse strategy decreases the computation and memory complexity substantially while performing better at synthesizing pixels from the input frames' global domain. Given an input feature map of size H×W×C, the complexity of interlaced sparse self-attention [30] can be minimized to \(\mathcal {O}(4HWC^{2}/k+3(HW)^{\frac {3}{2}}C/k)\), and the complexity of the proposed method can be minimized to \(\mathcal {O}(12HWC^{2}/k+6(HW)^{\frac {4}{3}}C/k)\). Thus, our method has significantly lower computational complexity than interlaced sparse self-attention when processing high-resolution feature maps. After obtaining the warped frames, we found that simple averaging alone does not give good results. The sigmoid function is a good alternative: it generates a weight mask for element-wise linear fusion of the two warped frames into the interpolated frame. We also observe that selecting a suitable loss function for training models on EM images is much more complicated than for natural images, especially when robustness against large deformations, drifts, and noise is required.
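The sigmoid-mask fusion described above can be sketched as follows. This is a minimal illustration on flat pixel lists; the variable names (warped0, warped1, mask_logits) are assumptions for this example, and in the actual network the mask is a learned per-pixel map:

```python
# Hedged sketch of the fusion step: a learned mask passed through a sigmoid
# blends two warped frames element-wise, instead of simple averaging.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(warped0, warped1, mask_logits):
    """I_t = m * W0 + (1 - m) * W1, with m = sigmoid(mask_logits) per pixel."""
    return [sigmoid(z) * a + (1.0 - sigmoid(z)) * b
            for a, b, z in zip(warped0, warped1, mask_logits)]

# With zero logits (m = 0.5 everywhere), the fusion reduces to averaging:
print(fuse([2.0, 4.0], [0.0, 0.0], [0.0, 0.0]))  # [1.0, 2.0]
```

The sigmoid constrains each weight to (0, 1), so the fused pixel always stays a convex combination of the two warped candidates, which averaging cannot adapt per pixel.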
We conducted a combined experiment on style loss, perceptual loss, and pixel loss, and found that the ratio 10^6:1:1 produced the most realistic results. We further proposed an adaptive style-balance loss to ensure a natural transition among the styles of three consecutive frames. Finally, our proposed approach outperforms other methods on EM images produced by both ssTEM and ATUM.
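The 10^6:1:1 weighting amounts to a simple weighted sum of the three terms. The sketch below shows only the combination step; the individual loss values are placeholders, since the paper computes the style and perceptual terms from VGG features [34, 35] and the pixel term with a Charbonnier penalty [37]:

```python
# Hedged sketch of the combined objective with the 10^6:1:1 weighting
# (style : perceptual : pixel) found best in the ablation. The loss values
# passed in below are illustrative placeholders, not real training losses.

W_STYLE, W_PERC, W_PIXEL = 1e6, 1.0, 1.0

def total_loss(style_loss, perceptual_loss, pixel_loss):
    return (W_STYLE * style_loss
            + W_PERC * perceptual_loss
            + W_PIXEL * pixel_loss)

# Gram-matrix style losses are typically tiny, hence the large style weight:
print(total_loss(1e-6, 0.5, 0.25))  # ≈ 1.75
```

The large style weight compensates for the small magnitude of Gram-matrix style losses, so that all three terms contribute at a comparable scale during training.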
In the future, one option is to further sparsify the self-attention mechanism to reduce computation and memory consumption. Another option is to refine the loss and propose a novel one better suited to the task of EM image interpolation. A final direction is to design a sparse global-domain kernel estimation method.
Conclusion
In this paper, we propose a novel attention-aware consecutive EM image interpolation algorithm that combines motion estimation and frame synthesis into a single process by adopting the AAL. The proposed AAL implicitly detects large deformations using self-attention information and synthesizes each pixel by effectively establishing long-range dependencies from the input frames. The AAL entirely replaces the traditional kernel estimation convolution method, with low memory and computational consumption. We also exploit the SRDN as the feature extractor to learn hierarchical features and reduce the number of parameters. Furthermore, the proposed adaptive style-balance loss takes the style information of the input EM images into consideration, generating more realistic results. Our SSAN performs more favorably on EM images than flow-based methods because it integrates flow estimation and pixel synthesis through the attention-aware mechanism. Experiments on ssTEM and ATUM images show that the proposed approach compares favorably to state-of-the-art interpolation methods, both quantitatively and qualitatively, and generates high-quality frame synthesis results.
Availability of data and materials
The data and source code in this paper are available online.
References
 1
Afshar P, Shahroudnejad A, Mohammadi A, Plataniotis KN. Carisi: Convolutional autoencoder-based inter-slice interpolation of brain tumor volumetric images. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens: IEEE: 2018. p. 1458–62.
 2
Wu Z, Wei J, Yuan W, Wang J, Tasdizen T. Inter-slice image augmentation based on frame interpolation for boosting medical image segmentation accuracy. arXiv preprint arXiv:2001.11698. 2020.
 3
Nguyen-Duc T, Yoo I, Thomas L, Kuan A, Lee WC, Jeong WK. Weakly supervised learning in deformable EM image registration using slice interpolation. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). Venezia: IEEE: 2019. p. 670–3.
 4
Xie Q, Chen X, Deng H, Liu D, Sun Y, Zhou X, Yang Y, Han H. An automated pipeline for bouton, spine, and synapse detection of in vivo two-photon images. BioData Min. 2017; 10(1):40.
 5
Li W, Liu J, Xiao C, Deng H, Xie Q, Han H. A fast forward 3D connection algorithm for mitochondria and synapse segmentations from serial EM images. BioData Min. 2018; 11(1):24.
 6
Long G, Kneip L, Alvarez JM, Li H, Zhang X, Yu Q. Learning image matching by simply watching video. In: European Conference on Computer Vision. Amsterdam: Springer: 2016. p. 434–50.
 7
Liu Z, Yeh RA, Tang X, Liu Y, Agarwala A. Video frame synthesis using deep voxel flow. In: Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE: 2017. p. 4463–71.
 8
Niklaus S, Mai L, Liu F. Video frame interpolation via adaptive convolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2017. p. 670–9.
 9
Niklaus S, Mai L, Liu F. Video frame interpolation via adaptive separable convolution. In: Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE: 2017. p. 261–270.
 10
Niklaus S, Liu F. Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 2018. p. 1701–10.
 11
Bao W, Lai WS, Zhang X, Gao Z, Yang MH. MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans Pattern Anal Mach Intell. 2019.
 12
Bao W, Lai WS, Ma C, Zhang X, Gao Z, Yang MH. Depth-aware video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE: 2019. p. 3703–12.
 13
Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems. Montreal: MIT press: 2014. p. 2366–74.
 14
Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE: 2015. p. 2650–8.
 15
Wang P, Shen X, Lin Z, Cohen S, Price B, Yuille AL. Towards unified depth and semantic prediction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE: 2015. p. 2800–9.
 16
Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V, Van Der Smagt P, Cremers D, Brox T. FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE: 2015. p. 2758–66.
 17
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2017. p. 2462–70.
 18
Ranjan A, Black MJ. Optical flow estimation using a spatial pyramid network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2017. p. 4161–70.
 19
Grevera GJ, Udupa JK. Shape-based interpolation of multidimensional grey-level images. IEEE Trans Med Imaging. 1996; 15(6):881–92.
 20
Lee TY, Wang WH. Morphology-based three-dimensional interpolation. IEEE Trans Med Imaging. 2000; 19(7):711–21.
 21
Penney GP, Schnabel JA, Rueckert D, Viergever MA, Niessen WJ. Registration-based interpolation. IEEE Trans Med Imaging. 2004; 23(7):922–6.
 22
Sevakula RK, Singh V, Verma NK, Kumar C, Cui Y. Transfer learning for molecular cancer classification using deep neural networks. IEEE/ACM Trans Comput Biol Bioinforma. 2018; 16(6):2089–100.
 23
Shin M, Jang D, Nam H, Lee KH, Lee D. Predicting the absorption potential of chemical compounds through a deep learning approach. IEEE/ACM Trans Comput Biol Bioinforma. 2016; 15(2):432–40.
 24
Vijayan V, Milenković T. Multiple network alignment via multimagna++. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 15(5):1669–82.
 25
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł., Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems. Long Beach: MIT press: 2017. p. 5998–6008.
 26
Parmar N, Vaswani A, Uszkoreit J, Kaiser L, Shazeer N, Ku A, Tran D. Image Transformer. In: ICML: 2018. p. 4052–4061. http://proceedings.mlr.press/v80/parmar18a.html.
 27
Wang X, Girshick R, Gupta A, He K. Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 2018. p. 7794–803.
 28
Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE: 2019. p. 3146–54.
 29
Yuan Y, Wang J. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916. 2018.
 30
Huang L, Yuan Y, Guo J, Zhang C, Chen X, Wang J. Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273. 2019.
 31
Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. 2019.
 32
Zhang Y, Tian Y, Kong Y, Zhong B, Fu Y. Residual dense network for image restoration. IEEE Trans Pattern Anal Mach Intell. 2020. https://ieeexplore.ieee.org/abstract/document/8964437.
 33
Zhang H, Goodfellow I, Metaxas D, Odena A. Self-attention generative adversarial networks. In: International Conference on Machine Learning. PMLR: 2019. p. 7354–7363.
 34
Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In: International Conference on Learning Representations: 2015.
 35
Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. Amsterdam: Springer: 2016. p. 694–711.
 36
Gatys LA, Ecker AS, Bethge M. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 2016. p. 2414–23.
 37
Charbonnier P, Blanc-Feraud L, Aubert G, Barlaud M. Two deterministic half-quadratic regularization algorithms for computed imaging. In: Proceedings of 1st International Conference on Image Processing. Texas: IEEE: 1994. p. 168–72.
 38
Kingma DP, Ba J. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR). 2015.
 39
Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945; 26(3):297–302.
 40
Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention. Munich: Springer: 2015. p. 234–41.
Acknowledgements
The authors would like to thank Mr. Lixin Wei and his colleagues (Institute of Automation, CAS) for the Zeiss Supra55 SEM and technical support.
Funding
The financial support of NSFC (61701497), Instrument function development innovation program of Chinese Academy of Sciences (No. E0S92308; 282019000057), and Strategic Priority Research Program of Chinese Academy of Science (XDB32030200) is appreciated.
Author information
Affiliations
Contributions
Conceived and designed the experiments: ZW. Performed the experiments: ZW, JL, XC. Analyzed the data: ZW, JL, GL, HH. Contributed materials: GL, HH. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, Z., Liu, J., Chen, X. et al. Sparse selfattention aggregation networks for neural sequence slice interpolation. BioData Mining 14, 10 (2021). https://doi.org/10.1186/s1304002100236z
Keywords
 Slice interpolation
 Biological tissue recovery
 EM images
 Sparse self-attention network
 Adaptive style-balance loss