## Abstract

The ability to serially interrogate single biomolecules with femtosecond X-ray pulses from free-electron lasers has ushered in the possibility of determining the three-dimensional structure of biomolecules without crystallization. However, recovering a sample's structure from very many noisy and incomplete diffraction patterns can be daunting. In this review, we introduce a simple analogue of this imaging workflow, use it to describe a structure reconstruction algorithm based on the expectation maximization principle, and consider the effects of extraneous noise. Such a minimal model can aid experiment and algorithm design in future studies.

## 1. Introduction

A potential route to crystal-free three-dimensional structure determination, studied in [1], involves injecting very many copies of a biomolecule to serially diffract from a steady stream of intense, femtosecond pulses from X-ray free-electron lasers (XFELs) [2–6]. These femtosecond X-ray pulses serve as ultrafast shutters, because photons in these pulses can effectively outrun the Coulombic explosion [7–9] and electronic damage [10] initiated on the samples that they illuminate. This idea of ‘diffraction-before-destruction’ was shown feasible for pulses in the 5–50 fs range, for soft X-rays [11] and hard X-rays [12–14]. Further demonstrations of two-dimensional imaging using XFELs include cells [15,16], viruses [17,18] and submicrometre aerosols [19].

For three-dimensional single-particle imaging using XFELs, photon-limited, two-dimensional diffraction patterns from similar biomolecules are proposed to be algorithmically oriented and assembled into a three-dimensional diffraction volume, from which the biomolecule's three-dimensional structure is recovered via iterative phasing [20]. Among numerous reconstruction algorithms designed for this task [21–24], this paper focuses on the EMC algorithm^{1} [25–27], which is based on the expectation maximization principle [28]. EMC succeeds even with extremely noisy patterns and additional unmeasured parameters besides orientations [29].

Ideally, because the raw data measured for reconstruction are randomly oriented, noisy spherical sections (Ewald sphere tomograms) of a minimally damaged biomolecule's Fourier intensities, we refer to this XFEL imaging technique as *serial femtosecond cryptotomography* (SFC; figure 1) [29]. Although the recovery of structural information from SFC data may appear forbiddingly complex, the primary ideas behind it can be explained with a minimal model. We introduce such a minimal model, use it to explain the EMC algorithm, and study the effects of extraneous photons on structure determination.

## 2. A two-dimensional analogue of serial femtosecond cryptotomography

We introduce SFC by summarizing its key elements as described in [25,26]. Let *W* denote a biomolecule's three-dimensional diffraction intensity distribution. Ewald sphere tomograms of the biomolecule are captured on photon detectors whose pixels are labelled by the index *i* = 1 *… M*_{pix}. If the detector samples the Ewald sphere tomogram of the biomolecule in the reference orientation at three-dimensional reciprocal space vectors *q*_{i}, these samples occur at **R**(*Ω*) · *q*_{i} for a biomolecule that suffers a rotation **R**^{−1}(*Ω*). Because *W* is typically band- and resolution-limited, a sufficiently fine discretization of all possible three-dimensional rotations *Ω*_{j=1…M_rot} can represent the full set of SFC patterns *W*(**R**(*Ω*_{j}) · *q*_{i}).

Because each SFC scattering datum *K* is effectively a random, photon-limited sample of the set of all possible tomograms (*W*(**R**(*Ω*_{j}) · *q*_{i}) ≡ *W*_{ji}), we define a Poisson noise model for the probability of observing a noisy scattering pattern *K* from a particular tomogram *W*(*Ω*_{j}):

*P*(*K*|*W*(*Ω*_{j})) = ∏_{i=1}^{M_pix} *W*_{ji}^{K_i} e^{−W_{ji}}/*K*_{i}!   (2.1)

In principle, the formalism in the following sections remains unchanged with alternative parametrized noise models (see appendix A (a)). However, inaccurate models may yield less effective reconstructions.^{2}

We now pose a pedagogical analogue that shares key features with SFC. Recall that the goal in SFC is to recover the true three-dimensional intensities from a large number of noisy, unoriented data {*K*}. First, we limit all data orientations to one of four that are related by *π*/2 rotations about a common axis. Second, instead of a set of three-dimensional diffraction intensities, the model *W* is a two-dimensional image contrast (figure 2) comprising pixels whose positions on the unrotated image are *x*_{i} instead of *q*_{i}. Furthermore, we project the histogram of pixel values onto an exponential distribution ([30] describes histogram projections), to mimic that from diffraction intensities of uncorrelated scatterers^{3} [25]. Finally, the set of data *K* are still drawn from the set *W*(**R**(*Ω*_{j=0…3}) · *x*_{i}), which are two-dimensional images. We refer to these analogous *K* as ‘tomograms’ to be consistent with three-dimensional SFC.
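The data-generation steps above (*π*/2 rotations, histogram projection onto an exponential distribution, Poisson sampling) can be sketched in numpy; the image size, smoothing and ~10-photon budget below are illustrative assumptions, not the published simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 32                                   # illustrative model size (M x M pixels)

# A smooth random "contrast": low-pass-filtered noise.
fx, fy = np.meshgrid(np.fft.fftfreq(M), np.fft.fftfreq(M))
W = np.abs(np.fft.ifft2(np.fft.fft2(rng.random((M, M))) * (np.hypot(fx, fy) < 0.1)))

# Histogram projection: replace the sorted pixel values with sorted samples
# of an exponential distribution (rank matching), mimicking the intensity
# statistics of uncorrelated scatterers.
target = np.sort(rng.exponential(size=M * M))
flat = W.ravel()
flat[np.argsort(flat)] = target
W = flat.reshape(M, M)

# Scale the model so each noisy "tomogram" carries ~10 photons on average.
W *= 10.0 / W.sum()

# The four tomograms: the model rotated by multiples of pi/2.
tomograms = [np.rot90(W, k) for k in range(4)]

# A stream of unoriented, Poisson-limited data with hidden orientations.
orientations = rng.integers(0, 4, size=1000)
data = [rng.poisson(tomograms[j]) for j in orientations]
mean_photons = np.mean([K.sum() for K in data])
```

The rank-matching assignment preserves the spatial ordering of pixel values while forcing the value histogram onto the target distribution.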

Both three-dimensional SFC and its two-dimensional image reconstruction analogue operate on unoriented data that obey the same noise model in (2.1). In both cases, we wish to identify a single model *W* that is most likely to result in a stream of noisy data {*K*} that obey (2.1). The primary difference between them is that the two-dimensional image reconstruction problem contains fewer spatial degrees of freedom.^{4} The datasets in both cases are redundant: when correctly oriented, any pair {*K*_{1}, *K*_{2}} in SFC intersects along a common arc in the target object's three-dimensional diffraction volume [1], whereas in the two-dimensional analogue the pixels of one image datum *K* can be mapped one-to-one onto those of another via an in-plane rotation.

Although this two-dimensional image reconstruction completes when the ‘tomograms’ are assembled, a further step of iterative phase retrieval is needed for structure determination in three-dimensional SFC [26]. Although phase retrieval [31] is separated here from orientation discovery to explicitly study the latter, the interplay between the two was explored in [25].

## 3. The expand, maximize, compress algorithm

Reconstructing an object's intensities in SFC can be related to cluster analysis based on distribution models, where one tries to probabilistically cluster many scattering data {*K*} into tomograms that are consistent, under the noise model (2.1), with the data they comprise. Only when these clusters are correctly assembled are the object's true Fourier intensities recovered.

What is notable in SFC is that these clusters are also related by rotations. Enforcing these orientational relationships between clusters allows data sharing between clusters, which is beneficial when the datasets {*K*}, and hence the clusters they form, are exceptionally noisy. Without such relationships, once the photons in noisy images have been identified with an orientation cluster, they are no longer available for signal-averaging along known intersections with other orientation clusters. In principle, such data sharing can be extended to other known relationships between clusters.

*Ab initio* reconstruction is performed iteratively, starting from a guess model comprising a random set of numbers. We *expand* this guess model *W* into known tomograms *W*_{j}, use expectation-*maximization* [28] to ‘soft’-cluster {*K*} into these tomograms, and *compress* these ‘maximized’ tomograms via their mutual orientation relationships into a single, more compatible new model *W*′. In three-dimensional SFC, one can further impose that *W* are the band-limited diffraction intensities of a real-valued electron density distribution [25]. The output of the final compression step is then ready for another round of EMC. This is iterated until *W* numerically converges (figure 2).

In EMC, the definiteness of the orientation *Ω* of each image *K* is replaced by a probability distribution over the orientations, *P*(*Ω|K*, *W*). As the guess model *W* is iteratively improved, datasets {*K*} gradually maximize their likelihood given the noise model (2.1) by adjusting their probability masses between orientation clusters. When reconstructions succeed, *P*(*Ω|K*, *W*) for each datum peaks around a small set of orientations [32], and approaches a delta function if *K* were noiseless.

Figure 2 shows a numerical simulation of a successful reconstruction with our two-dimensional analogue. A two-dimensional sample contrast, plus versions of itself rotated by multiples of *π*/2, was prepared. A very large, random selection of these images was Poisson-noise sampled and their orientations were forgotten. EMC was then made to reconstruct the original sample contrast given only these noisy, unoriented datasets. The average data image in the bottom row of figure 2 had approximately 10 photons. A total of 2 × 10^{5} datasets were used for the reconstruction. The specific implementation of the reconstruction is elaborated in appendix A (a), whereas further details on three-dimensional SFC can be found in [26,30]. The computer codes used to reconstruct noisy images such as those in figure 2 are publicly available.^{5}

Reconstructions from different random initial models will naturally converge in the neighbourhood of the true source image contrast if the data tomograms over-constrain the model. When reconstructions from random initializations yield vastly different results, over-fitting of model parameters may have occurred and the assumptions about the datasets used in reconstruction should be revisited.

Regardless of the reconstruction algorithm used, a mutual information criterion can identify when reconstruction attempts on the true model become ambiguous given the average number of photons *N* per data tomogram [25,26]. Measurements {*K*} can be thought of as the output of a noisy channel through which encoded information about the model is sent. A reconstruction algorithm's task is to ‘decode’ *W* using these measurements {*K*}.

This mutual information criterion is pessimistic, because it evaluates success based on the average datum *K*, neglecting the fact that the dataset could fortuitously contain a subset of ‘bright’ data *K*_{bright} sufficient for reconstructions to succeed. However, such optimism is justified only with enough data *M*_{data} for there to be sufficiently many *K*_{bright} [32].

## 4. Extraneous photons

Here, we use the two-dimensional analogue of SFC to explore the consequences of extraneous photons on structure recovery, and to gain insights for a more comprehensive, but computationally expensive, three-dimensional study.

Typically, SFC data will contain extraneous photons scattered from upstream optics, solvent and/or carrier gases [33]. This has two important consequences for SFC. First, owing to the randomness of sample injection (see appendix A (b)), only a fraction of SFC diffraction patterns contain scattered photons from samples, or *hits*, and these hits are contaminated with extraneous photons. These hits must be separated from patterns that only contain extraneous photons, or *blanks*. Second, even if blanks could be completely eliminated, the extraneous photons in hits would still frustrate SFC reconstructions [27]. These consequences are particularly limiting for weakly scattering nanometre-scale biomolecules, where extraneous photons dominate both hits and blanks.

### (a) Separation parameter

The fundamental difficulty here is that although we measure the sum of sample photons and background photons, we seek only the former. A natural scheme to separate these two photon sources subtracts the average background photon count from their sum. Intuitively, the relative error incurred in this scheme should be related to the extent that these two distributions overlap.

We propose a separation parameter, *ψ*, which captures the difficulty of differentiating extraneous photons from sample photons (figure 3). Suppose *b* and Δ*b* denote the total number of background and sample photons in an image, respectively. The simple ratio of the difference between their photon averages to the sum of their standard deviations^{6} defines their separation parameter

*ψ* = Δ*b*/(√*b* + √(*b* + Δ*b*)).   (4.1)

Hits and blanks are most easily identified when *ψ* ≫ 1, which occurs when the average sample signal Δ*b* is large compared with the Poisson fluctuations of both hits and blanks. Appendix A (c) relates *ψ* to the Kullback–Leibler divergence between the distributions of hits and blanks.
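A minimal sketch of (4.1), assuming Poisson counting statistics so that the standard deviations of blanks and hits are √b and √(b + Δb):

```python
import math

def separation(b, delta_b):
    """Separation parameter psi: the difference between the average photon
    counts of hits (b + delta_b) and blanks (b), divided by the sum of
    their Poisson standard deviations."""
    return delta_b / (math.sqrt(b) + math.sqrt(b + delta_b))
```

For example, a 50-photon sample signal on a 100-photon background gives *ψ* ≈ 2.2, whereas a 10-photon signal on the same background gives *ψ* ≈ 0.49 and would be hard to separate.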

### (b) Simultaneous hit detection and orientation discovery

We consider the case when the structure of both the background and sample is initially unknown. This occurs when extraneous photons are primarily owing to sample injection, or if the background changes considerably in time. Ideally, if an unknown contrast can be recovered from incomplete measurements by enforcing the underlying noise model (2.1), then one should also be able to recover the background if its noise model is known.

Complications, however, arise with extraneous background. While a noise model for background plus sample can be formed (A 10), the equivalent model for only sample photons is unclear. As a consequence, a naive EMC implementation can merely recover the sample contrast with background, *W*_{b+s}, instead of the sample contrast alone, *W*_{s}. Although background subtraction could succeed when a model background is additionally recovered, *W*_{s} = *W*_{b+s} − *W*_{b}, it is not mathematically obvious how this subtracted result can improve how we orient data, because the data almost never contain exclusively sample photons. Further, when a data stream is dominated by blanks, or if background photons overwhelm sample photons in hits, EMC will erroneously cluster data into tomograms based primarily on the background features.

To study some of these consequences using the two-dimensional analogue to SFC, we consider the simplistic case where extraneous photons in blanks originate from X-ray pulses interacting with stationary experimental apparatus and hence, unlike samples, have a fixed orientation. A two-dimensional sample contrast (true contrast in the top right box of figure 2) plus versions of itself rotated by multiples of *π*/2 (*hits*), and an average background pattern (*blank*), were prepared.^{7} These five images were randomly sampled with replacement to generate a stream of random, unoriented two-dimensional images, which was further Poisson-sampled to simulate photon-limited noise. The average fraction of hits in this data stream was a generous 0.4 (compare with appendix A (b)).

EMC was made to cluster the datasets into five classes (*Ω*_{j=0…4}): four orientational tomograms of *W*_{b+s}, and a fixed-orientation background *W*_{b}. Orientation relationships were enforced only between the four tomograms in the compression step; no explicit relationship between *W*_{b+s} and *W*_{b} was imposed. Only after the reconstruction converges is the orientationally averaged background subtracted from the signal-averaged *W*_{b+s} reconstruction. Doing so allows nominal hit detection followed by the extraction of photons owing to the sample. More sophisticated background subtraction schemes are deferred to a later study.

From figure 4, we observe that hit detection, at the expense of sample contrast, is still possible when the separation parameter *ψ* ≈ 1, as averaged over the data {*K*}.

To quantify the impact of *ψ* on EMC reconstructions, consider the quantity *P*_{truehit} − *P*_{falsehit}. This difference equals one when the probability of detecting true hits is perfect and no false hits are included in the *W*_{b+s} model. For this reason, we define this quantity as the *hit purity* (elaborated in appendix A (d)). Remarkably, figure 5 shows a clear monotonic relationship between hit purity and the separation parameter *ψ*, and advises us to keep *ψ* well above 1.

We measure EMC's effectiveness in hit detection against the maximum hit purity achievable if a total photon count threshold were used to discriminate hits from blanks (elaborated in appendix A (e)). Of course, determining such an optimal threshold is non-trivial when the structure of both blanks and hits is initially unknown. Figure 3 illustrates the difficulty involved: can one define a threshold photon count to separate the subpopulations of hits and blanks when only the combined histogram of their photon counts is measured? This is especially challenging when the separation parameter is small.

Figure 5 shows that for separation *ψ* > 1, EMC's hit detection approaches the maximum hit purity achievable by threshold-based detection, even though an optimal threshold was not supplied to EMC initially. We speculate that EMC's poorer performance when *ψ* < 1 occurs because background features become dominant in the hits, which then attract blanks towards the background + sample model, *W*_{b+s}. This is consistent with how *P*_{falsehit} falls faster than *P*_{truehit} as *ψ* decreases in figure 4. Nevertheless, when *ψ* ≳ 1, the blanks detected by EMC using only a smaller subset of the data can inform us of a hit-detection photon threshold, which could then be applied more efficiently to the more numerous remaining data. These results from EMC could also provide the population of blanks necessary for an efficient two-pass hit detection based on data likelihood (see appendix A (f)).

## 5. Conclusion

Many probability distributions affect SFC: the random orientations of individual samples, the Poisson statistics of the photons they scatter, the fluctuating character of the X-ray pulses [34], the random arrival of samples into the X-ray interaction region, the presence of extraneous background noise and particle heterogeneity [32]. Understanding the effects of these distributions, and knowing which ones dominate, is critical to the success of SFC. Compared with an extensive and realistic study, the minimal model presented here serves as an efficient platform for testing our understanding of these effects, their mitigation strategies and possibilities for optimization. As an example, this paper studies how the EMC algorithm [26] can detect and reconstruct randomly occurring sample signal despite extraneous photons.

## Acknowledgements

We acknowledge the support of the Lee Kuan Yew Postdoctoral Fellowship; the Human Frontier Science Programme; and the AMOS programme within the Chemical Sciences, Geosciences and Biosciences Division of the Office of Basic Energy Sciences, Office of Science, US DOE. The author thanks Veit Elser, who inspired many ideas in this paper, for his guidance, and Hyung Joo Park for his careful reading of the manuscript. Conversations with Anton Barty, Andrew V. Martin, Filipe R. N. C. Maia, Tomas Ekeberg, Max Hantke and Gijs van der Schot were indispensable to the preparation of this paper.

## Appendix A

#### (a) EMC algorithm in two-dimensional image reconstruction

In the EMC, the expansion step is a straightforward representation of the current model as measurable data tomograms: *W* → *W*(*Ω*_{j}) = *W*_{j}. The expectation maximization step follows, where a new expanded model *W*′_{j} replaces the current one *W*_{j} to extremize the log-likelihood

Q(*W*′|*W*) = ∑_{K} ∑_{j} *P*(*Ω*_{j}|*K*, *W*) log *P*(*K*|*W*′(*Ω*_{j})),   (A 1)

*P*(*Ω*_{j}|*K*, *W*) = *P*(*K*|*W*(*Ω*_{j})) *P*(*Ω*_{j}) / ∑_{j′} *P*(*K*|*W*(*Ω*_{j′})) *P*(*Ω*_{j′}).   (A 2)

This extremization leads to the following update rule for *W* → *W*′:

*W*′_{ji} = ∑_{K} *P*(*Ω*_{j}|*K*, *W*) *K*_{i} / ∑_{K} *P*(*Ω*_{j}|*K*, *W*),   (A 3)

with *P*(*K*|*W*(*Ω*_{j})) defined in (2.1). Although a uniform prior is used for *P*(*Ω*) in (A 3), we can estimate the actual distribution when the model iteratively converges:

*P*(*Ω*_{j}) = (1/*M*_{K}) ∑_{K} *P*(*Ω*_{j}|*K*, *W*).   (A 4)

After the pixels in each orientation representation *W*(**R**(*Ω*_{j=0…3}) · *x*_{i}) have been separately updated, they are compressed to concur on a common model image *W*. Care must be taken to ensure that the compressed model is properly normalized to be compatible with the datasets [26]. Two rudimentary checks are: merely expanding and compressing the model leaves it unchanged; and the average total power of the tomograms is comparable to that of the data. Unlike three-dimensional SFC, pixels in our two-dimensional images rotate perfectly into those of any other orientation, which makes pixel interpolation [26] unnecessary. Hence, we simply average the images in the four orientational tomograms *W*(**R**(*Ω*_{j}) · *x*_{i}) after they have been individually rotated back to the reference orientation:

*W*′(*x*_{i}) = (1/4) ∑_{j=0}^{3} *W*′_{j}(**R**^{−1}(*Ω*_{j}) · *x*_{i}).   (A 5)

_{i}Each reconstruction attempt begins with a random set of real numbers as the initial model *W*^{(0)} to limit the model bias, because expectation maximization algorithms can get trapped in local log-likelihood extrema. To reduce the number of initial iterations necessary, we normalize the sum of contrast values on these random initial models *W*^{(0)} to match the sum of the average data:
A6where *M _{K}* is the number of noisy data images.

Reconstruction attempts in this paper are terminated when the model change Δ*W* between iterations becomes numerically negligible: Δ*W* = ||*W*′ − *W*|| < *δ*_{cut-off}. Of course, convergence does not imply reconstruction success.
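The expand, maximize and compress steps described in this appendix can be sketched as one EMC round in numpy; the sizes, photon counts and convergence test below are illustrative assumptions, not the author's released implementation:

```python
import numpy as np

def emc_iteration(W, data):
    """One expand-maximize-compress round for the four-orientation analogue."""
    eps = 1e-12
    tomo = np.stack([np.rot90(W, k) for k in range(4)])      # expand: (4, M, M)
    # Poisson log-likelihood log P(K|W_j), dropping the K_i! term
    # (constant across orientations j).
    loglik = np.tensordot(data, np.log(tomo + eps), axes=([1, 2], [1, 2]))
    loglik -= tomo.sum(axis=(1, 2))
    loglik -= loglik.max(axis=1, keepdims=True)              # numerical stability
    P = np.exp(loglik)
    P /= P.sum(axis=1, keepdims=True)                        # P(Omega_j | K, W)
    # Maximize: update each tomogram by probability-weighted data averaging.
    new_tomo = np.tensordot(P, data, axes=(0, 0)) / (P.sum(axis=0)[:, None, None] + eps)
    # Compress: rotate tomograms back to the reference frame and average.
    return np.mean([np.rot90(new_tomo[k], -k) for k in range(4)], axis=0)

# Demonstration on synthetic data (sizes and photon budget are illustrative).
rng = np.random.default_rng(7)
M = 8
W_true = rng.random((M, M)) + 0.1
W_true *= 2000.0 / W_true.sum()                              # ~2000 photons/frame
hidden = rng.integers(0, 4, size=200)                        # forgotten orientations
data = np.array([rng.poisson(np.rot90(W_true, j)) for j in hidden], dtype=float)

W = rng.random((M, M)) + 0.1
W *= data.sum() / (len(data) * W.sum())                      # match average data power
for _ in range(100):
    W = emc_iteration(W, data)

# The model converges to the truth up to a global pi/2 rotation.
err = min(np.abs(W - np.rot90(W_true, k)).sum() for k in range(4)) / W_true.sum()
```

Note the global rotational ambiguity: the converged model may be any of the four rotations of the true contrast, so agreement is measured against the best-matching rotation.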

#### (b) Considerations for optimal hit fractions

The probability of illuminating *N*_{F} particles with an X-ray pulse would likely follow a Poisson distribution

*P*(*N*_{F}) = *λ*^{N_F} e^{−λ}/*N*_{F}!,   (A 7)

where *λ* is the average number of particles illuminated per pulse.

Excepting fluctuation correlation methods [23], reconstruction algorithms such as the EMC typically operate on diffraction data {*K*} that are dominated by single-particle patterns. Translating this to a specific requirement, suppose we wish single-particle patterns to outnumber multi-particle patterns roughly a hundredfold,

*P*(*N*_{F} = 1) ≳ 100 ∑_{N_F ≥ 2} *P*(*N*_{F}),   (A 8)

which gives a maximum average number of particles illuminated by pulses *λ* ≈ 0.02. At this hit rate, the number of single-particle patterns collected at a 120 Hz XFEL source will be 141 in a minute, 8469 in an hour and 101 627 in 12 h.
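The quoted pattern rates follow from simple Poisson arithmetic; a sketch assuming an average of λ ≈ 0.02 illuminated particles per pulse:

```python
import math

rate_hz = 120.0                     # XFEL repetition rate
lam = 0.02                          # assumed average particles per pulse

p_single = lam * math.exp(-lam)                   # P(N_F = 1)
p_multi = 1.0 - math.exp(-lam) - p_single         # P(N_F >= 2)
ratio = p_single / p_multi                        # roughly 100:1 single:multi

per_minute = rate_hz * 60 * p_single              # single-particle patterns/min
per_hour = rate_hz * 3600 * p_single
per_12h = rate_hz * 12 * 3600 * p_single
```

This reproduces the quoted 141 patterns per minute, 8469 per hour and 101 627 in 12 h.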

#### (c) Relationship of separation parameter *ψ* to Kullback–Leibler divergence

A natural measure of the difference between two statistical distributions is their Kullback–Leibler divergence. For a well-behaved photon-integrating detector, the probabilities that *K*_{i} photons are measured by its *i*th pixel, from either background (*b*_{i}) or background + sample (*b*_{i}(1 + Δ_{i})) sources, are

*P*_{B}(*K*_{i}) = *b*_{i}^{K_i} e^{−b_i}/*K*_{i}!   (A 9)

*P*_{B+S}(*K*_{i}) = [*b*_{i}(1 + Δ_{i})]^{K_i} e^{−b_i(1+Δ_i)}/*K*_{i}!   (A 10)

We choose the Kullback–Leibler divergence between these two distributions as

D(*P*_{B+S} || *P*_{B}) = ∑_{i} [*b*_{i}(1 + Δ_{i}) log(1 + Δ_{i}) − *b*_{i}Δ_{i}],   (A 11)

where *b* = ∑_{i} *b*_{i} and Δ*b* = ∑_{i} Δ_{i}*b*_{i}. In the limit when Δ_{i} ≪ 1,

D ≈ ∑_{i} *b*_{i}Δ_{i}²/2   (A 12)

*ψ* ≈ Δ*b*/(2√*b*),   (A 13)

which gives us, for uniform Δ_{i} = Δ,

D ≈ 2*ψ*².   (A 14)

Equation (A 14) shows a monotonic relationship between the separation parameter and the Kullback–Leibler divergence between two Poisson distributions. Using (4.1) and (A 11), it is straightforward to numerically verify that this monotonic relationship remains valid for the critical range 0 < Δ < 1 and 0 < *b*_{i} < 1.
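The monotonicity claim can be checked numerically; a sketch assuming a uniform per-pixel background *b* and relative sample signal Δ, using the exact Kullback–Leibler divergence between two Poisson distributions:

```python
import math

def kl_poisson(lam1, lam0):
    """Kullback-Leibler divergence D(Poisson(lam1) || Poisson(lam0))."""
    return lam1 * math.log(lam1 / lam0) - lam1 + lam0

def psi(b, delta):
    """Separation parameter (4.1) with delta_b = b * delta."""
    db = b * delta
    return db / (math.sqrt(b) + math.sqrt(b + db))

b = 5.0
deltas = [0.05 * k for k in range(1, 20)]         # 0 < Delta < 1
psis = [psi(b, d) for d in deltas]
kls = [kl_poisson(b * (1 + d), b) for d in deltas]
```

Both sequences increase monotonically with Δ, and in the small-Δ limit the divergence approaches 2ψ².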

#### (d) Hit purity

We define *P*_{truehit} as the probability that a hit is correctly identified. Hence, the probability that a hit is incorrectly marked as a blank will be

*P*_{falseblank} = 1 − *P*_{truehit}.   (A 15)

A complementary relationship exists for blanks:

*P*_{falsehit} = 1 − *P*_{trueblank}.   (A 16)

Subtracting these two equations gives us a symmetric relationship:

*P*_{truehit} − *P*_{falsehit} = *P*_{trueblank} − *P*_{falseblank},   (A 17)

which we denote *hit purity*. Because EMC does not assign each image as either a hit or a blank, these probabilities are computed here in an averaging fashion across all images {*K*}, weighting each image's class membership by its posterior probabilities ((A 18)–(A 21)), with *P*(*Ω*) defined in (A 4). Hit purities in figure 5 are evaluated only on the converged reconstructions *W*.

#### (e) Hit detection with an intensity threshold

Consider the naive hit-detection scheme where data are classified as blanks (i.e. only background photons are present) if their total photon count does not exceed a threshold *μ*. If the noise models for blanks and hits (i.e. both background and sample photons are present) are described by (A 9) and (A 10), respectively, then the hit purity in (A 17) for this threshold-based hit detection is

purity(*μ*) = ∑_{k>μ} [*P*_{B+S}(*k*) − *P*_{B}(*k*)].   (A 22)

The following definitions apply here: *b* is the average number of background photons per blank; *b*(1 + Δ) is the average number of background plus sample photons per hit; *k* is the total number of photons on a data frame to be considered for hit/blank detection; *μ* is the threshold for discriminating blanks from hits.

For a particular combination of *b* and Δ, we can numerically evaluate the threshold *μ*_{max} = argmax_{μ} purity(*μ*), which maximizes the hit purity in (A 22). The separation parameter *ψ* can also be computed from (4.1). The resultant relationship between the maximum hit purity, purity(*μ*_{max}), and *ψ* is interpolated to form the solid curve in figure 5.
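A sketch of this numerical search, assuming Poisson-distributed total photon counts with illustrative values *b* = 50 and Δ = 0.5 (hypothetical, chosen for demonstration):

```python
import math

def poisson_pmf(k, lam):
    # Poisson probability mass via log-space evaluation (lgamma = log k!).
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def hit_purity(mu, b, delta):
    """Hit purity (A 17) when frames with total count k > mu are called hits:
    P(k > mu | hit) - P(k > mu | blank)."""
    p_le_hit = sum(poisson_pmf(k, b * (1 + delta)) for k in range(mu + 1))
    p_le_blank = sum(poisson_pmf(k, b) for k in range(mu + 1))
    return (1 - p_le_hit) - (1 - p_le_blank)

b, delta = 50.0, 0.5                 # blanks: 50 photons on average; hits: 75
purities = {mu: hit_purity(mu, b, delta) for mu in range(150)}
mu_max = max(purities, key=purities.get)
```

The purity-maximizing threshold sits where the two Poisson densities cross, near k = bΔ/ln(1 + Δ) ≈ 62 for these values.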

#### (f) Hit detection with a log-likelihood threshold

If we can identify a sufficiently large number of blanks, then a two-pass ‘variance-by-hand’ method can be adopted to select a threshold for detecting blanks, which also serves as hit detection to some degree. On the first pass through all blanks, {*K*_{blanks}}, an empirical histogram for *P*_{B} in (A 9) can be computed. On a second pass through {*K*_{blanks}}, assuming the correlations between pixel readouts to be negligible, we compute the data log-likelihood over the pixels (indexed *i* = 1 *… M*_{pix}):

log *L*(*K*) = ∑_{i} log *P*_{B}(*K*_{i}).   (A 23)

From the distribution of these log-likelihoods for the blanks, one can identify the log-likelihood threshold above which we are 99% confident of identifying blanks. This also serves as a 1% significance test for background (i.e. *P*_{falseblank} = 0.01): cases with log-likelihood below this threshold are unlikely to contain background, and by default here are hits.
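The two-pass scheme can be sketched with synthetic data; the pixel count, background levels and doubled hit brightness below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
M_pix = 64
b = rng.uniform(0.5, 2.0, size=M_pix)          # per-pixel background means

# First pass: an empirical per-pixel histogram of P_B from known blanks.
blanks = rng.poisson(b, size=(5000, M_pix))
kmax = blanks.max() + 1
counts = np.stack([np.bincount(blanks[:, i], minlength=kmax)
                   for i in range(M_pix)], axis=1)              # (kmax, M_pix)
pmf = (counts + 1e-12) / blanks.shape[0]

def loglik(frames):
    """Sum_i log P_B(K_i), assuming negligible inter-pixel correlations."""
    k = np.minimum(frames, kmax - 1)           # clip counts beyond the histogram
    return np.log(pmf[k, np.arange(M_pix)]).sum(axis=1)

# Second pass: the threshold below which only 1% of blanks fall.
threshold = np.percentile(loglik(blanks), 1)

# Frames carrying extra (sample) photons are far less likely under P_B.
hits = rng.poisson(2.0 * b, size=(1000, M_pix))
frac_flagged = np.mean(loglik(hits) < threshold)
```

Frames falling below the threshold fail the background significance test and are flagged as hits.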

## Footnotes

One contribution of 27 to a Discussion Meeting Issue ‘Biology with free-electron X-ray lasers’.

↵1 Abbreviated by the sequential operations in the reconstruction algorithm: expand, maximize, compress.

↵2 For example, approximating the sum of a small number of discrete events as a chi-squared distribution could increase the rate of false negatives in hypothesis testing, ch. 14.3.2 [35]. We expect a similar effect here if inaccurate noise models are used.

↵3 The joint probability in (2.1) is indifferent to whether the model *W* belongs to an image contrast or describes a distribution function of diffraction intensities in SFC. As long as their *W*s share the same probability distribution, enforced here with a histogram projection, the effectiveness of using the unordered product in (2.1) for orientation discovery is in principle identical.

↵4 Because SFC reconstructions succeed with a sufficiently fine discrete sampling of orientations (discussed in [26]), this two-dimensional analogue merely reduces the computational complexity by reducing the number of orientations. Computation time scales quadratically with the number of orientations, *M*_{rot}: maintaining the same level of signal-averaging per orientation cluster, the number of data images needed scales linearly with *M*_{rot}, and simplistic algorithms compare every data image against every orientation cluster. We do not expect the number of EMC iterations needed for convergence to scale with *M*_{rot} when the data have sufficiently high signal-to-noise ratios.

↵5 See http://www.duaneloh.com, or email the author.

↵6 For the sake of simplicity, the total numbers of photons received on hits and blanks are assumed to follow Poisson distributions, whose skewness is ignored in the separation parameter. Using variances in place of standard deviations in the denominator of *ψ* would incorrectly drop *ψ*'s needed dependence on *b*.

↵7 The ensemble of average backgrounds is spatially homogeneous, which makes the reconstruction more difficult than if the background were confined to specific regions of the image that one may ignore.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.