## Abstract

A statistical model for X-ray scattering of a non-periodic sample to high angles is introduced. It is used to calculate analytically the correlation of distinct diffraction measurements of a particle as a continuous function of particle orientation. Diffraction measurements with shot-noise are also considered. This theory provides a general framework for a deeper understanding of single particle imaging techniques used at X-ray free-electron lasers. Many of these techniques use correlations as a measure of diffraction-pattern similarity in order to determine properties of the sample, such as particle orientation.

## 1. Introduction

X-ray free-electron lasers (XFELs) can produce pulses of sufficient peak brightness to probe individual viruses, nanoparticles and large biological molecules [1]. With XFEL sources, coherent diffractive imaging techniques can be used to extract structural information about the sample from a diffraction pattern. The study of non-periodic samples is known as ‘single particle imaging’ to distinguish it from crystallography, the study of periodic samples.

Single particle imaging experiments at X-ray laser facilities involve taking a large number of diffraction snapshots of individual particles, typically injected into the path of the X-ray pulses in solution or as an aerosol [2]. A large number of measurements is required, because X-rays diffract very weakly from an individual particle, even with the high intensities of XFEL pulses. It is already possible to collect this data, because X-ray lasers typically have high repetition rates (10 Hz at SACLA [3], 120 Hz at LCLS [4] and in the future 27 kHz at the European XFEL [5]). However, the rapid injection of particles makes it very difficult to measure the individual orientation or conformation of a particle at the time of measurement. There are methods that can determine these parameters from the measured X-ray diffraction as part of the data analysis.

Analysis methods for single particle diffraction often search for information that is common to many different noisy diffraction measurements. For example, pattern-to-pattern correlations are used to classify and average similar diffraction patterns to improve the signal-to-noise ratio before determining orientations by the common arcs method [6]. Bayesian methods use likelihood measures to compare noisy data to a three-dimensional intensity model [7] or a manifold [8]. Graph-theoretic methods [9,10] and geodesic methods [11] map networks formed with Euclidean distances. By treating the dataset as a whole, Bayesian methods and graph-based methods are perhaps the most promising for treating the very low signals expected from individual protein molecules (10^{2}−10^{3} photons). Notably, however, combined correlation–classification/common-arc orientation methods have also produced good results in low-signal, high-resolution simulations and continue to be actively pursued [12,13]. Bayesian- and correlation-based methods have also been combined [14].

Although a diverse array of algorithms has been developed, experimental demonstrations are few and at very low resolution [11,15]. It is still unknown how these algorithms will perform under realistic conditions, with very low signals and at high resolution. While Poisson noise is frequently addressed in simulations, the effects of varying beam parameters, background noise, structural changes or radiation damage have yet to be studied in detail and remain outstanding issues. The further development of algorithms with more realistic simulations is hampered by the time- and memory-intensive nature of high-resolution simulations with a full dataset (10^{5}−10^{6} patterns). In many simulation studies, resolution [7] or the range of orientations [8] is restricted.

An alternative to large-scale simulations is to use statistical models to calculate results analytically. This approach was pioneered in an early study of pattern-to-pattern correlations in a diffraction classification scheme to improve the signal-to-noise of molecular diffraction [6]. The statistical foundations come from well-known results in the theory of crystallographic diffraction [16]. Huldt *et al*. [6] applied these results using a coarse-grained angular approximation sufficient for the limit of perfect alignment and the limit of a large misalignment angle, but lacking the continuity of single particle diffraction. Because single particle diffraction is inherently continuous, it is hard to apply statistical models further without addressing the issue of continuity.

In this study, we present a statistical model for continuous diffraction and continuous changes of particle orientation. We consider the mean pattern-to-pattern correlation as a continuous function of relative particle orientation with and without shot noise. These results compare favourably with simulation studies of virus diffraction that predict a Gaussian dependence with a width characterized by the size of the virus, but not its internal structure [17]. Our approach provides a common framework to derive, and unify, the results of Huldt *et al*. [6] and Ziaja *et al*. [17]. We then use our approach to derive new results concerning the standard deviation of the correlation, with and without shot noise.

There are many different analysis algorithms, but the measures of diffraction similarity they use are much fewer, as explained above. Although we consider only correlations in this paper, we envisage that the statistical tools presented here could be used in future to study Euclidean distances and Bayesian likelihood estimates, so that statistical models can be applied more widely.

## 2. Theoretical model of single particle diffraction

The intensity scattered by a particle is given by the formula

$$I(\mathbf{q}) = \phi\, r_{e}^{2}\, |F(\mathbf{q})|^{2}\, d\Omega \tag{2.1}$$

where *ϕ* is the incident fluence, *r*_{e} is the classical electron radius, *dΩ* is the solid angle and *F*(**q**) is the structure factor of the particle. To model the structure factor, consider the smallest parallelepiped that encloses the particle, defined by three vectors **a**, **b** and **c**. The Fourier transform of the solid parallelepiped is

(2.2)

where **a**\*, **b**\* and **c**\* are reciprocal vectors and *A*, *B*, *C* are the lengths of **a**, **b** and **c**. An orthonormal basis for the Fourier transform of any object enclosed by the parallelepiped can be defined by functions *S*(**q** − **q**_{hkl}), where **q**_{hkl} = *h***a**\* + *k***b**\* + *l***c**\*.

The function *S*(**q**) has the following useful properties

(2.3)

(2.4)

(2.5)

The first two relations are properties of the sinc functions in equation (2.2). The last property can be proved by first translating all the functions *S*(**q** − **q**_{hkl}) by a constant vector **q**_{c} to define a new orthonormal basis

(2.6)

The new basis can be chosen such that **q** coincides with one of its sample points. In the new basis, there is one term in the sum over *hkl* that takes the value 1 and all other terms contribute 0 (using equations (2.3) and (2.4)). Therefore, equation (2.5) holds for all **q** values.
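The sinc-basis properties described above can be illustrated numerically in one dimension. The following is a minimal sketch, assuming unit sample spacing and NumPy's normalised sinc; the names `basis`, `n` and `shift` are hypothetical and not taken from the paper. It checks that each basis function is 1 at its own sample point and 0 at the others, and that both the plain sum and the sum of squares over the basis stay close to 1 at arbitrary positions, including after translating the whole basis by a constant.

```python
import numpy as np

def basis(x, n):
    """1D analogue of S(q - q_hkl): a sinc centred on integer sample point n.

    np.sinc is the normalised sinc, sin(pi*x)/(pi*x), so the sample
    spacing is 1 in these assumed reduced units.
    """
    return np.sinc(x - n)

# Each basis function is 1 at its own sample point and 0 at the others.
n_small = np.arange(-5, 6)
at_samples = basis(n_small[:, None], n_small[None, :])
is_identity = np.allclose(at_samples, np.eye(n_small.size), atol=1e-12)

# Sums over a large (truncated) set of sample points at arbitrary positions.
n = np.arange(-2000, 2001)
x = np.array([0.0, 0.25, 0.5, 0.9])
values = basis(x[:, None], n[None, :])
sum_plain = values.sum(axis=1)
sum_squares = (values ** 2).sum(axis=1)

# Translating every basis function by the same constant leaves the sums unchanged.
shift = 0.37
shifted = basis(x[:, None], n[None, :] + shift)
sum_plain_shifted = shifted.sum(axis=1)
sum_squares_shifted = (shifted ** 2).sum(axis=1)
```

The sums are truncated at 2000 sample points on each side, so they agree with 1 only to within the truncation error (of order 10^{-4} here).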

The structure factor of a single non-periodic particle can be written as

$$F(\mathbf{q}) = \sum_{hkl} F_{hkl}\, S(\mathbf{q} - \mathbf{q}_{hkl}) \tag{2.7}$$

The terms *F*_{hkl} are samples of the Fourier transform of the particle's electron density. At high scattering angles, the terms *F*_{hkl} can be described statistically by assuming that all atoms are randomly located [16]. The real and imaginary parts of *F*_{hkl} are then drawn randomly from the following distribution

(2.8)

where is the mean intensity at a pixel. Table 1 summarizes some useful quantities that can be calculated using this distribution. We assume that the variables *F*_{hkl} are statistically independent, such that

(2.9)

This is exact if the object fills the parallelepiped used to construct *S*(**q** − **q**_{hkl}) and the atomic positions are really uncorrelated; otherwise, it is an approximation. By defining the parallelepiped to be the smallest that can enclose the object, the error from making this approximation is minimized.

To account for centrosymmetry, we need to exclude the case where from equation (2.9). This term contributes only in special cases, such as a rotation by *π* around the beam axis at low resolution. In order to present simple and concise derivations in what follows, we ignore the centrosymmetric contributions, which are not difficult to include if required.
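The random-atom model can be made concrete with a short simulation. The sketch below assumes the standard Wilson-statistics form, in which the real and imaginary parts of each sample are independent zero-mean Gaussians with variance `mu / 2`. That form is an assumption of this example and is not copied from equation (2.8), and `mu`, `F` and `I` are illustrative names. It estimates a few of the low-order moments that such a distribution implies.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0              # assumed mean intensity <|F_hkl|^2>, arbitrary units
n_samples = 400_000

# Real and imaginary parts: independent, zero mean, variance mu/2 each.
sigma = np.sqrt(mu / 2.0)
F = rng.normal(0.0, sigma, n_samples) + 1j * rng.normal(0.0, sigma, n_samples)
I = np.abs(F) ** 2    # intensity of each sample

mean_F = F.mean()                          # expected to be close to 0
mean_I = I.mean()                          # expected to be close to mu
mean_I2 = (I ** 2).mean()                  # expected to be close to 2 * mu**2
cross = (F[:-1] * np.conj(F[1:])).mean()   # distinct samples: close to 0
```

Under these assumptions the intensity is exponentially distributed with mean `mu`, which is why the second moment comes out near `2 * mu**2`.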

## 3. Evaluating correlations

The intensity measured by a pixel is denoted by *I*(**q**), and the intensity measured by the same pixel after the molecule has been rotated is denoted by *I*(**q** + **q**_{α}). A general three-dimensional rotation of the molecule can be specified by three Euler angles. However, it is more convenient to label the displacement vector **q**_{α} by the angular distance between the two **q**-space points sampled by the pixel. For a general rotation, **q**_{α} is not the same for all pixels and *α* is not necessarily equal to any of the Euler angles, though for each pixel it can be calculated from them. The mean value of the correlation between *I*(**q**) and *I*(**q** + **q**_{α}) is then given by

(3.1)

To simplify the notation, the three subscripts *hkl* have been replaced by a single subscript *n*, such that *S*_{n} ≡ *S*(**q** − **q**_{hkl}), and we have defined . A simplification can be made with a judicious change of basis via equation (2.6), such that **q** = **q**_{0}. Then, *m* = *n* = 0 and equation (3.1) simplifies to

(3.2)

In order to calculate the mean and standard deviation of the correlation, we can use the fact that many terms in equation (3.2) are statistically independent. If two random variables, *A* and *B*, are statistically independent, then the following relation holds

$$\langle AB \rangle = \langle A \rangle \langle B \rangle \tag{3.3}$$

Combining this relation with the fact that , we find that there are only two sets of values of *r* and *s* for which is not zero. The first case occurs when *r* = *s* = 0 and contributes the following term

(3.4)

The second case arises when *r* = *s* ≠ 0 and contributes

(3.5)

Using the results from table 1 and combining the two non-zero terms, we find that

(3.6)

The predicted angular dependence of the correlation is approximately Gaussian for small angles *α*:

(3.7)

where a resolution shell at a magnitude *q* has been considered, such that |**q**_{α}| = *qα*. The width of the angular dependence is close to, but not exactly, that given by the simulations reported in [17]. In that paper, the half-width at half maximum value was found to be 1/(4*qR*), whereas equation (3.7) predicts a narrower width of . The discrepancy is around 20% and is due to the fact that the statistical independence of *F*_{hkl} (equation (2.9)) is only approximately true for the icosahedral virus particle used in the simulations in reference [17]. Nevertheless, the theory presented here does account for the most significant contribution to the angular dependence of the correlation.
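The behaviour of the mean point-to-point correlation can be explored with a one-dimensional Monte Carlo toy version of the model. This is a sketch under stated assumptions: independent complex Gaussian coefficients with mean intensity `mu`, a sinc basis with unit spacing, and a truncated set of sample points. It is not the paper's three-dimensional calculation. The closed form used for comparison, `mu**2 * (1 + sinc(delta)**2)`, follows from the Gaussian moment theorem for this toy model. It is an assumption of the example and is not quoted from equation (3.6). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 1.0                      # assumed mean intensity
n = np.arange(-30, 31)        # truncated 1D sample points (stand-in for hkl)
trials = 50_000

# Independent complex Gaussian coefficients F_n with <|F_n|^2> = mu.
sigma = np.sqrt(mu / 2.0)
F_n = (rng.normal(0.0, sigma, (trials, n.size))
       + 1j * rng.normal(0.0, sigma, (trials, n.size)))

def intensity(x):
    """|F(x)|^2 with F(x) = sum_n F_n sinc(x - n); one value per trial."""
    return np.abs(F_n @ np.sinc(x - n)) ** 2

I0 = intensity(0.0)           # the pixel sits on sample point n = 0

# Displacements in units of the sample spacing (1D stand-in for q_alpha).
deltas = np.array([0.0, 0.25, 0.5, 1.0, 2.5])
simulated = np.array([(I0 * intensity(d)).mean() for d in deltas])
predicted = mu ** 2 * (1.0 + np.sinc(deltas) ** 2)
```

In this toy model the correlation falls from about `2 * mu**2` at zero displacement towards `mu**2` once the displacement exceeds roughly one sample spacing, which mirrors the qualitative behaviour discussed above.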

## 4. Derivation of the standard deviation

The standard deviation of the correlation is calculated from

(4.1)

We again use equation (2.6) to choose a set of basis functions such that **q** is located at the position **q**_{0}. Therefore, we can write

(4.2)

Using the results from table 1, it can be shown that only terms with *m* = *s*, *n* = *t* or *m* = *t*, *n* = *s* are non-zero. When these restrictions have been applied, there are only six non-zero terms to evaluate

(4.3)

(4.4)

(4.5)

and

(4.6)

Taking the sum of these equations, we obtain the result

(4.7)

The standard deviation is

(4.8)

For the limiting case of perfect alignment, *α* = 0, we have . For the case of large *α*, we have . Both these limits agree with those given by Huldt *et al*. [6].
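The two limiting cases can be checked by hand under the Gaussian model of section 2. The following worked calculation is a sketch and not the derivation used in the source. The symbols *μ* (mean intensity), *C* (point-to-point correlation) and *I*′ (intensity after a large rotation) are introduced here for illustration. It assumes that the intensity is exponentially distributed, so that its *k*th moment is *k*! *μ*^{k}, and that *I* and *I*′ are independent at large *α*.

```latex
% Assumption: I = |F|^2 with F complex Gaussian, so I is exponentially
% distributed with mean \mu and moments <I^k> = k! \mu^k.
\begin{align*}
\alpha = 0:\quad
  & C = I^{2}, \quad \langle C \rangle = 2\mu^{2}, \quad
    \langle C^{2} \rangle = \langle I^{4} \rangle = 24\mu^{4}, \\
  & \sigma_{C} = \sqrt{24\mu^{4} - 4\mu^{4}} = \sqrt{20}\,\mu^{2}; \\
\text{large } \alpha:\quad
  & C = I\,I', \quad \langle C \rangle = \mu^{2}, \quad
    \langle C^{2} \rangle = \langle I^{2} \rangle^{2} = 4\mu^{4}, \\
  & \sigma_{C} = \sqrt{4\mu^{4} - \mu^{4}} = \sqrt{3}\,\mu^{2}.
\end{align*}
```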

## 5. Poisson noise

To determine the expected correlation when the measurement contains shot noise, we follow the method of Huldt *et al*. [6]. We define *P*(*K*, *I*) to be the probability of measuring a photon count *K* and having an expected intensity of *I*. The probability *P*(*K*, *I*) can be written as
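
$$P(K, I) = P(K|I)\, P(I) \tag{5.1}$$

where *P*(*K*|*I*) is the conditional probability of measuring *K* when the expected intensity is known to be *I*, and *P*(*I*) is the probability distribution of the expected intensity. The conditional probability is given by the Poisson distribution:

$$P(K|I) = \frac{I^{K} e^{-I}}{K!} \tag{5.2}$$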

This distribution can be used to show the following relations

(5.3)

and

(5.4)

A correlation of the photon count recorded at a detector pixel is given by

(5.5)

This shows that the mean correlation is not changed by the introduction of Poisson noise. The variance is calculated by a similar derivation and turns out to be

(5.6)

In the limit of large *α*, this agrees with Huldt *et al*. [6]. However, in that paper, there was a mistake for the *α* = 0 case. The correct limiting result is .
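The statement that shot noise leaves the mean correlation unchanged can be illustrated with a small simulation of the perfectly aligned case. The sketch below assumes an exponentially distributed expected intensity with mean `mu` (as implied by a complex Gaussian structure factor) and two measurements of the same intensity with independent Poisson noise. The variable names and the value of `mu` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 2.0                  # assumed mean expected intensity, photons per pixel
n_samples = 400_000

# Expected intensity at a pixel: exponential with mean mu.
I = rng.exponential(mu, n_samples)

# Two measurements of the same intensity with independent shot noise.
K1 = rng.poisson(I)
K2 = rng.poisson(I)

mean_K = K1.mean()                # compare with mu
corr_clean = (I * I).mean()       # noise-free correlation at alpha = 0
corr_noisy = (K1 * K2).mean()     # correlation of the noisy photon counts
var_K = K1.var()                  # compare with mu + mu**2
```

The noisy and noise-free mean correlations agree to within sampling error, while the variance of a single count exceeds the variance of the intensity by the Poisson term `mu`. Shot noise therefore broadens the distribution without shifting the mean correlation.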

## 6. Pattern-to-pattern correlations

The results presented so far refer to the correlation of two diffraction measurements at a point (i.e. a particular pixel on the detector). A correlation of two diffraction patterns involves taking the sum over all the pixels. For an arbitrary rotation of the molecule in three dimensions, the difference of the **q** coordinate of a pixel, **q**_{α}, will be different for each pixel. The mean and standard deviation of the pattern-to-pattern correlation involve an integral over the point-to-point correlation (equation (3.6)) with a distribution of values for **q**_{α} generated by the rotation. The intensity values on neighbouring pixels are not independent. As discussed in [6], the pattern-to-pattern correlation is calculated by scaling the mean pixel correlation by the number of independent variables needed to describe the intensity. The number of independent variables needed to describe the intensity is determined by sampling along the **a**\*, **b**\* and **c**\* directions at twice the rate used for the structure factor in equation (2.7).

A simple case occurs when the molecule only rotates around the beam axis, so that **q**_{α} is a function of resolution shell, but not the polar angular coordinate of the pixel. To provide a simple illustration of the results we have obtained, we show an example of this special case rather than show a general three-dimensional rotation. The number of independent sampling points in the resolution shell was set to 200, which corresponds to a resolution 32 times smaller than the particle, e.g. the diffraction from a 9.6 nm particle at 0.3 nm resolution. The mean correlation as a function of angle is plotted in figure 1. The range of correlation values within three standard deviations of the mean is also shown, indicating how the distribution narrows as *α* increases. As a consequence of this change of distribution, the mean correlation is not a perfect indicator of the most probable angle that gives rise to that correlation value. This can be seen in figure 2, which shows that the mean correlation of perfectly aligned particles, , can arise from a range of orientations with close to equal probability. The most likely angle is around 1/(4*qR*). Values of the correlation higher than the mean value are needed to ensure selecting the aligned case with the greatest probability.

## 7. Conclusion

By addressing the issue of continuity in statistical models, we aim to improve their suitability for the analysis of single particle diffraction, which is inherently continuous. We view statistical models as potentially complementary to numerical simulations in ongoing efforts to address the outstanding analysis issues for single particle imaging, particularly for low-signal, high-resolution applications to individual molecules. Although not all single-particle analysis algorithms are based on correlations, the methods we have presented could potentially be applied in future to Euclidean distances and likelihood functions to broaden the applicability of statistical models to Bayesian methods, graph-theoretic methods and manifold techniques.

## Acknowledgements

This research was supported by the Australian Research Council through its Centres of Excellence programme.

## Footnotes

One contribution of 27 to a Discussion Meeting Issue ‘Biology with free-electron X-ray lasers’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.