## Abstract

X-ray free-electron laser diffraction patterns from protein nanocrystals provide information on the diffracted amplitudes between the Bragg reflections, offering the possibility of direct phase retrieval without the use of ancillary experimental data. Proposals for implementing direct phase retrieval are reviewed. These approaches are limited by the signal-to-noise levels in the data and the presence of different and incomplete unit cells in the nanocrystals. The effects of low signal to noise can be ameliorated by appropriate selection of the intensity data samples that are used. The effects of incomplete unit cells may be small in some cases, and a unique solution is likely if there are four or fewer molecular orientations in the unit cell.

## 1. Introduction

Despite its enormous success, three problems that continue to plague protein X-ray crystallography are crystal preparation, radiation damage and phase determination. The first problem is a major bottleneck to structure determination, because many important biomacromolecules, such as membrane proteins, are extremely difficult to crystallize and in turn optimize for high-resolution X-ray diffraction [1]. Radiation dose presents a conundrum, because increasing the incident X-ray flux to obtain accurate high-resolution diffraction data comes with increased radiation damage that perturbs the high-resolution structure, thus counteracting the desired gain in resolution. While modern macromolecular phasing methods are very powerful, they generally depend either on pre-existing knowledge of the structure of a homologous molecule (molecular replacement) or on anomalous scattering experiments (single or multiple anomalous dispersion) to obtain initial phase estimates. However, molecular replacement can suffer from unacceptable model bias in the case of a new structure, and collection of sufficiently strong anomalous diffraction signals is not always straightforward.

The recent development of X-ray free-electron lasers (XFELs) has the potential to circumvent the first two problems [2,3]. First, the extremely high flux from these sources enables measurable diffraction data to be obtained from nanocrystals only a few unit cells across, which are generally easier to prepare than macroscopic crystals. Second, the duration of an XFEL pulse is so short (currently 10–100 fs) that diffraction patterns can be obtained before significant development of the damaging photoelectron cascade that culminates in destruction of the crystal takes place [4,5]. This approach is referred to in general as ‘femtosecond nanocrystallography’. A method to solve the phase problem from such data is still needed, however. Molecular replacement has recently been used to solve the structure of *Trypanosoma brucei* cysteine protease cathepsin B from XFEL data [6]. It is not clear whether the method of isomorphous replacement can be suitably adapted because of the experimental difficulty of obtaining isomorphous nanocrystals. A version of multiple-wavelength anomalous dispersion (MAD) phasing of XFEL nanocrystal data that incorporates the effects of ionization damage of heavy atoms in the presence of the high fluence XFEL pulse has been proposed [7], although its practical utility still requires experimental verification. Conventional single-wavelength anomalous dispersion (SAD) phasing has been successfully applied to XFEL data from a gadolinium-lysozyme complex [8]. Nevertheless, alternative methods of phasing are of significant interest.

Two alternative approaches for using XFELs for imaging molecules have been proposed. The first involves collecting diffraction data directly from individual single molecules and inverting them to generate a spatial image of the particles. A landmark experiment of this kind applied to a mimivirus was reported by Seibert *et al.* [9]. Although there are still practical difficulties surrounding this technique in terms of particle orientation determination and attainable signal levels, this method embodies the most direct spirit of diffraction imaging. The second approach uses many XFEL diffraction patterns from ensembles of a large number of identical randomly oriented single molecules (as in a solution), and the electron density for the constituent molecule of the ensemble is determined by analysis of the angular correlations of the intensity data using a method first proposed by Kam [10]. This approach has been investigated in detail [11,12], although it has so far been limited to low resolution and symmetric particles. A good review of the use of angular correlations in single-molecule imaging is given by Kirian [13]. These two approaches are not discussed further in this review, and we focus on nanocrystallography.

A typical nanocrystallography experiment proceeds as follows. First, fully hydrated nanocrystals are introduced into the pulsed XFEL beam via a liquid injection apparatus [14]. Diffraction patterns are recorded from individual nanocrystals of different sizes in random orientations as they travel across the beam [15]. Hundreds of thousands of diffraction patterns are collected in this way, and blank and unsuitable patterns (e.g. owing to high background, multiple crystals in the beam, pattern intensity too weak, artefacts, etc.) are removed using hit-finding software. The orientation of each remaining pattern is then determined by automated Miller indexing [16,17]. Finally, equivalent reflections are averaged over the ensemble of patterns in a process called Monte Carlo integration, which increases the signal-to-noise ratio (SNR) and also averages out the effects of a variable incident X-ray pulse flux, reflection partiality, and other unknown experimental variables, to produce a set of measured structure amplitudes.
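The final averaging step can be illustrated with a minimal sketch. It assumes the indexed spots are available as `((h, k, l), intensity)` pairs, which is a hypothetical input format chosen for illustration; merging of symmetry-equivalent indices and any scaling are omitted.

```python
from collections import defaultdict
from statistics import mean

def monte_carlo_integrate(observations):
    """Average the partial intensities recorded for each reflection.

    observations: iterable of ((h, k, l), intensity) pairs, one per indexed
    spot per pattern (hypothetical input format).
    Returns {(h, k, l): (mean_intensity, number_of_observations)}.
    """
    by_reflection = defaultdict(list)
    for hkl, intensity in observations:
        by_reflection[hkl].append(intensity)
    return {hkl: (mean(vals), len(vals)) for hkl, vals in by_reflection.items()}

# Toy data: the same reflection seen with different partialities and pulse fluxes.
observations = [
    ((1, 0, 0), 90.0), ((1, 0, 0), 110.0), ((1, 0, 0), 100.0),
    ((0, 2, 1), 12.0), ((0, 2, 1), 8.0),
]
merged = monte_carlo_integrate(observations)
```

The per-reflection count is returned alongside the mean because the precision of each averaged intensity improves with the number of patterns that contribute to it.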

A consequence of diffraction by very small crystals is that the diffraction pattern is continuous, and there is measurable diffraction between the Bragg peaks. It is well known that diffraction amplitude information between the Bragg reflections in principle allows the phase problem to be solved without any ancillary experimental data [18–21]. In this paper, we review proposals for direct phasing based on using the whole, continuous, XFEL diffraction dataset from nanocrystals, and discuss some of the issues involved.

## 2. Direct phasing of nanocrystal diffraction

Consider first an ensemble of nanocrystals, each made up of an integral number of a single kind of unit cell. Assuming that the crystal is well ordered and that there is a single definition of the unit cell contents (this is discussed further in §3), the ensemble-averaged continuous diffracted intensity $I(\mathbf{u})$, determined experimentally as described in the previous section, is given by [16,22]

$$I(\mathbf{u}) = |F(\mathbf{u})|^{2}\, Q^{2}(\mathbf{u}), \tag{2.1}$$

where $\mathbf{u}$ is position in reciprocal space and $F(\mathbf{u})$ denotes the Fourier transform of the contents of one unit cell. The quantity $Q^{2}(\mathbf{u})$ is referred to as the ‘averaged shape transform function’, and is the average of the squared modulus of the Fourier transforms of the shapes of the nanocrystals in the ensemble. Inspection of equation (2.1) shows that the data $I(\mathbf{u})$ inherently contain information on the continuous transform amplitude $|F(\mathbf{u})|$. Therefore, on the basis of well-known properties of multi-dimensional phase problems [19], if $|F(\mathbf{u})|$ can be extracted from $I(\mathbf{u})$, then it should be possible to reconstruct the electron density in the unit cell, denoted by $f(\mathbf{x})$, without any experimental phase information, using standard phase retrieval techniques [23,24]. This is the basis of proposals for direct phasing in nanocrystallography. If uncorrelated lattice disorder is present in the nanocrystals, then the effect can be incorporated into $Q^{2}(\mathbf{u})$ and equation (2.1) still applies. If correlated, or cumulative, lattice disorder is present or there is molecular disorder on the crystal surface, then equation (2.1) is not satisfied exactly, and a somewhat different approach to phasing has been described in this case [25]. However, at least in some cases so far observed, the presence of strong interference fringes in diffraction patterns indicates that the effects of disorder are small [15].
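The structure of equation (2.1) can be made concrete with a one-dimensional sketch. The assumptions here are illustrative and not taken from the source: crystals are rows of $N$ unit cells, $u$ is measured in reciprocal lattice units, and the function names are hypothetical.

```python
import math

def shape_transform_sq(n_cells, u):
    """Squared shape transform of a 1D crystal of n_cells unit cells."""
    s = math.sin(math.pi * u)
    if abs(s) < 1e-12:          # at a reciprocal lattice point the limit is N^2
        return float(n_cells ** 2)
    return (math.sin(math.pi * n_cells * u) / s) ** 2

def averaged_shape_transform(sizes, u):
    """Q^2(u): average of the squared shape transforms over the ensemble."""
    return sum(shape_transform_sq(n, u) for n in sizes) / len(sizes)

def averaged_intensity(molecular_intensity, sizes, u):
    """Equation (2.1): I(u) = |F(u)|^2 Q^2(u).

    molecular_intensity is a callable returning |F(u)|^2.
    """
    return molecular_intensity(u) * averaged_shape_transform(sizes, u)

sizes = [3, 4, 5]
q_peak = averaged_shape_transform(sizes, 1.0)   # (9 + 16 + 25) / 3
q_mid = averaged_shape_transform(sizes, 0.5)    # odd N contribute 1, even N contribute 0
```

In this toy ensemble the averaged shape transform is periodic in $u$, large at the reciprocal lattice points and small midway between them, which is the behaviour the phasing proposals have to contend with.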

Two approaches have been proposed for direct phasing based on equation (2.1). The first approach, proposed by Spence *et al*. [26], consists of first estimating $Q^{2}(\mathbf{u})$ from the diffraction data (this is necessary because it depends on the crystal size distribution and is therefore unknown *a priori*). The estimate of $Q^{2}(\mathbf{u})$ is obtained by partitioning the averaged diffracted intensity into fixed regions around each Bragg peak (Wigner–Seitz cells) and then averaging those regions together. Because $Q^{2}(\mathbf{u})$ is periodic, performing this average yields one period of the averaged shape transform, which can then be copied and translated periodically throughout the reciprocal lattice to generate an estimate of $Q^{2}(\mathbf{u})$, denoted here by $\hat{Q}^{2}(\mathbf{u})$. This approach is effective, because $|F(\mathbf{u})|^{2}$ is uncorrelated with $Q^{2}(\mathbf{u})$ [22,26]. Once $\hat{Q}^{2}(\mathbf{u})$ is obtained, $|F(\mathbf{u})|^{2}$ is estimated as

$$|\hat{F}(\mathbf{u})|^{2} = \frac{I(\mathbf{u})}{\hat{Q}^{2}(\mathbf{u})}, \tag{2.2}$$

and can be phased by the usual iterative phasing techniques. Spence *et al.* [26] conducted simulations with realistic experimental conditions and demonstrated that with enough diffraction patterns, the amplitude of the continuous molecular transform could be recovered by this technique and the amplitude subsequently phased.
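The averaging and division steps can be sketched in one dimension as follows. This is an illustration under stated assumptions rather than the implementation of [26]: the intensity is sampled on a uniform grid that starts on a Bragg position, one period of the grid stands in for a Wigner–Seitz cell, and the function names and toy numbers are hypothetical.

```python
def estimate_shape_transform(intensity, samples_per_cell):
    """Average one period of I(u) over all reciprocal lattice cells.

    intensity: list of I(u) on a uniform grid with samples_per_cell samples
    per reciprocal lattice spacing, starting on a Bragg position.
    """
    n_cells = len(intensity) // samples_per_cell
    return [
        sum(intensity[c * samples_per_cell + o] for c in range(n_cells)) / n_cells
        for o in range(samples_per_cell)
    ]

def recover_molecular_intensity(intensity, samples_per_cell):
    """Equation (2.2): divide I(u) by the periodically replicated estimate."""
    q_hat = estimate_shape_transform(intensity, samples_per_cell)
    return [value / q_hat[n % samples_per_cell] for n, value in enumerate(intensity)]

# Toy data: three cells, four samples per cell.
q_period = [16.0, 2.0, 0.5, 2.0]                       # one period of Q^2(u)
f_sq = [1.0, 2.0, 3.0, 2.0,  2.0, 2.0, 2.0, 1.0,  3.0, 2.0, 1.0, 3.0]
intensity = [f * q_period[n % 4] for n, f in enumerate(f_sq)]
recovered = recover_molecular_intensity(intensity, 4)
```

In this toy case the molecular intensity has the same mean at every offset within the cell, so the estimate is exactly proportional to the true shape transform and `recovered` equals `f_sq` up to a constant factor. When that condition holds only approximately, the recovered transform carries a slowly varying periodic error.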

Although straightforward in principle, the difficulty in evaluating $|\hat{F}(\mathbf{u})|^{2}$ via equation (2.2) is that $Q^{2}(\mathbf{u})$ is sharply peaked at the reciprocal lattice points and small between them, so that the division in equation (2.2) amplifies errors in the measurements when estimating $|F(\mathbf{u})|^{2}$ between the reciprocal lattice points. Note that smaller nanocrystals give broader peaks in $Q^{2}(\mathbf{u})$, reducing the noise amplification resulting from the division, but for a fixed incident X-ray pulse flux, smaller nanocrystals diffract more weakly, reducing the inherent overall SNR in the data.
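A rough sense of the size of this amplification follows from a one-dimensional crystal of $N$ unit cells, with $u$ in reciprocal lattice units. This is an assumed simplification for illustration; real crystals are three-dimensional and have a distribution of sizes. The squared shape transform is then

$$|S_N(u)|^{2} = \frac{\sin^{2}(\pi N u)}{\sin^{2}(\pi u)}, \qquad |S_N(0)|^{2} = N^{2}, \qquad |S_N(1/2)|^{2} = \sin^{2}(\pi N/2) \le 1 .$$

Dividing by this function therefore scales a measurement error midway between reciprocal lattice points by a factor of order $N^{2}$ or more relative to the same error at a lattice point, i.e. at least about 100 for $N = 10$ in this one-dimensional picture.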

The second approach to direct phasing of nanocrystal diffraction data has been proposed by Elser [27]. In order to avoid use of the weak diffraction between the Bragg reflections, and also to use the diffraction data directly rather than first estimating the continuous molecular transform, Elser uses the amplitudes of the Bragg reflections and supplements them by estimates of the gradients of the diffracted intensity at the Bragg positions. For small crystals, the effect of a gradient in the molecular transform is to introduce a small shift in the position of an intensity peak away from the Bragg position. The required gradient is estimated by first estimating the shape transform function as described above, and then fitting the measured intensity to a Taylor series expansion around the Bragg position which allows the intensity and its gradient at the Bragg position to be estimated. Note that because the crystallographic phase problem is underconstrained by a factor of two [28,29], the fourfold increase in the number of data when supplemented by the three gradients at each Bragg reflection renders the phase problem overconstrained by a factor of two. Elser then develops an iterative projection algorithm that incorporates projections onto the amplitude and gradient data. He also addresses the problem associated with the lack of a unique origin for the molecule and the consequential wrapping of the electron density in the unit cell. Simulations at low signal to noise show that this approach has potential.
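The link between the gradient of the molecular transform and the peak shift can be seen from a first-order expansion of equation (2.1) about a Bragg position $\mathbf{h}$. The following is an illustrative sketch, not the formulation of [27]; it assumes a Gaussian approximation to the shape transform peak with width $\sigma$, and the symbols $A$, $\mathbf{g}$, $\boldsymbol{\delta}$ and $\sigma$ are introduced here only for this purpose.

$$I(\mathbf{h}+\boldsymbol{\delta}) \approx \left[A + \mathbf{g}\cdot\boldsymbol{\delta}\right] Q^{2}(\boldsymbol{\delta}), \qquad A = |F(\mathbf{h})|^{2}, \quad \mathbf{g} = \nabla |F(\mathbf{u})|^{2}\big|_{\mathbf{u}=\mathbf{h}}, \qquad Q^{2}(\boldsymbol{\delta}) \approx Q^{2}(\mathbf{0})\, e^{-|\boldsymbol{\delta}|^{2}/2\sigma^{2}} .$$

Setting the gradient of $I$ with respect to $\boldsymbol{\delta}$ to zero gives $\mathbf{g} = (A + \mathbf{g}\cdot\boldsymbol{\delta})\,\boldsymbol{\delta}/\sigma^{2}$, so that to first order the intensity maximum is displaced from the Bragg position by

$$\boldsymbol{\delta}^{*} \approx \sigma^{2}\, \frac{\mathbf{g}}{A} .$$

Under these assumptions the shift grows with the width of the shape transform peak, i.e. it is larger for smaller crystals, which is consistent with the statement above.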

Returning to the proposal of Spence *et al.* [26], the SNR of the data used for phasing has a wide dynamic range as a function of position in reciprocal space, being largest close to the Bragg reflections and smallest midway between them. As noted above, oversampling the Bragg amplitudes by a factor greater than two is sufficient to render the phase problem unique. Therefore, it is not necessary to sample the continuous intensity finely between the Bragg reflections. The effects of the small and variable SNR can be reduced by using only a subset of the available amplitude data, avoiding those where the SNR is low and allowing them to float during the phase retrieval process. The additional data should therefore be close to the Bragg reflections where the SNR is largest, the only caveat being that they are not too close as in that case the additional information provided becomes very sensitive to noise. This is analogous to sampling theory where interpolation is possible in theory as long as the average sample density satisfies the Nyquist criterion, but it becomes increasingly sensitive to noise as the samples become more bunched together.

We have conducted simulations to evaluate the effect of using such a sampling scheme [30]. A three-dimensional section of the electron density of the membrane protein aquaporin 1 (AQP1) [31] was used for the simulations. Reciprocal space was oversampled by a factor of three in each dimension. Random-sized cuboid crystals were considered with a Gaussian edge-length distribution and a standard deviation equal to one-third of the mean number of unit cells on each edge. We set this mean to be the same in each direction in our simulation and denote it by $\mu_N$. The diffracted intensity from the crystal ensemble was calculated, Poisson noise added, and the noisy intensities divided through by the averaged shape transform to generate an estimate of the molecular transform intensity to be used for phasing. Phase retrieval was conducted using the difference map algorithm [32] with a support constraint in real space. Results are shown here for two sampling schemes, one using all of the oversampled data, with an oversampling factor of 27 relative to the Bragg samples, labelled here as sampling scheme A, and one using only the Bragg samples and their nearest neighbours, labelled here as scheme B (oversampling factor of 7). We use two SNRs to describe the noise levels in the experiment [22]. The first, denoted $\mathrm{SNR}_M$, is the average SNR of the measured intensities after Monte Carlo integration. The second, denoted $\mathrm{SNR}_P$, is the SNR of the intensities used for phasing, i.e. the measured intensities after division by the averaged shape transform function as per equation (2.2). The ratio $\mathrm{SNR}_P/\mathrm{SNR}_M$ therefore measures the reduction in the SNR as a result of dividing through by the averaged shape transform function [22] and is shown versus mean crystal size in figure 1*a*. This shows the effect of an increasing noise amplification with increasing crystallite size and the reduced noise amplification for sampling scheme B over that for scheme A. Examples of reconstructions for an average SNR at the detector of 20 and $\mu_N = 10$ are shown in figure 2 [30]. The improved reconstruction provided by sampling scheme B over sampling scheme A is evident. Further improvements may be possible using a noise-tolerant version of the phase retrieval algorithm [33].

The above-described simulations were repeated for a variety of SNRs at the detector and mean crystallite sizes. The results are summarized in figure 1*b* which shows the root-mean-squared (RMS) error in the reconstructed electron density using the two sampling schemes, versus noise-to-signal ratio (NSR = 1/SNR) and mean crystallite size [30]. The improved performance of the selective sampling scheme is evident here also in that, for a fixed crystal size, the same RMS error in the reconstruction is attained at a higher noise level at the detector for sampling scheme B than for sampling scheme A. As noted above, for a fixed incident X-ray pulse flux, the diffracted intensity is larger for larger crystals, and so is the SNR, so that, in this case, the performance for larger crystals would be improved.
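The two sampling schemes can be written down explicitly. The sketch below is illustrative only: it assumes an odd oversampling factor with Bragg positions at grid indices that are multiples of that factor, and the function names are hypothetical rather than taken from [30].

```python
from itertools import product

def scheme_offsets(oversampling, scheme):
    """Offsets (in grid steps) of the samples kept around each Bragg position.

    Scheme "A" keeps every sample of the oversampled grid; scheme "B" keeps
    the Bragg sample and its nearest neighbour on each side along each axis.
    Assumes an odd oversampling factor.
    """
    half = oversampling // 2
    all_offsets = list(product(range(-half, half + 1), repeat=3))
    if scheme == "A":
        return all_offsets
    if scheme == "B":
        return [o for o in all_offsets if sum(abs(c) for c in o) <= 1]
    raise ValueError(f"unknown scheme: {scheme!r}")

def is_sampled(index, oversampling, scheme):
    """True if grid point index = (i, j, k) is used as data in the given scheme."""
    half = oversampling // 2
    offset = tuple((i + half) % oversampling - half for i in index)
    return offset in scheme_offsets(oversampling, scheme)

samples_a = len(scheme_offsets(3, "A"))   # all samples in each 3 x 3 x 3 block
samples_b = len(scheme_offsets(3, "B"))   # Bragg sample plus six nearest neighbours
```

Grid points for which `is_sampled` is false would be left unconstrained (allowed to float) during phase retrieval, which is how the low-SNR samples are avoided in scheme B.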

## 3. Effects of variable and incomplete unit cells

The form of equation (2.1), on which the proposals for direct phasing described above are based, assumes that each crystal can be described as a single unit cell, with Fourier transform *F*(**u**), that is replicated by a finite crystal lattice. This is a good approximation in the limit of large crystals, but is not the case for a small crystal that has more than one molecule per unit cell. This can be seen in figure 3, which shows three possible crystals for the case of four molecules in the unit cell. The crystals shown in figure 3*a* and figure 3*b* are made of complete unit cells, although the unit cells are different, and the crystal shown in figure 3*c* is made up of complete unit cells as well as some incomplete unit cells on the surface that contain fewer than four molecules. In general, crystals of these kinds are to be expected, with a variety of kinds of complete unit cell and a variety of kinds of incomplete unit cell on the surface. The crystal ensemble will then consist of crystals of these kinds, and equation (2.1) will not apply. Clearly, this will have an effect on the proposals for direct phase retrieval described in the previous section.
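The consequence of having different complete unit cells can be checked numerically with a one-dimensional toy model. The assumptions are illustrative: molecules are identical point scatterers, positions are fractional coordinates, and the two cell definitions below differ only in which neighbouring copy of the first molecule is grouped with the second.

```python
import cmath

def cell_intensity(positions, u):
    """|F(u)|^2 for identical point scatterers at the given 1D positions."""
    f = sum(cmath.exp(2j * cmath.pi * u * x) for x in positions)
    return abs(f) ** 2

cell_a = [0.1, 0.4]   # molecules 1 and 2 within the same lattice period
cell_b = [0.4, 1.1]   # molecule 2 grouped with the next copy of molecule 1

bragg = (cell_intensity(cell_a, 1.0), cell_intensity(cell_b, 1.0))
between = (cell_intensity(cell_a, 0.5), cell_intensity(cell_b, 0.5))
```

The two definitions give the same intensity at the Bragg position (`bragg`) but different intensities midway between Bragg positions (`between`), which is why the choice of unit cell matters only once the continuous transform is measured.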

These characteristics of nanocrystals have been recognized recently and their implications for direct phasing have been the subject of a number of studies [34–36]. We described a nanocrystal as a sum of two parts in [34]: a complete unit cell part and an incomplete unit cell (surface) part. We showed that the average diffraction from an ensemble of such crystals consists of three components. The strongest component is the usual Bragg diffraction from the complete unit cell parts of the crystals, but it is modulated by the average of the intensities diffracted by the different kinds of complete unit cell. The other two components are due to interference effects with the surface (incomplete) unit cells. These two components are Bragg-like, with shape transforms similar to, but different from, that for the Bragg component, and are weaker than the Bragg component. We argued that the diffraction can be approximately described in a form similar to equation (2.1), which implies that it may be possible to extract a sufficiently accurate averaged molecular transform even in the presence of incomplete unit cells. Simulations of diffraction by ensembles of two-dimensional nanocrystals generated via a random aggregation process similar to the Eden growth model [37] support our theoretical results.

Liu *et al.* [35] conducted simulations of diffraction by two-dimensional nanocrystals with two molecules in the unit cell and with random edge terminations by incomplete unit cells. They showed that even with incomplete unit cells, a good shape transform function could be estimated, and division by this gave a good reconstruction of the molecular transform amplitude if one unit cell type dominated the ensemble, or a transform resembling the average molecular transform amplitude if there was no dominant unit cell type in the ensemble. They also applied phase retrieval to these transform amplitudes and obtained good reconstructions in the case that one kind of unit cell dominates. They showed, as expected, that a large dynamic range of the measurements is required to obtain good results.

Kirian *et al.* [36] describe nanocrystals with incomplete unit cells in a slightly different manner as, in effect, a set of interdigitated crystals, each containing one molecule in the unit cell, but having slightly different boundaries in order to produce incomplete unit cells on the surface. This leads to an expression for the averaged-diffracted intensity that contains a term for each single-molecule crystal, and an additional term that is related to interference between the single-molecule crystals. This formalism is complementary to that of [34] and may be useful. They calculated diffraction patterns from randomly terminated two-dimensional crystals with two molecules in the unit cell, averaged the reflection profiles to estimate the shape transform, divided the diffraction pattern by this estimate, and obtained a rather good estimate of the averaged molecular transform, indicating that the effect of the incomplete unit cells on the surface is rather small. They then phased the averaged molecular transform by reconstructing a single molecule (asymmetric unit) by projecting onto the averaged intensity using a variation of the method of Elser & Millane [28] for reconstruction from averaged diffraction intensities. The approach was successful in simulations and appears to be promising.

Further insights into the problem can be obtained as follows. Returning to the formulation of [34], the *i*th crystal in the ensemble is represented as the sum of a complete unit cell part and an incomplete unit cell part, i.e.

$$g_i(\mathbf{x}) = f_j(\mathbf{x}) \otimes l_{ij}(\mathbf{x}) + p_{ij}(\mathbf{x}), \qquad j = 1, \ldots, M, \tag{3.1}$$

where $g_i(\mathbf{x})$ is the electron density of the *i*th nanocrystal, the $f_j(\mathbf{x})$ are the electron densities of $M$ different kinds of complete unit cell, $l_{ij}(\mathbf{x})$ and $p_{ij}(\mathbf{x})$ are the corresponding finite lattice and incomplete unit cell part, respectively, and $\otimes$ denotes convolution. Equation (3.1) emphasizes that the assignment of a complete unit cell for the complete unit cell part of the nanocrystal is not uniquely defined, i.e. equation (3.1) is valid for each of the different values of $j$. The ensemble-averaged diffracted intensity can then be written as the sum of three components [34]:

$$I(\mathbf{u}) = I^{B}(\mathbf{u}) + I^{B2}(\mathbf{u}) + I^{B3}(\mathbf{u}) = \langle |F(\mathbf{u})|^{2} \rangle\, Q_{B}^{2}(\mathbf{u}) + H_{B2}(\mathbf{u})\, Q_{B2}^{2}(\mathbf{u}) + H_{B3}(\mathbf{u})\, Q_{B3}^{2}(\mathbf{u}), \tag{3.2}$$

which represent the interference effects within the complete unit cell part ($I^{B}(\mathbf{u})$), between the complete and incomplete unit cell parts ($I^{B2}(\mathbf{u})$), and within the incomplete unit cell part ($I^{B3}(\mathbf{u})$). The term $\langle |F(\mathbf{u})|^{2} \rangle$ is the average over the different kinds of complete unit cell that occur in the crystal ensemble and $Q_{B}^{2}(\mathbf{u})$ is the averaged shape transform of the complete unit cell parts in the crystal ensemble. The quantities $H_{B2}(\mathbf{u})$ and $H_{B3}(\mathbf{u})$ are functions of the transforms of the different kinds of complete and incomplete unit cells that occur, and $Q_{B2}^{2}(\mathbf{u})$ and $Q_{B3}^{2}(\mathbf{u})$ are their corresponding shape transform functions. For more details, the reader is referred to [34]. If $I^{B2}(\mathbf{u}) + I^{B3}(\mathbf{u})$ is small compared with $I^{B}(\mathbf{u})$ then we have that

$$I(\mathbf{u}) \approx \langle |F(\mathbf{u})|^{2} \rangle\, Q_{B}^{2}(\mathbf{u}), \tag{3.3}$$

and the method of [26] may still recover a reasonable estimate of $\langle |F(\mathbf{u})|^{2} \rangle$ from $I(\mathbf{u})$. The results of [36] suggest that, at least in some cases, this may be the case.

All proposals so far for direct phasing with more than one molecule in the unit cell involve first estimating $\langle |F(\mathbf{u})|^{2} \rangle$ from the diffraction data. However, in view of the non-uniqueness implied by equation (3.1), the interpretation of $\langle |F(\mathbf{u})|^{2} \rangle$ is not well defined, because there is not a unique choice of the set of unit cell transforms to be included in the average $\langle |F(\mathbf{u})|^{2} \rangle$. An immediate question therefore is: what is the best choice of the set $\{F_j(\mathbf{u})\}$, where $F_j(\mathbf{u})$ is the Fourier transform of $f_j(\mathbf{x})$, or equivalently the set $\{f_j(\mathbf{x})\}$? Given the comments above, it is clear that the best separation in equation (3.1) is one that minimizes $I^{B2}(\mathbf{u}) + I^{B3}(\mathbf{u})$ relative to $I^{B}(\mathbf{u})$ in equation (3.2). For an ensemble of crystals, this objective is achieved by choosing the set $\{f_j(\mathbf{x})\}$ that minimizes $\sum_{i}\sum_{j} \|p_{ij}(\mathbf{x})\|$, where $\|\cdot\|$ denotes the two-norm. In general, feasible unit cells will be those that can produce a physically viable crystal surface. This will depend on the particular case at hand, which will depend on the space group and the particular arrangement of molecules and molecular contacts. In some cases, this may correspond to a choice for $\{f_j(\mathbf{x})\}$ that consists of the set of unit cells that have the same size and shape as the unit cell of the underlying infinite crystal lattice, and are all possible cyclic shifts of each other that contain full molecules.

Once the average $\langle |F(\mathbf{u})|^{2} \rangle$ is obtained, the problem is to reconstruct the contents of *one* of the possible unit cells, or equivalently one molecule (the asymmetric unit). The results of [36] show that this is feasible, at least for the case of two molecules in the unit cell. The averaged data $\langle |F(\mathbf{u})|^{2} \rangle$ imply a loss of information relative to the data from a single unit cell, $|F(\mathbf{u})|^{2}$. The situation is similar to that considered by Elser & Millane [28] in which one measures the diffracted intensity of a single object, averaged over a set of $M$ symmetry operators in reciprocal space. They showed that under general conditions, the object could be reconstructed for $M \le 4$, although $M = 4$ is the marginal case. This result is a consequence of the general three-dimensional phase problem for the continuous Fourier amplitude being overconstrained by a factor of 4. In the case at hand, we have the average of the diffracted intensities over a set of $M$ objects (the set of $M$ unit cells $\{f_j(\mathbf{x})\}$ as defined above) in real space. The same argument as above applies to this case, so that unique inversion of $\langle |F(\mathbf{u})|^{2} \rangle$ for the molecule is expected if $M \le 4$. Because the molecules in the unit cell are identical and related by space group symmetry, we can write the contents for, say, the first unit cell definition as

$$f_1(\mathbf{x}) = \sum_{k=1}^{M_r} \sum_{l=1}^{M_t} \rho(\mathbf{R}_k \mathbf{x} + \mathbf{b}_{kl}), \tag{3.4}$$

where $\rho(\mathbf{x})$ is the electron density of one molecule in the unit cell, $\mathbf{R}_k$ and $\mathbf{b}_{kl}$ are rotation and translation operators, respectively, and we have assumed that the space group symmetry consists of $M_r$ rotations of the molecule, and for each rotation, there are $M_t$ translations (other cases are easily accommodated and give the same result). Using the choice of unit cell definitions $\{f_j(\mathbf{x})\}$ as described above, the electron densities for the other unit cells can then be written as

$$f_j(\mathbf{x}) = \sum_{k=1}^{M_r} \sum_{l=1}^{M_t} \rho(\mathbf{R}_k \mathbf{x} + \mathbf{b}_{kl} + \mathbf{c}_{jkl}), \tag{3.5}$$

where $\mathbf{c}_{jkl}$ are the appropriate translations (with $\mathbf{c}_{1kl} = \mathbf{0}$).

Taking the $\mathbf{R}_k$ to be orthogonal matrices, adopting the convention $P(\mathbf{u}) = \int \rho(\mathbf{x}) \exp(i 2\pi \mathbf{u}\cdot\mathbf{x})\, d\mathbf{x}$, and weighting the $M$ unit cell definitions equally, evaluating $\langle |F(\mathbf{u})|^{2} \rangle$ shows that it is then given by

$$\langle |F(\mathbf{u})|^{2} \rangle = \frac{1}{M} \sum_{j=1}^{M} \left| \sum_{k=1}^{M_r} P(\mathbf{R}_k \mathbf{u})\, T_{jk}(\mathbf{u}) \right|^{2}, \tag{3.6}$$

where $P(\mathbf{u})$ is the Fourier transform of $\rho(\mathbf{x})$, and

$$T_{jk}(\mathbf{u}) = \sum_{l=1}^{M_t} \exp\left[ -i 2\pi\, (\mathbf{R}_k \mathbf{u}) \cdot (\mathbf{b}_{kl} + \mathbf{c}_{jkl}) \right]. \tag{3.7}$$

Inspection of equation (3.6) shows that there are only $M_r$ distinct object (molecular) transforms contained in $\langle |F(\mathbf{u})|^{2} \rangle$, i.e. the translations enter only through $T_{jk}(\mathbf{u})$, which is independent of $P(\mathbf{u})$. We therefore conclude that recovering $\rho(\mathbf{x})$ from $\langle |F(\mathbf{u})|^{2} \rangle$ should be feasible as long as the number of distinct rotations of the molecule in the space group is less than or equal to 4, i.e. if $M_r < M$ then the requirement $M \le 4$ can be relaxed to $M_r \le 4$.
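The counting behind this condition can be expressed as a small helper. This is a heuristic sketch, not a proof of uniqueness: it assumes the usual estimate that the continuous phase problem for a single object in $d$ dimensions is overconstrained by a factor $2^{d}/2$ (4 in three dimensions, as stated above), and that averaging over several distinct molecular transforms divides this ratio by their number. The function names are hypothetical.

```python
def constraint_ratio(n_dims, n_averaged):
    """Heuristic data-to-unknown ratio for phasing an averaged continuous intensity."""
    single_object_ratio = 2 ** n_dims / 2   # 4 in three dimensions
    return single_object_ratio / n_averaged

def unique_solution_expected(n_rotations, n_dims=3):
    """True if the ratio is at least 1; a ratio of exactly 1 is the marginal case."""
    return constraint_ratio(n_dims, n_rotations) >= 1.0

ratio_single = constraint_ratio(3, 1)     # one molecular orientation
ratio_marginal = constraint_ratio(3, 4)   # four distinct orientations
```

The argument passed as `n_rotations` is the number of distinct molecular orientations rather than the number of molecules in the unit cell, which reflects the relaxation of the condition derived above.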

## 4. Summary

Nanocrystallography using XFELs offers the possibility of direct phasing of the diffraction data using measurements of the diffracted intensities between the Bragg reflections. There have been two proposals to effect this: one that first estimates the molecular transform, followed by conventional phase retrieval, and one that works directly with the Bragg amplitudes and estimates of the gradients of the diffracted intensity. Both methods are sensitive to noise, however. For the first approach, using samples of the continuous transform that have a higher SNR, while retaining enough samples to keep the phase retrieval problem well determined, allows the effects of the low SNR to be reduced. Protein nanocrystals in non-trivial space groups with more than one molecule in the unit cell will generally crystallize with unit cells with different contents, and with incomplete unit cells on their surface. The diffraction from an ensemble of such crystals depends on the average over the different possible unit cells and contains additional terms that depend on the incomplete cell surface structure. Theoretical and numerical studies of this problem by various groups so far indicate that the effect of the incomplete unit cell surface may be small, and that image reconstruction is feasible in the presence of multiple kinds of unit cell. A unique solution is likely if there are four or fewer different molecular orientations in the unit cell.

## Funding statement

This work was supported by a James Cook Research Fellowship to R.P.M., and a UC Doctoral Scholarship and an R.H.T. Bates Postgraduate Scholarship to J.P.J.C.

## Acknowledgements

The authors thank John Spence, Rick Kirian, Richard Bean, Ken Beyerlein and Andrew Martin for helpful discussions.

## Footnotes

One contribution of 27 to a Discussion Meeting Issue ‘Biology with free-electron X-ray lasers’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.