Short-form publications such as Plant Disease reports serve essential functions: the rapid dissemination of information on the geography of established plant pathogens, incidence and symptomology of pathogens in new hosts, and the discovery of novel pathogens. Many of these sentinel publications include viral sequence data, but most use that information only to confirm the virus' species. When researchers use the standard technique of per cent nucleotide identity to determine that the new sequence is closely related to another sequence, potentially erroneous conclusions can be drawn from the results. Multiple introductions of the same pathogen into a country are being ignored because researchers know fast-evolving plant viruses can accumulate substantial sequence divergence over time, even from a single introduction. An increased use of phylogenetic methods in short-form publications could speed our understanding of these cryptic second introductions and aid in control of epidemics.
In the mid-1990s, the emerging and damaging tomato yellow leaf curl virus (TYLCV) was found in tomato plants in the Caribbean. This Old World virus had never before been seen in the New World and quickly spread to other Caribbean islands, to eastern Mexico and to Florida. From Florida it spread north and west, reaching Alabama in 2005 (Akad et al. 2007) and Texas in 2006 (Isakeit et al. 2007). At the same time, tomato plants in western Mexico were succumbing to TYLCV (Brown & Idris 2006). As the western Mexican sequences were 98 per cent identical to those from the eastern Caribbean, the infections in western Mexico were thought to be an extension of the initial introduction (Idris et al. 2007). Despite this substantial per cent nucleotide identity (PNI), subsequent phylogenetic analysis revealed that these western Mexican sequences were more closely related (greater than 99% identical) to TYLCV isolates from Asia than to other North American isolates and represented a second introduction of this exotic virus into North America (Duffy & Holmes 2007).
If plant viruses were typically introduced into new locations once and only once, then the different, more complete perspective that phylogenetics provided for TYLCV infections in the New World would be interesting, but of ultimately limited value. We know, however, that multiple introductions are a frequent occurrence. In the case of TYLCV in the New World, a third introduction, of a mild TYLCV isolate into Venezuela, has also been documented (Zambrano et al. 2007).
The possibility of overlooking a second or third introduction of the virus into a country or area stems from the temptation to compare the newly obtained sequence to those of viruses previously sequenced from the same country or from nearby countries. However, infected plant material and disease vectors are inadvertently traded around the world and phylogenies of genes or genomes of individual species of plant viruses often reveal that viruses from distant geographical areas are closely related to one another. For example, outbreaks of iris yellow spot virus in onions from the American state of Georgia are very closely related to strains circulating in Peru (Nischwitz et al. 2007). Some pathogens move frequently and are repeatedly re-introduced to certain geographical areas. For instance, cassava mosaic begomoviruses have migrated from eastern Africa to western Africa at least twice (Ndunguru et al. 2005), while the maize-adapted maize streak virus A has frequently moved around Africa (Varsani et al. 2009). Although incomplete sampling of the diversity of plant viruses hampers definitive source tracking (Moury et al. 2006), attempting to find the origin of novel viral sequences can become a useful standard in the field. Placing novel viral sequences into their appropriate phylogeographical context can identify infection source countries, help trace back how the pathogens broached agricultural security measures and give the phytopathology community the most complete picture of each novel viral sequence.
2. Material and methods
For each analysis, sequences were obtained from GenBank and aligned and trimmed manually with Se-Al v2.0a11 (http://tree.bio.ed.ac.uk/software). No statistically significant recombination breakpoints were detected by more than two of the following algorithms as implemented in RDP 3.15 (Martin et al. 2005): RDP, GENECONV, Bootscan, MaxChi, Chimaera, SiScan and 3Seq. Therefore, recombination was not considered in further phylogenetic analyses. Nucleotide substitution models were selected by Akaike's information criterion (AIC) with ModelTest 3.7 (Posada & Crandall 1998). Maximum-likelihood phylogenetic analyses were performed with PAUP* 4.0 beta (Swofford 2003) and bootstrapped with 1000 replicates. Trees were manipulated with FigTree v.1.2.3 (http://tree.bio.ed.ac.uk/software/figtree/), midpoint rooted for clarity and presented with branch lengths scaled to the numbers of substitutions/site.
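The model-selection step can be illustrated with a minimal sketch of the AIC calculation, AIC = 2k − 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The log-likelihood values below are invented for illustration and are not results from this study; the parameter counts assume that only substitution-model parameters are counted, which may differ from how a given program counts them.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, free parameters).
# The numbers are made up to show the trade-off between fit and complexity.
candidates = {
    "JC69": (-2500.0, 0),
    "HKY85": (-2450.0, 4),
    "GTR": (-2448.5, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("selected model:", best_model)
```

In this toy case the most parameter-rich model fits slightly better but is penalized for its extra parameters, so the intermediate model is selected.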
3. Results and discussion
A survey of several journals that publish short-form reports on new plant diseases or established diseases in new plants or locations revealed that phylogenetic methods are rarely used when describing a novel viral sequence (table 1). Only 3.6 per cent of the more than 200 viral reports published in three journals (BSPP's New Disease Reports, APS' Plant Disease Reports and Plant Health Progress) contained a phylogenetic tree that a reader could look at and evaluate. Far more popular was reporting the PNI to other strains of the virus. Sometimes, the PNI was explicitly aided by NCBI's BLAST (Basic Local Alignment Search Tool), which was used to identify closely related isolates in GenBank for comparison. Often, however, PNI was calculated relative to isolates without any rationale for why the specific isolates were selected. A third category was needed for reports that communicated that they had created a phylogenetic tree, but did not provide the tree to the reader (though presumably they would, upon individual request). Most of these reports aimed solely to communicate that a virus had been found in a new plant or place, not to say anything about its biogeography, nor assert where the infection had come from. For mere pathogen identification, PNI is adequate.
(a) Per cent nucleotide identity: good, and sometimes good enough
If the only goal is determining what virus is present in a diseased plant, then obtaining a sequence with species-specific primers and confirming that it is very similar to known isolates of a virus is sufficient, and is more sensitive than serological techniques (Schneider et al. 2004). The vast majority of short-form plant virus reports use sequence data in this way. In fact, some reports do not mention exact PNI values because the authors felt it was sufficient to mention that the sequences were highly similar to one another.
Importantly, many virus families use a threshold PNI to determine if a novel viral sequence represents a new species (determined and revised by the International Committee for Taxonomy of Viruses (Fauquet et al. 2005)). For instance, begomoviruses are of the same species if they are more than 89 per cent identical over the full-length DNA-A segments from previously characterized species, while their sister group, the mastreviruses, use a cut-off of 75 per cent (Fauquet et al. 2008). The single-stranded RNA potyviruses use a threshold value of 85 per cent (Fauquet et al. 2005), but there is discussion of reducing this to 76 per cent (Adams et al. 2005). It is necessary to include PNI when characterizing a novel viral sequence from a family that has a threshold PNI in order to assess whether or not it represents a novel species.
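As a minimal sketch of how a PNI and a threshold comparison might be computed from two aligned sequences, consider the following. The gap handling (columns where both sequences have a gap are skipped, and a gap opposite a base counts as a mismatch) is an assumption made here, not a rule taken from the taxonomy literature, and the 20-column toy alignment is invented; the begomovirus criterion itself applies to full-length DNA-A segments.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Per cent nucleotide identity between two rows of one alignment.

    Assumptions: '-' marks a gap; columns where both rows are gaps are
    skipped; a gap opposite a base is counted as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def same_species(pni: float, threshold: float) -> bool:
    """True when the PNI exceeds a family-specific demarcation threshold."""
    return pni > threshold


# Thresholds quoted in the text (per cent identity).
THRESHOLDS = {"begomovirus": 89.0, "mastrevirus": 75.0, "potyvirus": 85.0}

# Toy alignment, invented for illustration only.
novel = "ATGGCGTACG-TTAGCCGTA"
known = "ATGGCGTACGATTAGCTGTA"

pni = percent_identity(novel, known)
print(f"PNI = {pni:.1f}%")
for family, threshold in THRESHOLDS.items():
    print(family, "same species:", same_species(pni, threshold))
```

The same function can be applied to every pair in an alignment, which avoids choosing comparison sequences a priori.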
(b) BLAST can be better
PNI is a better measure than simply the presence or absence of a PCR band, since it confirms that what lies between those sequence-specific sites is the expected sequence, and does not ignore insertions and deletions. If the researchers do not wish to undertake a more complete phylogenetic analysis, using BLAST can be an intermediate step (Altschul et al. 1990). BLAST compares a query sequence to the entire GenBank non-redundant nucleotide sequence collection, looks for high identity matches, and selects sequences that closely match the entire query sequence. The matching sequences are ranked by expect scores (E-values) that correspond to the relative likelihood of the match being identified by chance alone. If one uses BLAST to query a novel viral sequence, and it is a member of a viral species or genus that has many sequences in GenBank, the results will show the publicly available sequences of that group to which the novel sequence has the highest PNI. These sequences increasingly have their country and time of isolation in their GenBank files or in an accompanying publication. However, it is important to note that these details must be explicitly specified; the year of submission to GenBank and the country of the submitting scientists are not reliable indicators of when and where a virus was isolated. By using BLAST, one can find the most similar viruses for PNI comparisons without preconceived notions about sequences from particular geographical areas to which the novel sequence should be most closely related.
(c) Some situations are perfect for phylogenetics
If PNI is good, and BLAST is better, then the most thorough placement a novel viral sequence can initially receive is through phylogenetic analysis. Not every new report of a plant disease requires a phylogeny, but reporting the first incidence of a pathogen in a new location, without attempting to determine where it could have come from, shortchanges the scientific community. In order to increase plant biosecurity, each country or region needs to know how and from where pathogens previously entered the region (Rondoni 2009). If novel sequences come with biogeographical information in the initial report, it will speed the process of highlighting frequent sources of infection and consistently leaky ports of entry in interstate and international commerce. These analyses could lead to increased vigilance when screening imports from a subset of countries that are consistent sources of phytopathogens. This is especially critical, as trade agreements have weakened the ability of many countries to routinely quarantine plant material (Jones 2009; Rondoni 2009).
PNI analyses make some of the same implicit assumptions as phylogenetic analyses: that the alignment of the sequences is correct and reflects homology. By choosing one or a few sequences to calculate PNI against, the author is making the a priori decision that these are the most informative sequences with which to compare the new sequence. By contrast, phylogenetic analyses that involve more sequences from a wider geographical range than is typically employed in PNI analyses allow unexpected relationships between sequences to emerge.
Recombination can obfuscate patterns of common descent because it means different sequences with different evolutionary histories are physically joined together in the genome. Recombination is a frequent occurrence in many plant viruses (Chare & Holmes 2006; Lefeuvre et al. 2007), and software programs exist to detect recombination breakpoints (many are collated into RDP3; Martin et al. 2005) so that the evolutionary relationships of different portions of the genome can be analysed separately. As PNI makes fewer assumptions about ancestry, it is less affected by the presence of recombination than are phylogenetic analyses. Alignments destined for phylogenies should be screened for recombination, and researchers must be very cautious about further analysis with the entire alignment when recombination is detected. One approach is to break up the dataset into smaller alignments at recombination breakpoints and analyse them separately. Another is to aim not for a single, bifurcating phylogenetic tree, but a network (Huson & Bryant 2006). The blended evolutionary history of several plant viruses has been traced using split network methods in SplitsTree4 (Hu et al. 2007; Codoner & Elena 2008; Martin et al. 2009). While networks can better reflect the true ancestry of many recombinant plant viruses, the analysis of migration and hypothesis testing is more complicated on a network than on bifurcating trees (Huson & Bryant 2006).
BLAST analysis shares many assumptions with phylogenetic analyses, and has an understandable bias towards finding the most closely related sequence over the longest stretch of nucleotides. BLAST will score a longer but lower similarity match higher than a shorter, more similar match. As many viral sequences in GenBank are from relatively short species-specific PCR amplicons, this means that using BLAST on a longer viral sequence, such as the whole genome, will not necessarily return the highest PNI match over the highly sequenced, species-specific amplicon region. Rather, it will return the highest PNI match for the entire length of the query sequence. Alignments for phylogenetic analyses can be trimmed such that all sequences have the same length and the algorithms compare across an even amount of information. There is an obvious disadvantage to eliminating part of the new viral sequence from analysis and consideration, but perhaps the solution is to present both a phylogenetic analysis based on a good alignment and the PNI from a BLAST analysis with the full sequence.
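The length bias described above can be sketched with a toy scoring scheme. The match and mismatch values below are illustrative assumptions rather than the scoring parameters of any particular BLAST configuration, the alignments are treated as ungapped, and both hits are hypothetical.

```python
def raw_alignment_score(length: int, pident: float,
                        match: int = 1, mismatch: int = -2) -> float:
    """Raw score of an ungapped alignment under a match/mismatch scheme.

    The default scoring values are assumptions chosen for illustration.
    """
    n_match = length * pident / 100.0
    n_mismatch = length - n_match
    return n_match * match + n_mismatch * mismatch


# Hypothetical hits for a whole-genome query: a long, slightly divergent
# match and a short, identical match to a species-specific amplicon.
hits = [
    {"name": "full_genome_hit", "length": 2000, "pident": 97.0},
    {"name": "short_amplicon_hit", "length": 300, "pident": 100.0},
]

by_score = sorted(
    hits,
    key=lambda h: raw_alignment_score(h["length"], h["pident"]),
    reverse=True,
)
by_identity = sorted(hits, key=lambda h: h["pident"], reverse=True)

print("top hit by score:   ", by_score[0]["name"])
print("top hit by identity:", by_identity[0]["name"])
```

Ranking by score and ranking by identity disagree here, which is why a PNI taken from the top hit for a full-length query need not be the highest PNI available over the commonly sequenced amplicon region.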
Phylogeographical approaches not only allow epidemiologists to trace the source of infections, they also can provide a measure of how confident researchers should be in their conclusions. Through estimates of support for clades, from bootstrap analyses or from Bayesian posterior probabilities, authors and readers alike can assess how probable it is that the alignment underlying the phylogenetic tree is representative of a real relationship between sequences that are closely grouped on the tree. This allows authors to move beyond suggestion and give a relative measure of the support for their assertions. For instance, when zucchini yellow mosaic virus (ZYMV) was first discovered in Poland, pair-wise PNI comparisons indicated it was more closely related to sequences from Asia than to other European strains (Pospieszny et al. 2007). However, these authors were unable to give a measure of how confident they were in these relationships. The phylogenetic relationship between Polish (and now two French) ZYMV isolates and a Chinese isolate from 1999 was published in 2009, and revealed moderate 77 per cent bootstrap support, lending increased credibility to the authors' conclusions (Lecoq et al. 2009). Several additional examples of the utility of phylogenetic analysis inspired by the recent plant virus report literature are given below.
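The resampling idea behind bootstrap support can be sketched as follows. This is a deliberate simplification and not the tree-based bootstrap performed by phylogenetic programs: instead of rebuilding a tree for each replicate, it asks how often a candidate sequence is the unique closest sequence to the query by uncorrected p-distance when alignment columns are resampled with replacement. The function names and the toy alignment are hypothetical.

```python
import random


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned columns at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def resample_columns(alignment: dict, rng: random.Random) -> dict:
    """Draw alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


def nearest_neighbour_support(alignment: dict, query: str, candidate: str,
                              replicates: int = 1000, seed: int = 1) -> float:
    """Fraction of replicates in which `candidate` is the unique closest
    sequence to `query`. A crude stand-in for clade support."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        sample = resample_columns(alignment, rng)
        dists = {name: p_distance(sample[query], seq)
                 for name, seq in sample.items() if name != query}
        best = min(dists.values())
        unique = list(dists.values()).count(best) == 1
        if unique and dists[candidate] == best:
            hits += 1
    return hits / replicates


# Toy alignment, invented for illustration only.
toy = {
    "novel":  "ATGGCGTACGATTAGCCGTA",
    "asia":   "ATGGCGTACGATTAGCTGTA",
    "europe": "ATGACGTTCGATCAGCTGTA",
}
support = nearest_neighbour_support(toy, "novel", "asia")
print(f"support for novel + asia: {support:.2f}")
```

A pair-wise PNI reports only that one sequence is the closest; the resampled fraction indicates how consistently the alignment supports that grouping.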
(d) Tomato yellow leaf curl virus
As TYLCV has continued to spread in North America and the Caribbean, sequences have recently been added to GenBank from viruses isolated in Arizona (Idris et al. 2007), Guadeloupe, Grenada, Kentucky (de Sá et al. 2008), Martinique (Urbino & Dalmon 2007) and Mexico (Idris et al. 2007; Gámez-Jiménez et al. 2009). An updated tree created from an alignment of partial coat protein genes, which recapitulates the two geographically distinct New World clades (Duffy & Holmes 2007; Zhang et al. 2009), is given in figure 1. The more recently isolated viruses were published with PNI to a range of closely related TYLCV isolates, but our phylogenetic analysis offers better resolution of the ancestry of some of these strains. The publication describing the partial genome sequences of TYLCV isolated in Kentucky reported them to be 98–99% identical to TYLCV-US:TX, TYLCV-US:AZ and TYLCV-US:SC (de Sá et al. 2008), but our analysis provides support for the Kentucky strains being more closely related to the Texan strain in particular. Similarly, the first Californian TYLCV isolate had a very high PNI to the strains from western Mexico (Rojas et al. 2007), but the tree in figure 1 places a quantitative measure of the level of support in the alignment for the grouping of TYLCV-US:CA with the Sonoran and Sinaloan strains.
(e) Banana bunchy top virus
The economically important banana bunchy top virus (BBTV) is a threat to banana crops in Asia, on Pacific Islands, in Australia and in the Middle East (Amin et al. 2008). The spread of BBTV into several novel geographical areas has been documented, especially in Hawaii, where the introduction of the virus to each island can be traced and dated using molecular phylogenetic methods (Almeida et al. 2009). That analysis revealed evidence of two introductions of BBTV onto Kauai island, despite 99.6 per cent or more nucleotide identity among the Kauaian sequences. Thus, the use of phylogenetic methods revealed continued exchange of infected plant material and/or infected banana aphids among the Hawaiian islands, which would have almost certainly been overlooked if the researchers only used PNI (Almeida et al. 2009).
We constructed a gene genealogy for all available coat protein genes of BBTV to see if we could detect any other cryptic second introductions of this virus. In figure 2, we highlight sequences from the Hainan province in China and show that two coat protein alleles are circulating within the region. Hainan is an island in the South China Sea and its first BBTV isolates grouped with those of nearby Vietnam (Jun & Zhi-Xin 2005). The more recently sequenced isolate from June 2008 is more closely related to viruses from the Chinese mainland (direct submission to GenBank, accession number FJ463044). The older Hainan BBTV isolates are 99.69 and 99.57 per cent identical to FJ463044; these are again very high PNIs that make an incorrect, direct, ancestral relationship between the older and newer Hainan isolates seem plausible.
(f) Tomato chlorosis virus
In 2006–2007 tomato chlorosis virus (ToCV) was identified for the first time in South America, in Sumaré, Brazil (Barbosa et al. 2008). The researchers who sequenced portions of this Brazilian ToCV found that the strain was 99 per cent identical to ToCV from the USA. The phylogenetic analysis in figure 3 suggests, albeit with lower bootstrap support than that observed in our other analyses, that ToCV in Brazil is more closely related to ToCV from the Mediterranean (Greece, Turkey and Lebanon) than to strains from North and Central America. When the 463-base partial genome sequence was trimmed to make it align with other ToCV sequences in GenBank, the remaining 309-base region of the Brazilian isolate was more than 99 per cent identical to Turkish and Lebanese isolates, and 98.7 per cent or less identical to the other New World isolates. Besides having a lower PNI to the other New World isolates than to the Mediterranean strains in this region, the Brazilian strain lacks two synapomorphies (shared derived mutations) that are common to the other New World isolates. This analysis suggests that ToCV might have been introduced twice to the New World, but the relatively weak bootstrap values on the tree make any definitive statement inadvisable. As future isolates are collected and analysed, the hypothesis of multiple introductions can be more thoroughly examined.
These examples show that a phylogenetic approach can provide a geographical context to novel viral sequences, and either provide support for intuitive relationships, or introduce the notion that viruses have migrated multiple times into a region.
(g) No additional experiments
For researchers who are already using sequencing to identify and confirm the causative agent of a disease, no additional wet-lab work is needed to conduct a phylogenetic analysis. While there is a learning curve for using phylogenetic programs, many of the relevant programs are free or low-cost, and tutorials exist online and in book form. One approachable volume that now assists readers using Mega (Tamura et al. 2007) is Barry Hall's third edition of Phylogenetic Trees Made Easy (Hall 2007). Beginning to use phylogenetic methods opens the door to more advanced hypothesis testing. One directly relevant application would be comparing the likelihood of two hypothetical evolutionary histories: one where a virus is allowed to have multiple introductions to a geographical region, and one where all isolates from the geographical region must be descended from a single introduction (e.g. Duffy & Holmes 2007).
In addition to noting multiple introductions of a virus and identifying weak points in plant biosecurity, it is important for disease management to know that multiple strains of a virus are circulating in the same region. Co-infection of the same plant by multiple strains of a virus can lead to more severe symptoms owing to synergistic action or the rare creation of a more virulent genotype, both illustrated in the cassava mosaic disease outbreak in Uganda in the late 1990s (Legg et al. 2006). With foreknowledge of multiple strains in an area, researchers could begin monitoring to see if recombinant strains emerge and are associated with more severe symptoms. However, plant pathologists need to be aware of the potential problem before they put the time and resources into increased vigilance.
This work was supported by funds from the Rutgers School of Environmental and Biological Sciences and the New Jersey Agricultural Experiment Station.
One contribution of 14 to a Theme Issue ‘New experimental and theoretical approaches towards the understanding of the emergence of viral infections’.
© 2010 The Royal Society