Wednesday, August 24, 2005

Alternative splicing and host defense in flies and plants

In an article appearing online in Science this week, and discussed in The Scientist, Watson et al. (PubMed) implicate the Drosophila Dscam gene in host defense. They detect secreted forms of the protein in hemolymph and show that the gene enhances phagocytosis of bacteria by hemocytes. They also demonstrate conservation of the potential for extreme isoform diversity across insect taxa, an extension of earlier work from the Graveley lab (Graveley et al. 2004; PubMed, RNA journal). Isoform diversity due to alternative splicing is therefore implicated in the generation of adaptive variation in host defense molecules. It is interesting that isoform diversity due to alternative splicing of Toll-like proteins has likewise been implicated in plant defense (reviewed by Kazan 2003 and Jordan et al. 2002; an example is Zhang & Gassmann 2003).

What kind of adaptation does this make possible? Certainly, extreme variability allows rapid adaptation on a population level. Furthermore, the presence of membrane-bound and secreted forms of the same molecule presents the possibility of adaptive immunity through clonal selection of hemocytes that see antigen. Louisa Wu pointed me to an article in Nature Immunology (Little, Hultmark and Read 2005) making the point that neither memory nor specificity has been ruled out in invertebrate immunity. True adaptive immunity in insects would be very exciting, but we're a long way from that. How could variation among hemocytes in the production of Dscam isoforms be heritable? Through epigenetic silencing of splicing factors? We're just at the beginning of this story.

The authors say this:

broad conservation of receptor diversity strongly suggests important
functions and future studies will have to further address
whether the presence of diverse immune receptors in
invertebrates increases the effectiveness of immune responses
of individual animals. Alternatively, given the relative short
life span of many invertebrates, it may be that immune
receptor diversity is less important ontogenetically but rather
enhances the adaptive potential of animal populations to
changing environmental and pathogenic threats.

Tuesday, August 16, 2005

Nature Genetics and the Mid-Atlantic Plant Molecular Biology Society Conference

There is always something interesting in Nature Genetics, but the July issue seems especially rich.

I appreciate this editorial. The advice here (e.g. that postdocs and advisors make a formal plan, and that postdocs ask themselves such questions as "is this the most important scientific question I can ask") is excellent. Anyone considering a postdoc, or taking on a postdoc, should read this.

I am often asked (especially by educated non-scientists in my acquaintance) about genetics and race. This is an old debate and there are excellent sources of information and opinion (including a Social Sciences Research Council forum, a special issue of Nature Genetics and an edited volume by Jefferson Fish). The bottom line is that race is indeed a social construct. (At least it is very poorly defined within biology, and what biological definitions might be partially valid differ significantly from the way the concept is normally used in our society.) The licensing of BiDil specifically for African Americans is therefore troubling. It seems to me that if a drug differs in either safety or efficacy for one "race" or another, then the underlying basis is probably either a genetic difference or a cultural difference. In the first case, the relevant genetic difference itself, or a related biomarker, would be much more reliable than popular notions of race. On the other hand, if the basis is cultural, the relevant practice (such as lifestyle or diet) should be identified. I was therefore gratified to see Nature Genetics publish this letter from Jonathan Kahn making the case against the misuse of race, as well as a sidebar showing how the media has misrepresented their own statements.

Transcriptional Gene Silencing, RNA polymerase IV and siRNAs
The association of specific RNAs (siRNAs) with silenced chromosomes presents something of a paradox (since the siRNAs themselves must be transcribed). This paradox is elegantly resolved by the discovery of "RNA polymerase IV," which is presumed to transcribe otherwise silent regions, at least in Arabidopsis (Kanno et al.: "Atypical RNA polymerase subunits required for RNA-directed DNA methylation" Nature Genetics; PubMed; other recent papers cited therein and a News and Views by Vaucheret). In other species RNA polymerase II is implicated (e.g. Schramke et al.) but there may be less siRNA corresponding to silenced loci in those species. On a related note, I was impressed by the massive amounts of MPSS data on Arabidopsis siRNAs presented by Pam Green at the MAPMBS meeting last week. This data includes over 75,000 different siRNA sequences and will soon be online in a browsable form.

Structural genomic variation within species
One of the insights I took away from last year's MAPMBS meeting was the idea (Rafalski, MAPMBS2004, PubMed) that "races" of maize show significant variation in gene content due to small (sub-megabase scale) structural differences: insertions, deletions and inversions. Although a speaker at this year's meeting expressed the opinion (based on sequence data) that the case in maize may have been overstated, another paper in Nature Genetics (Tuzun et al., Fine-scale structural variation of the human genome; NG; PubMed) reports 297 cases of "intermediate scale" structural variation in a single human individual! It will be interesting to see how this plays out with more time, but SNPs may well be displaced by presence/absence variation as the focus of attention in human genetics. As Charles Lee notes in his News and Views piece, what we see depends on our technology for looking, and I am reminded that a lot of early work in population genetics was based on inversions visible on polytene chromosomes.

Friday, August 12, 2005

Global regulation of alternative splicing: starting with Nova

The global analysis of alternative splicing is complicated by the fact that standard microarrays, and even tiling arrays without junction oligos, do a poor job of reporting on the ratio between alternatively spliced mRNA isoforms that share most of their nucleotides. Only in the past few years has alternative splicing data from arrays been reported (see Clark et al., 2002; Johnson et al., 2003; Stolc et al. 2004 and Pan et al. 2004). It was therefore exciting to see the paper by Ule et al. in the new issue of Nature Genetics reporting the effect of Nova2 knockouts on global patterns of alternative splicing in the mouse brain.

A custom microarray from Affymetrix was used for this study. Although I applaud the efforts of Hui Wang and John Blume to bring alternative splicing to the Affymetrix platform (and [full disclosure] I own some Affymetrix stock), custom chips are extremely expensive. I have my eyes on the Agilent platform used by Pan et al. and what I would really like to see is the widespread use of a common (inexpensive) platform so that publicly available data can be mined for unexpected associations.

Another notable aspect of this study is the truly remarkable degree of functional connection between proteins whose isoforms appear to be regulated by Nova2. Figure 5 in this paper makes a compelling case for the idea that while transcriptional regulators can turn gene sets on and off, splicing regulators can fine-tune an entire module for a specific task.

Finally, it is important to note that these experiments are facilitated by the fact that Nova2 knockout mice are viable, which rendered tissue-specific ablation (as practiced by Ding et al. and Xu et al. for similar studies on SR proteins) unnecessary. That is why we consider it a good thing that so many of the Arabidopsis SR proteins we work with are not essential. This does not mean they are not important!

Monday, July 25, 2005

PLoS Genetics

I just got an email that PLoS Genetics has launched. It looks great, and given the quality of the other PLoS journals (particularly Biology and Computational Biology), I'm expecting a lot from it.

Friday, July 22, 2005

Parameters for using blastn with noncoding queries

If one wants to look for a conserved noncoding RNA in a new genome using the best possible tools, then one should use sophisticated structure-based methods such as Klein and Eddy's RSEARCH (BMC Bioinformatics 4:44, PubMed), and should consult the RNA database Rfam (Griffiths-Jones et al., 2003: Rfam: an RNA family database and 2005: Rfam: annotating non-coding RNAs in complete genomes; PubMed). However, alignment tools such as blast or fasta are more readily available, so it is often expedient to use alignment when other tools would do better. If you do that, you must adjust the parameters – you will never find noncoding RNAs using the default parameters for blast. I confronted this problem at the Drosophila genome jamboree in 1999 and published the parameters I used there in the paper I wrote with Helen Salz (J. Biol. Chem.; PubMed).

Now, I've posted a discussion and "how-to" guide (Posting 4) based on work that Chau Nguyen (a University of Maryland Computer Science and Biology double major) did with me a few years ago. These are written for use on the NCBI blastn server, but are easily adapted to running blast locally. Briefly, my advice is to use the parameters -r 5 -q -4 -G 10 -E 4 -W 7. These values not only find mammalian U6atac using plant U6atac but provide an alignment across the entire snRNA. If you don't find what you want, you may want to make adjustments based on the more thorough discussion in the posting, where I describe several parameter sets that will correctly identify plant snRNA genes using animal snRNA queries.

Bear in mind two caveats: limit the size of your query and be prepared to use independent criteria for identifying correct hits. These searches require more computing resources than standard blast searches and it will generally take longer than the estimated time for your results to come back. For related reasons, you should not attempt to use these parameters for queries longer than about 500 bp. If you are using a noncoding RNA as the query, do not include nontranscribed flanking sequence; you may even want to remove poorly conserved parts of the RNA itself. Also, because the assumptions that go into calculating E values are violated by these parameters, the E values reported in your output will be meaningless except as relative numbers (better matches will still have lower E values). Do not pay attention to the E values (except when comparing results obtained with the same parameter set) and do not report them. However, the lack of reliable E values is not license to believe nonsense; your results should be validated by external criteria such as secondary structure and conservation of known functional regions.
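For anyone adapting this to a local search, here is a minimal Python sketch of how the command line might be assembled. It assumes the legacy NCBI blastall program, where -r is the match reward, -q the mismatch penalty, -G the gap opening cost, -E the gap extension cost and -W the word size. The file name, database name and helper functions are hypothetical, and the 500 bp ceiling is just the rough limit mentioned above. The sketch only builds the argument list; it does not run the search.

```python
# Parameter values quoted in the post, keyed by legacy blastall flag letters.
NONCODING_PARAMS = {"-r": 5, "-q": -4, "-G": 10, "-E": 4, "-W": 7}

# Rough ceiling on query length suggested in the post (an approximate guide).
MAX_QUERY_LEN = 500


def fasta_length(fasta_text):
    """Count residues in a single-record FASTA string, skipping the header."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )


def build_command(query_file, database, query_len):
    """Return the blastall argument list, refusing queries that are too long."""
    if query_len > MAX_QUERY_LEN:
        raise ValueError(
            f"query is {query_len} nt; trim it to about {MAX_QUERY_LEN} nt or less"
        )
    args = ["blastall", "-p", "blastn", "-i", query_file, "-d", database]
    for flag, value in NONCODING_PARAMS.items():
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    # Hypothetical query: a short placeholder sequence standing in for an snRNA.
    record = ">example_snRNA\nGTGTTGTATGAAAGGAGAGAAGGTTAGCACTCC\n"
    cmd = build_command("query.fa", "target_genome", fasta_length(record))
    print(" ".join(cmd))
```

The argument list can be handed to subprocess.run if blastall is installed. Whatever E values come back should be treated as relative numbers only, as noted above.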

Good luck! If you have experience that bears on this, or can cite relevant literature, please let me know and I'll update the posting.

Thursday, June 30, 2005

Things that are not exons

I have thought for many years that the genomics community needs a term other than 'exon' for coding segments. This post points out how lacking such a name has led to misuse of the word 'exon'. I also suggest that the word 'croe' be used instead, but my primary purpose is to call attention to the need for new names. I would be happy to have other names used properly.

This was presented at the Alternative Splicing SIG at ISMB. My presentation in PowerPoint form is available here and is posted on my web site as Posting 3. My hope is that the term be introduced into the Sequence Ontology, but I'll leave it up to my friends there to get it right.

An exon is defined as a segment of a gene that is present in the mature mRNA product of that gene. Genes for noncoding RNAs that are spliced are divided into exons and introns (examples include tRNAs and rRNAs, as well as a variety of noncoding RNA polymerase II transcripts) and every spliced mRNA has at least two exons that are partly noncoding, containing the 5' UTR and the 3' UTR. However, the need to refer to isolated coding segments that are often complete exons but are sometimes only a part of an exon has led many people to use the term 'exon' inappropriately, and this has created confusion. In one extreme case, a published paper presents an "exon size distribution" which includes many coding segments that are only part of an exon. There are many other examples.

Some people are careful to get it right, and many of them use the term CDS to refer to these coding segments. For example, Michael Zhang, in his excellent 2002 review of computational genefinding (PubMed) writes "To discriminate CDS from intervening sequence, the best content measures are the so-called frame-specific hexamer frequencies" and "... hexamer frequencies alone can detect most [long] CDS regions." However, CDS has shortcomings as a word. Foremost among them is its ambiguous meaning. The same exact term is used to refer to the entire coding region of a gene. This is analogous to using the same word for exon and mRNA.

I am grateful to Myles Axton (Nature Genetics 37:15 (01 Jan 2005), "Touching Base") for introducing the readers of Nature Genetics to his term for coding segments that are less than an entire exon, which is CROE (coding region of an exon, pronounced as in "crow"). Because the term 'exon' never communicates anything about where coding information lies, it is important that the term 'croe' apply as well to coding regions that are coincident with an exon. People should be able to say "the croes of this gene" when they refer to the units that together make up a full CDS.

Alternatively spliced segments. I have a related concern that there be a term for segments that appear as indels when two alternatively spliced mRNAs (or cDNAs) are compared. Such a segment can be a complete exon, part of an exon (occurring between two alternative splice sites) or an intron, and need not be coding. Kondrashov and Koonin refer to these various mechanisms as generating LDAS (length difference alternative splicing; 2003; PubMed; Trends in Genetics 19:115-9) but do not suggest a name for the segments themselves (other than "alternative segment," or "inserted alternative segment," which terms they use repeatedly). One idea is 'asproe,' for alternatively spliced region of an exon, which has the advantage of being paired with croe (but the disadvantage that a single insertion may consist of two or more croes, alternatively spliced region of exons and will often be less than an entire croe). It is a useful concept. If one has in hand cDNA or EST sequences that differ by an insertion, the mode of alternative splicing is unknown, but the alternatively spliced region is clear, even when genomic sequence is not available. Finally, there could be two terms here: one to refer to the alternative segment at the nucleotide level and another to refer to the alternative segment at the protein level. These need not correspond; an interesting case is where the length of the segment is not a multiple of 3 nucleotides, so that the coding of downstream regions is affected. A classical case, found in the first complete eukaryotic genome sequence (SV40), comes from the small t antigen, in which overlapping reading frames are created by alternative splicing.

Wednesday, June 29, 2005

ISMB 2005

I have been in Detroit for a week. The two-day Alternative Splicing meeting preceding ISMB was outstanding, and really crystallized a community of people who are working on genome scale analysis of alternative splicing.

Two things that really struck me at this meeting were:

1) The importance of ontologies (and, more generally, the formal description of scientific knowledge). There were 51 posters in the section on ontologies and NLP. One title that caught my eye was "Transforming Full-Text Literature to Formalized Facts." I was trained to believe that scientific publication was the formalization of facts! I see that it's not good enough anymore. Ewan Birney articulated this clearly in his Keynote address this morning when he said that databases are Biology, "the starting point and the end point of our understanding." I heard calls at this meeting for the formal annotation of data on function analogous to the submission of sequence data. This is clearly coming. Experimental scientists who want their results to be included in emerging system-wide descriptions will have to participate, and informaticians will have to find a way to collect formal descriptions of functional data (Janet Thornton, in her keynote, referred to this as data harvesting and showed a cartoon).

2) The idea that very few people can speak "both languages" (Biology and Computing) is outdated. Being at the alternative splicing workshop really brought this home. It reminds me of being in Miami, where virtually everyone speaks both English and Spanish perfectly. It's still true that the majority of Biologists are inadequately familiar with databases and computers, and that the majority of computer scientists don't "get" biological questions, but virtually everyone here (a large meeting with well over 1,000 people) is completely bilingual. This is a change from just five years ago and it means that we can stop worrying about translation and get on with the research.

Another very interesting point was in the keynote by Jill Mesirov on the use of Gene Sets. By using predefined sets of genes (her "knowledge base") she was able to apply rank statistics to find significant differences between microarray data sets between which no single gene shows a significant difference. She has published on these methods (e.g. Brunet et al. 2004) but it was new to me.
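To make the general idea concrete, here is a toy Python sketch of a rank-based test on a predefined gene set. It is not the method presented in the keynote. It is a simple rank-sum statistic compared against random gene sets of the same size, and the gene names, scores and function names are all made up for illustration.

```python
import random


def rank_genes(scores):
    """Map each gene to its rank (1 = lowest score) among all genes."""
    ordered = sorted(scores, key=scores.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def gene_set_p_value(scores, gene_set, n_perm=2000, seed=0):
    """One-sided permutation p-value for the rank sum of a gene set.

    Counts how often a random set of the same size has a rank sum at
    least as large as the one observed for gene_set.
    """
    ranks = rank_genes(scores)
    observed = sum(ranks[g] for g in gene_set)
    rng = random.Random(seed)
    all_ranks = list(ranks.values())
    k = len(gene_set)
    hits = sum(
        1 for _ in range(n_perm) if sum(rng.sample(all_ranks, k)) >= observed
    )
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical per-gene scores, e.g. a difference between two conditions.
    scores = {f"g{i}": 0.01 * i for i in range(100)}
    shifted_set = [f"g{i}" for i in range(90, 100)]
    print(gene_set_p_value(scores, shifted_set))
```

The point of the design is that the test looks only at where the members of the set fall in the overall ranking, so a set whose genes all move a little in the same direction can stand out even when no single gene would.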

The hotel (Renaissance Marriott) was nice in many ways, but had its problems. When I arrived, they could not make keys; I had to be let into my room by a valet and come back later. Once in my room, I discovered that the phone didn't work. The internet was constantly going down (which caused problems for two of the three presentations I saw that used it). Twice (2/7 days), housecleaning did not replace the coffee packets. Access to the hotel itself, and navigation among the first three floors, was absurdly indirect. This design feature is apparently related to ideas of security more evocative of the middle ages (embattled castles protected by moats) than the Renaissance (intellectual excitement derived from an open exchange of people and ideas). The architecture reflects a philosophy which ignores the fact that inaccessibility leads to marginalization. This center houses the General Motors corporate headquarters and I was led to an image of GM executives cowering like Quasimodo in his tower, in this case the Detroit Dark Ages Center, while life goes on below them (and without them).

Saturday, June 11, 2005

Cultural Transmission of Fitness

It was more or less by chance that I read the recent article by Heyer, Sibert and Austerlitz in the April issue of Trends in Genetics about what they call cultural transmission of fitness as carefully as I did. I had it with me on a plane today, and the seats on Northwest Airlink were just too close together for me to get out my laptop. CTF is the nongenetic transmission of fitness, and they make an intuitively compelling case (PubMed) that CTF can have a huge effect on effective population size and coalescence times. Their model appears applicable not only to the transmission of true culture in human populations, but also to epigenetic changes and artificial selection. It's not every day that a new idea in population genetics is articulated, and I found this fairly exciting. However, the idea is more a formulation of ideas that I've been vaguely aware of for a long time than an entirely new idea. This does not take anything away from them; a formal statement of a phenomenon and its consequences is what constitutes progress in population genetics (the real work is presented in Sibert, Austerlitz and Heyer, Theoretical Population Biology 2002; PubMed). Furthermore, their citations suggest that the idea has been around for a while (although it's new to me). In fact, the applicability of this model to my previous post has apparently already been tested and rejected ("CTF was [not detected] in Ashkenazi Jews")!

Saturday, June 04, 2005

Selection vs. differential allele flow

The media is reporting (NYTimes; Economist) that there is a paper in press in The Journal of Biosocial Science that attributes the pattern of inherited diseases among Ashkenazi Jews to selection for intelligence. This hypothesis breaks not one but several taboos by talking about race, selection and intelligence, so I'm reluctant to say anything at all about it. However, I think that they missed something (I won't be sure until I see the paper, which is not out yet). Selection, "red in tooth and claw," need not be invoked. Differential migration out of the population could have a powerful effect and seems to have been overlooked. In a minority population with asymmetric gene flow (in other words, whenever the rate of assimilation into greater society exceeds the rate of acquisition of new converts) any genetic variation that disfavors assimilation will increase in frequency in the minority population. It is plausible that intelligence could be enhanced by this (for example, if intelligence improved one's ability to learn Torah or become a rabbi and those things made assimilation less likely). It is also plausible that alleles causing non-lethal genetic diseases could actually be favored within a minority population by reducing the probability that affected individuals would leave, which seems likely if the community provided care not available outside and not needed by healthier relatives who were therefore more likely to leave.
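A back-of-the-envelope version of the argument can be written as a one-line recursion. The Python sketch below assumes a deliberately crude model that is not taken from the paper or from any data: haploid inheritance, no converts coming in, no other evolutionary forces, and made-up per-generation probabilities of leaving the community for carriers and non-carriers of a variant. If carriers leave with probability m1 and everyone else with probability m0, the frequency among those who stay is p(1 - m1) / [p(1 - m1) + (1 - p)(1 - m0)].

```python
def next_frequency(p, leave_carrier, leave_other):
    """Frequency of the variant among those who remain after one generation."""
    stay_carrier = p * (1.0 - leave_carrier)
    stay_other = (1.0 - p) * (1.0 - leave_other)
    return stay_carrier / (stay_carrier + stay_other)


def trajectory(p0, leave_carrier, leave_other, generations):
    """Iterate the recursion and return the frequency in every generation."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], leave_carrier, leave_other))
    return freqs


if __name__ == "__main__":
    # Hypothetical rates: carriers leave at 1% per generation, others at 5%.
    freqs = trajectory(0.10, 0.01, 0.05, 40)
    for generation in (0, 10, 20, 40):
        print(generation, round(freqs[generation], 3))
```

In this toy model differential departure behaves exactly like selection with relative fitnesses (1 - m1) and (1 - m0), even though nobody dies or fails to reproduce. The variant rises in the minority population simply because the people who carry it are more likely to stay.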

Saturday, May 14, 2005

Sean Carroll on Kojo Nnamdi

I happened to catch Sean Carroll on the Kojo Nnamdi show Thursday. He was refreshingly articulate and reasonable on the subject of evolution and religion, pointing to the middle ground and quoting the Pope. This is especially refreshing at a time when so many on the political right have confused a faith they share with much of mainstream America with political views that have no place in a civilized society. The title of his book, "Endless Forms Most Beautiful," cites an explicitly religious quote from Darwin ("There is a grandeur in this view of life, with its several powers, having been originally breathed by the Creator into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being evolved"). I haven't read the book, but I'm familiar with much of Sean Carroll's work and I suspect that he does an outstanding job of laying out that which is indeed most wonderful.

Monday, April 11, 2005

Arabidopsis minisymposium

The sixth Arabidopsis minisymposium, joined this year with the spring Mid-Atlantic section ASPB meeting, was a big success. It's great to be at the center of something, and helping to host a regional meeting of such high quality definitely makes me feel that I am at the center of Arabidopsis research, even though my colleagues are entirely responsible for the excellent selection of speakers and I still have to pause and mentally review whenever anyone relies on my knowledge of photosynthesis, parts of the flower or plant hormones. Caren gets credit for putting the two meetings together and for inviting Susan Lolle to tell us about the work that put Arabidopsis on the front page of the New York Times. Heven, Zhongchi, June and their students all deserve credit for making this happen.

My own talk was well received, even though it was the last and delayed by an unplanned break when the projector overheated after about eight hours of nearly continuous use. I am happy to have made the case before this audience that RNA processing, including alternative splicing, is important in plants. I was aided in this by talks that presented roles for RNA binding proteins in crosstalk between ethylene and auxin (Jose Alonso), pollen tube growth (Mark Johnson) and leaf polarity (Randy Kerstetter); RNA binding proteins are getting hot! The question I most appreciated came from Ken Birnbaum, who challenged me to think of an example in which a forward genetic screen identified regulated alternative splicing. Of course, there is the regulation of flowering time through FLC (Simpson and Dean) but that appears to be regulated by polyadenylation, not alternative splicing.

Tuesday, April 05, 2005

What you can do with a dozen genomes

I really enjoyed the ISR Distinguished Lecture by Eric Green a few weeks ago. It reinforced my excitement about the idea that having a dozen genomes will allow us to obtain qualitatively different information than we’ve been able to obtain from a single genome. In addition to the alignment-based methods he described, there is the (rather amazing) possibility of reconstructing the ancestral sequence (see Blanchette et al. 2004, a very nice paper by an all-star cast) and methods of assigning gene function based on patterns of duplication and loss (e.g. the recent paper by Li, Pellegrini and Eisenberg in Nature Biotechnology). A talk by Najib El-Sayed on Friday about three trypanosomatid genomes (Trypanosoma brucei, Trypanosoma cruzi and Leishmania) underscored the prospect of understanding the responses of genomes to selection. It strikes me that with 20-30 appropriately related genomes one could deduce whether individual nucleotides within a conserved block are under selection, an incredibly powerful tool (of course, I’m thinking about ESEs). Like many new methods, comparative genomics will yield insights in ways that will not be fully appreciated until the data are at hand. It is exciting, and it reminds me of the excitement we all felt during the late 70s, when the first sequences were being obtained.

This was originally posted on Steve's View a few weeks ago, soon after the talk from Eric Green.

Thursday, March 24, 2005

"On Genetics"

I have always been interested in the nature of genetic information, including its expression, transmission and change. After five months of intermittent blogging, it occurs to me that it would be useful to separate my personal diaries, opinions and recollections from commentary on genetics and genomics that might be of interest to students and colleagues. For that reason, I'm creating this blog, "On Genetics," which is devoted to my comments on scientific matters related to genetics, genomics and gene expression. If you want to see my views on more personal or political matters visit "Steve's View."