Thursday, February 4, 2016

The Primate Family Tree: A classroom activity in evolution, adaptable for all ages

Feel free to email me to request the document (with all this text and all these figures) which is easier to work with.

In this activity, students will…
Observe and describe the similarities and differences between primates and other familiar mammals, and also the similarities and differences among primates.
Classify primate species into groups (superfamily and above). 
Transform Linnaean classification into evolutionary theory by merely changing the question from How do primates look similar and different from one another? to Why do primates look similar and different from one another?
Build a primate “family” tree by turning Linnaean classification into phylogeny (or evolutionary tree-thinking) to describe how common ancestry and change over time explain both similarities and differences among primates.

How old are the students? 
*Prior to about age 8, teachers will need to riff quite creatively off track from this (for one, because reading the Primate Taxonomy Table may be very difficult for younger students), but I include those earlier ages because I believe that teachers of those age groups can find a way to use this lesson if they’d like to. It's possible that just knowing how to read is enough to do a stripped down version of this activity, which includes children even younger than 5. I have used this activity successfully with students aged 8-25. For upper-level anthropology courses I've used it as an ice-breaker to kick off the semester (with students who have a background in evolutionary thinking already).

How long will it take? 
30 minutes minimum (much more depending on detail you wish to cover)

What materials?
Color pictures of a diverse array of primates - approximately 3-5 times as many pictures as students. Too many is better than too few. Rip them out of old textbooks and laminate for durability. Print them off the Internet ( is one of the best sources) and laminate for durability. Make sure to have at least two different pictures of the same species for many of the species you include. Label most of them with the common names under “examples” on the chart for Part 2 (e.g. “baboon”). But a fraction may be only labeled with geographic region, scientific name, or nothing at all.  Explanation is in the instructions below. Specific sources of primate photos for printing are listed in Appendix A.
Pencil – 1 per student
Note cards -  1 per student (for them to draw a self-portrait or a symbol to represent themselves)
Poster paper, or large sheets of paper – 6, one for each superfamily on the Primate Taxonomy Table (below)
Tacky gum, reusable tape, or some other ingenious sticky tool that can both hold primate pictures to the posters and also be removed and moved to different posters when students change their minds. 
Handout for students (see options below; must at minimum include Part 2: Primate Taxonomy Table)

Teacher Instructions

Using the resources under Part 1 of the materials below, hold a discussion about classification. Don’t talk about relatedness or common ancestry! And especially don’t talk about evolution! (Next, in Part 2, the taxonomic terminology they will use, like “family,” will hopefully encourage them to think evolutionarily, and this will come into play later in the activity.) They will already be familiar with how like is grouped with like. I often use a sock-and-underwear-drawer analogy, and grocery store organization works too. Stick to primates if you’d like, but if you go broader, make sure to end your discussion with primates, including humans. Make sure to explain how (Linnaean) taxonomy/classification works. That is, simply and broadly (or in complicated, specific detail if you’d like) describe the methods: the use of comparative anatomy and homology, and also the binomial species name for the smallest, most exclusive group within ever-more-inclusive groups going up to the Kingdom level.

Hang a poster for each superfamily along the wall in no particular order, and lay out a pile of the photos in no particular order. Students get up out of their chairs and, using the Primate Taxonomy Table (below), stick each primate picture to the poster labeled with the superfamily to which they think it belongs. They are able to do this without any knowledge of primates because you have labeled most of the pictures with, for example, “baboon”. They can go to the table on the handout, see that baboons belong in the superfamily Cercopithecoidea, and stick the photo to that poster. Unlabeled baboons should look similar to labeled ones, and students who are carefully observing and engaged in the activity should stick those to the Cercopithecoidea poster too. Make sure they stick their own “human” cards to a poster too. Hopefully they will respectfully work together and move others’ pictures around if they think they have a better case for a different classification of a particular primate.

Primate Taxonomy Table (handout).
Email me for a file you can work with more easily.

Lead the students through a tour of the features that unite the primates into each of the superfamilies. First ask them to describe the similarities among all the primate superfamilies (a quick review of Part 1). Then ask them to describe the differences they can see between the superfamilies: what makes hominoids separate from cercopithecoids, etc?  It’s not imperative, but I recommend starting with the primates that share the most with humans (hominoidea) then going to the cercopithecoidea, and so on. This may appear to be difficult because they may observe overall (super) family resemblance but not be able to describe any more detail than that, which is fine! Using the resources under Part 3 below, you can provide details of the differences between the superfamilies, both that are visible in the pictures and that are not. 

Challenge students to explain the patterns of similarities that make all these creatures primates and that, for example, unite the hominoids, the cercopithecoids, etc..., while also explaining the differences that make each species unique and each superfamily unique.  Hopefully, with very little help from you, they will arrive at the idea that relatedness explains it. Family history, on a larger scale than our own families, but the same kind of thing. Common ancestry and change over time since common ancestors. Evolution plain and simple.  

There are many ways to do this and showing more than one way would be great, but one way is to put the known phylogenetic structure, the branches of the tree for the superfamilies (see resources below), on the wall or board and have them deduce where the superfamilies go and stick the posters to those branches. Another way, for older students, is to have them figure out the relatedness of superfamilies first, starting with humans and hominoids and hypothesizing which are more and more distantly related based on increasing differences. Another is to merely show them how to walk through the table for Part 2 and change it into a hypothesis for phylogenetic/evolutionary history, with lineages diverging where each level of taxonomy divides things further into more exclusive groups. So show them how to draw time and descent lines around that evolution-free taxonomy (the classification table for Part 2) that they already have and they’ll arrive at... dun-dun-DUN... evolution! I prefer to draw the students’ hypothesized tree like a big oak tree (with a strepsirrhine/haplorhine split in the trunk deep down near the bottom) on the wall, and to stick the posters at the ends of the branches. But it’s obviously up to teachers and whether they have a big wall to draw on! I cover my wall in paper first so that I can draw the big tree.
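For teachers who like to see the logic spelled out, the "walk through the table" approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the group names listed for each superfamily (in particular the labels "Lemuroidea", "Lorisoidea", and "Tarsioidea", and the intermediate grouping "Anthropoidea") are placeholders filled in for the example, so swap in whatever your own Part 2 table uses. The output is written in Newick notation, a common plain-text format for trees, in which each pair of parentheses stands for a hypothesized common ancestor.

```python
# Hypothetical classification: each superfamily is listed with the nested
# groups it belongs to, from most inclusive to least inclusive.
# Replace these with the groups on your own taxonomy table.
CLASSIFICATION = {
    "Lemuroidea": ["Primates", "Strepsirrhini"],
    "Lorisoidea": ["Primates", "Strepsirrhini"],
    "Tarsioidea": ["Primates", "Haplorhini"],
    "Ceboidea": ["Primates", "Haplorhini", "Anthropoidea", "Platyrrhini"],
    "Cercopithecoidea": ["Primates", "Haplorhini", "Anthropoidea", "Catarrhini"],
    "Hominoidea": ["Primates", "Haplorhini", "Anthropoidea", "Catarrhini"],
}


def build_tree(classification):
    """Nest each superfamily inside the groups that contain it."""
    tree = {}
    for superfamily, groups in classification.items():
        node = tree
        for group in groups:
            node = node.setdefault(group, {})
        node[superfamily] = {}  # a leaf: no subgroups below it
    return tree


def to_newick(name, children):
    """Write a (sub)tree as Newick text; parentheses mark shared ancestry."""
    if not children:
        return name
    inner = ",".join(to_newick(child, sub) for child, sub in children.items())
    return f"({inner}){name}"


tree = build_tree(CLASSIFICATION)
root_name, root_children = next(iter(tree.items()))
newick = to_newick(root_name, root_children) + ";"
print(newick)
```

Each level of the classification becomes a branching point, which is exactly the move from "how are they grouped?" to "why are they grouped?" that the activity is after.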

Teacher Resources and Optional Handout Materials
Teachers: Pick and choose what you’d like to include in your handout, depending on what you will cover with your particular students (depending on age, time, goals, etc…). Be careful not to share any handouts too soon and spoil the opportunity for students to think first, if that’s what you have time for and are going for. 

Part 1
  • Where do humans fit in the classification of life on Earth? (link)
  • How do we make these categories? We ask, ‘What’s similar?’ about the anatomy when comparing different species. That is, we look to homologous structures. A great example is the tetrapod forelimb. (link)

  • What makes a primate a primate? (link)

Part 2
Nose: wet (hence the name strepsirrhine)
Geographic region: Madagascar
Tail present: yes
Activity: Some nocturnal, some diurnal
Teeth: Many more than we have, some shaped like comb for grooming fur
Body size: Small but variable from the smallest primate alive (<< 1 lb.) to ones as big as big pet cats (20 lbs.)

Nose: wet (hence the name strepsirrhine)
Geographic region: Sub-Saharan Africa and Southeast Asia
Tail present: yes and no
Activity: nocturnal
Teeth: Many more than we have
Body size: small

Nose: dry (hence the name haplorhine) 
Geographic region: Southeast Asia
Tail present: yes
Activity: nocturnal
Teeth: Many more than we have
Body size: small

Nose: dry (hence the name haplorhine) 
Nostrils: flat and facing out to the side (hence the name platyrrhine)
Geographic region: Central and South America
Tail present: yes (and some are even prehensile!)
Activity: Most diurnal, some nocturnal
Teeth: four more than humans (one extra premolar/bicuspid in each quadrant of mouth compared to us)
Body size: Variable, with some small (like pygmy marmosets) but the largest, the spider monkey, is 25 lbs.

Nose: dry (hence the name haplorhine) 
Nostrils: facing down (hence the name catarrhine)
Geographic region: Asia, Southeast Asia, and Africa (all sub-Saharan except the Barbary macaque of Morocco and Gibraltar)
Tail present: yes (but 2-3 species have none or only very small stubs)
Activity: diurnal
Teeth: same number as humans
Body size: Many, including most macaques and baboons, weigh more than any ceboids. Some mandrills weigh over 100 lbs.!

Nose: dry (hence the name haplorhine) 
Nostrils: facing down (hence the name catarrhine)
Geographic region: Southeast Asia (gibbons, siamangs, orangutans); Sub-Saharan Africa (gorillas, chimpanzees, bonobos); Worldwide (humans)
Tail present:  no
Activity: diurnal
Teeth: same number as humans
Body size: Although gibbons and siamangs are the smallest of the group, this group is the largest in body size and weight of all primate superfamilies and includes gorillas, the largest of all primates which can weigh 400 lbs.!

Part 4

Evolution and phylogenetic thinking are just family history writ large.

Part 5. The Primate Family Tree

Here is an example (one hypothesis, if you will) of a primate phylogeny or phylogenetic tree or evolutionary tree. 

Here's guidance on how to turn Part 2’s table into a phylogeny with students.

Then here it is, stripped down and rotated...

Appendix A.
Sources for primate pictures






Humans: students draw a personal sign, symbol, or self-portrait

Wednesday, February 3, 2016

Thoughts on the latest schizophrenia genetics report

The news and social media were headlining a report last week that presented some genetic findings, and even aspects of a possible causal mechanism, related to schizophrenia.  As habitually skeptical readers of these daily stories, we wondered how substantial this claim is.

The report in question was a Nature paper by Sekar et al. that identifies variation in the very complex MHC genome region that, based on the authors' analysis, is statistically associated with schizophrenics relative to unaffected controls. These are variants in the number of copies of particular genes in the C4 'Complement' system.  The authors show that gene copy number is correlated with gene expression level and, in turn, with some changes in brain tissue that may be related to functional effects in schizophrenia patients.

Comparing genotypes and disease status, in ~30,000 cases and controls of European ancestry, in 40 cohorts from 22 countries, the authors find that genotypes with higher C4 gene copy numbers are more frequent in schizophrenics, and there is a quantitative relationship between copy number and expression level in postmortem-tested neural tissue.  The relevant potential mechanism involved may have to do with the pruning of synapses among neurons in the brain.

The authors estimate that the relative risk of the highest-copy number genotype is 1.27 times that of the lowest. The lowest risk genotype is rare in the population, comprising only about 7% of the sample population, meaning that almost everyone has a middling relative-risk genotype.  That is comparable, say, to most of us having middling height or blood pressure. But the net population absolute risk of schizophrenia is about 1%, so that the absolute risks associated with these various genotypes are small and not even very different from each other.  The careful work done by the authors has many different components that together consistently seem to show that these copy number differences do have real effects, even if the absolute risks are small.
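To make the arithmetic concrete, here is a rough sketch in Python. It uses only the two figures quoted above (a relative risk of 1.27 between the highest- and lowest-risk genotypes, and a population risk of about 1%) plus one assumption of ours, not taken from the paper: that every genotype's risk lies between the lowest and the highest, so the population average does too. Under that assumption the absolute risks can be bounded without knowing the genotype frequencies.

```python
RELATIVE_RISK = 1.27      # highest- vs lowest-copy-number genotype (from the paper)
POPULATION_RISK = 0.01    # roughly 1% overall risk of schizophrenia

# The population average must sit between the lowest and highest genotype
# risks (our assumption), so: lowest <= POPULATION_RISK <= 1.27 * lowest.
lowest_min = POPULATION_RISK / RELATIVE_RISK   # smallest the lowest risk can be
lowest_max = POPULATION_RISK                   # largest the lowest risk can be
highest_max = RELATIVE_RISK * lowest_max       # largest the highest risk can be

# The gap between highest and lowest is 0.27 * lowest, which is largest
# when the lowest risk is at its maximum.
max_gap = (RELATIVE_RISK - 1) * lowest_max

print(f"lowest-risk genotype:  {lowest_min:.2%} to {lowest_max:.2%}")
print(f"highest-risk genotype: at most {highest_max:.2%}")
print(f"largest possible gap:  {max_gap:.2%}")
```

In other words, even the highest-risk genotype carries an absolute risk of no more than about 1.3%, and the difference between the extremes is at most about a quarter of a percentage point.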

How that effect or association arises is not clear, and the findings are certainly not the same as explaining schizophrenia as a C4 disease per se.  As the authors note, around 100 or so other chromosome locations have been associated with the disease in genome-wide mapping studies that have been done.  That means that if their results stand up to scrutiny, C4 variation is one component of what is basically a polygenic disorder.  The association for each C4 genotype category is the effect averaged over all other contributing causes in those people. The absolute risk in individuals with a given copy number is still very small, and may depend on other genetic or environmental factors.

Schizophrenia is not a single disorder and has a spectrum of onset age, sex, symptoms, and severity of occurrence.  Many authors have been warning against using a single term for this variety of traits. Whether that is relevant here or not remains to be seen, but at least as presented in their paper, some of the current authors' results seem not to vary with age.  This study doesn't address whether there is a smallish subset of individuals in each C4 category who are at much higher risk than the average for the category.  However, the familial clustering of schizophrenia suggests this may be so, because family members share environments and also genomic backgrounds.  One might expect that C4 genotypes are interacting with, or if not, being supplemented by, many other risk factors.

Even if average risk is not very high in absolute terms, this paper received the attention it did because it may be the first providing a seemingly strong case for a potentially relevant cellular mechanism to study, even if the specific effect on risk turns out to be quite small.  It could provide a break in understanding the basic biology of schizophrenia, given the dearth of plausible mechanisms known so far.

Because the statistically riskier genotypes are found in a high percentage of Europeans, one would expect them to be found, if at varying frequencies, in populations other than Europeans. Whether their associated risks will be similar probably depends on how similar the other risk factors are in other populations.  C4 copy number variation must be evolutionarily old because there is so much of it, clearly not purged by natural selection--another indicator of a weak effect, especially because onset is often in the reproductive years and would seem to be potentially 'visible' to natural selection. So why is the C4 variation so frequent?  Perhaps C4 provides some important neural function, and most variation causes little net harm, since schizophrenia is relatively rare at roughly 1% average risk.  Or perhaps copy number changes happen regularly in this general MHC genome region and can't effectively be purged, but are generally harmless.  But there is another interesting aspect to this story.

The Complement system is within a large cluster of genes generally involved in helping destroy invading pathogens that have been recognized.  It is part of what is called the 'innate' immune system. Innate here means it does not vary adaptively in response to foreign bodies, like bacteria or viruses, that get into the bloodstream.  The adaptive immune system does that, and is highly variable for that reason; but once a foreigner is identified, the complement system takes part in destroying it.  So it is curious that it would be involved in neural degeneration, unless it is responding to some foreign substance in the brain, or is an autoimmune reaction. But if the latter, how did it become so common?  Or is the use of C4 genes in this neural context a pleiotropy--a 'borrowed' use of existing genes that arose for immunity-related functions but then came also to be used for a different function?  Or is neural synapse regulation a kind of 'immune' function that hasn't been thought of in that way?  Whatever it's doing, in modern society it contributes to problems about 1% of the time, for reasons that this paper clearly will stimulate investigation into.

Why does this system 'misfire' only about 1% of the time?  One possible answer is that the C4 activity normally prunes synapse connections away in a random kind of way, but occasionally, by chance, prunes too much, leading to schizophrenia.  The disease would in that sense be purely due to random bad luck, rather than interacting with other mechanisms or factors. The higher the copy number the more likely the bad luck, but too weakly for selection to 'care'.  However, that reason for the disease seems unlikely, for several reasons.  First, mapping has identified about 100 or so genome regions statistically associated with schizophrenia risk, suggesting that the disease is not just bad luck. Secondly, schizophrenia is familial: close relatives seem to be at elevated risk, 10-fold in very close relatives and almost 50-fold in identical twins.  This should not happen if the pathogenetic process is purely random, even though, since haplotypes are shared among close family members, there could be a slight correlation in risk.  Also, the authors cite several incidental facts that suggest that C4 plays some sort of systematic relevant functional role.  But thirdly, since the absolute risk is so small, about 1%, one has to assume that C4 is not acting alone, but is directly interacting with, or is complemented by (so to speak) many other factors to which the unlucky victims have been exposed.

Something to test?
This might be a good situation in which to test a variant of an approach that British epidemiologist George Davey Smith has suggested as 'Mendelian randomization'.  His idea is basically that, when there is a known candidate environmental risk factor and a known gene through which that environmental factor operates, one can compare people with a genetic variant exposed to an environmental risk factor to people with that genetic risk factor but not exposed to test whether the environmental factor really does affect risk.

Here, we could have a variant of that situation.  We have the candidate gene system first, and could sort individuals having, say, the highest 'risk' genotypes, compared to the lowest, and see if any environmental or other systematic genomic differences are found that differentiate the two groups.

Interesting lead but not 'the' cause
Investigating even weakly causal factors could lead the way to discovering major pathogenic mechanisms or genetic or environmental contributors not yet known that interact with the identified gene region. There will be a flood of follow-up studies, one can be sure, but hopefully they will largely be focused investigations rather than repeat performances of association studies.

Given the absolute risks, which are small for given individuals, there may or may not be any reason to think that intervening on the C4 system itself would be a viable strategy even if it could be done. This still seems to be a polygenic--many-factorial--set of diseases, for which some other preventive strategy would be needed.  Time will tell.

In any case, circumspection is in order.  Remember traits like Alzheimer's disease, for which apoE, presenilins, beta-amyloid, and tau-protein associations were found years--or is it decades?--ago and still mystify to a great extent.  Or the critical region of chromosome 21 in Down syndrome that has, as far as we know, eluded intensive study for similarly long times. And there are many other similar stories related to what are essentially polygenic disorders with major environmental components.  This one is, at least, an interesting one.

Tuesday, February 2, 2016

We're all fundamentalists now

If you're a foodie at all, you've heard of Yotam Ottolenghi, chef, restaurateur, and food writer.  Perhaps you've used some of his recipes, or even have one or more of his cookbooks. And, if you're a fan you'll be happy to know that the Jan 11 episode of the Food Programme on BBC Radio 4 has an interview with him (starting at minute 7:15), and a brief overview of how he got to such a place of prominence in the food world.

Ottolenghi and his business partner, Sami Tamimi, both come from Jerusalem, but Ottolenghi from the Jewish west side and Tamimi from the Arab east.  They both now live in London where they have collaborated since the late 1990's on restaurants and delis and cookbooks, much of it with the aim of highlighting the food of their childhood.  Not only is their food amazing, but it's also worth noting that two men from two sides of the same strife-ridden Middle Eastern city have worked closely together for many years. This isn't something that everyone could do.

One of the cookbooks Ottolenghi and Tamimi wrote together is called Jerusalem, written to accompany a BBC television program of the same name. For the show, they returned to their birthplace and described and prepared some of their favorite foods, but it wasn't just about the food.  Tamimi said it was difficult to return. He believes that people were much more naive when he was a child, having faith that the conflict between Israel and Palestine could be solved. Now, he says, people are much more entrenched in their belief in the rightness of their side, and it's much more difficult to imagine the differing sides agreeing on a solution.

To us in the West, the Middle East epitomizes fundamentalism, strict adherence to the literal interpretation of a religious text or dogma.  And, fundamentalism goes hand in hand with terrorism. Fundamentalism is our enemy.

But, it's not just in the Middle East that people are more entrenched in their beliefs about right and wrong. Here in the US we've got the Tea Party dictating what real conservatism is, we've got militiamen in Oregon, and homegrown 'terrorists' demanding whatever they're demanding. We've got a Congress that agrees only to disagree. Dare I say it, even the 'new atheists' are fundamentalists. Indeed, compromise has become a dirty word, immoral even. In so many ways, moderation, the ability to see more than one side of an issue, has lost its way.

Ken's view is that in a world in which fundamentalists are now our enemy, we've all become fundamentalists; we know what we believe, we hold to those beliefs without question, and we have no respect for the other side.  If there is a strongly ideological force that you disagree with or that threatens you, it pushes you toward an equal and opposite ideology. You listen only to Fox News or MSNBC, you turn off the radio when Trump comes on, or when Clinton comes on, depending on your predilection -- or you're waiting for the Libertarian candidate to be selected, and there's no way you'll listen to anyone else.

If you're here reading this it's likely that you've also picked a side in the nature/nurture 'debate'; 'genetic determinism' either nicely describes your view of biology, or you're very uncomfortable with the term. Genes will or will not be found 'for' most traits, including behaviors, and diseases will or won't be predictable once we've all got our genomes on a CD.

We've said this before but it's worth repeating.  In 1926, one of the great early geneticists, Thomas Hunt Morgan, wrote this about stature:
A man may be tall because he has long legs, or because he has a long body, or both. Some of the genes may affect all parts, but other genes may affect one region more than another. The result is that the genetic situation is complex and, as yet, not unraveled. Added to this is the probability that the environment may also to some extent affect the end-product.
                                  (TH Morgan, The Theory of the Gene, p 294, 1926)
Morgan would be totally comfortable with the recent GWAS results showing that there are hundreds if not thousands of genes that contribute to stature, as well as environmental factors.  He'd agree that complex traits, like stature, or many diseases (including schizophrenia, which Ken will talk about tomorrow) are polygenic, with some environmental effect.  This has been known for almost a century. So why are people still looking for genes (meaning single genes, or a few genes with individually strong effects) 'for' type 2 diabetes, or heart disease, or stature, or schizophrenia?  Why don't we still know what Morgan knew so long ago?

Because sometimes it's not true.  Sometimes there are single genes whose variants are by themselves responsible for traits, including disease.  Starting in the early 1980's, the role of single genes in various traits began to be discovered: oncogenes, Huntington's, cystic fibrosis, breast cancer, and a whole host of single-gene pediatric diseases, and normal traits as well, like blood types, eye color and so on.  There are now about 6000 rare diseases for which genetic causation appears to be known, or at least claimed.   This history of successes misled, we would say, geneticists, and others, into assuming they could always expect to find 'the' gene for this and 'the' gene for that.  In essence it is still the informal working model, in the back of geneticists' heads, that everything segregates like Mendel's pea traits.

We can and do have both -- single-gene traits and complex traits due to many genes, or many genes and environmental factors too.  Indeed, there are also traits that are completely environmental -- look at the havoc Zika virus seems to be wreaking, with apparently no help from genes, even if close examination might find some people to be slightly more immune than others. Most viruses are like that.

So, it's curious that even the field of genetics has its fundamentalists.  Every time Ken and I write about complexity, or insufficient understanding of disease causation, or question how we know what we think we know, someone will send us a link to a paper that shows we're wrong because autism, or schizophrenia, or intelligence, or whatever their favorite trait, has been shown to be clearly genetic. Genes, with names, have been found to explain it. Sometimes the comments are so emotionally unrestrained that you'll never see them because we don't publish them.

And, we'll often or even typically look at the paper and realize that we've been reprimanded by a fundamentalist yet again. Autism, schizophrenia, heart disease, stature, intelligence, and so on are just not yet predictable from genes, and, we believe, are unlikely ever to be for reasons we write about all the time.  Ken will discuss the new Nature paper on schizophrenia tomorrow, a paper that got huge amounts of press for finally beginning to explain the disease.  Yes, a paper someone offered to send us when they disapproved of a post Ken had written about the difficulties of predicting disease, proving he was wrong. Which, good as that paper may be, is not the case.

I think if Morgan were to come back to the modern field of genetics, he'd feel as Sami Tamimi did returning to Jerusalem.  I think he'd be nostalgic for his era, when fundamentalists didn't rule the field and geneticists weren't prisoners of Mendel, ideologues who knew what they'd find before they even looked. An era when, even if things seem rosier in retrospect, and certainly people had preferred views and were not always nice to each other, there was more agreement that things were not yet clearly understood, and complexity was not a dirty word.  I think Morgan would appreciate that some traits are explained more simply than others, but that even those aren't 'simple' -- there are more than 2000 alleles in the CFTR gene that seem to be associated with cystic fibrosis, and this kind of complexity is true of most 'simple' traits.

So, why did the field lose this understanding, and take a turn to fundamentalism?  The answer isn't just that we're in a fundamentalist age, of course.  That it's a lot easier to sell the search for a causal gene than a search for.....we're not really sure what, is a large part of the problem.  But, as a friend says, we should be looking for the molasses that explains biological complexity, that connects causal pathways and processes, which ain't just gonna be a gene, or an environmental risk factor.  It's going to be something we don't yet understand, and continuing to look for 'the' gene for your favorite complex trait is only going to slow down the search.  Acknowledging that what we've learned, and confirmed over and over again since Mendel was rediscovered in 1900, is that most traits are complex -- and unpredictable -- is a crucial step.

Thursday, January 28, 2016

The delicious smell of eggs!

Paleontologists like to give names, often self-serving names, to new fossil specimens they unearth. In  part, they want to control the agenda, the species and hence evolutionary track they are revealing (for the first time, naturally!).  One naturally wants to be known as the person who discovered Hobjob Man (Homo hobjobensis).

Well, geneticists are people, too, with all the vanities that accompany that distinction.  They want to name their genes and show their insight.  That's why we have names like 'BRCA' for the 'breast cancer' genes, and countless other examples.  In fact, BRCA1 is, on current best understanding, a general-use, widely expressed gene whose coded protein is used to detect certain types of DNA mutations in the cell (mismatches, or non-complementarity, between opposite nucleotides at the corresponding location on the two strands of the DNA molecule) and help fix them.  It is not the, or even a, gene 'for' breast cancer!  It received its name because mutations in the gene were discovered being transmitted among victims of breast cancer in large families. Once identified, risk associated with the gene could be documented without needing to track it in families.   Proper gene-naming should describe the chromosomal location or normal function of a gene, where known, not why or how it was discovered, and should not suggest that its purpose is to cause disease.  Even the discovery-based labeling is risky because genes often if not typically serve multiple functions.

Humorous names like 'sonic hedgehog' are not informative but at least not misleading. One interesting example concerns the 'olfactory receptor' or OR genes.  These genes code for a set of cell-surface receptor proteins, part of a larger family of such genes, that were found in the olfactory (odor-detecting) tissues, such as the lining of the nose in vertebrates like mice and humans.  There is a huge family of such genes, about 1000 in mammals, that have arisen over the eons by gene duplication (and deletion) events.  Our genomes have isolated OR genes and also many clusters, of a few or up to hundreds of adjacent OR genes.  These arose by gene duplication events (and some were lost by inaccurate DNA copying that chopped off parts of the gene), so our genomes include both active and inactive (current and former) OR genes, with the numbers varying somewhat in each of our genomes.

Big arrays of genes like these often are inaccurately duplicated when cells divide, including during the formation of sperm and egg cells.  The inaccuracy includes mutations that alter the coded OR protein of a given OR gene, and hence generate differences among the many OR genes.  This process, over the millennia, generates the huge number and variety of gene family members, of which the OR family is the largest.  In the case of ORs, the idea has been that, like the immune system, these genes enable us to discriminate among odors--a vital function for survival, finding mates, detecting enemies, and so on.  Because of their high level of sequence diversity, each OR gene's coded protein responds to (can detect) a different set of molecules that might pass through the airways.  This allows us to detect--and remember--specific odors, because the combination of responding ORs is unique to each odor.  Discovery of this clever way by which Nature allows us to discriminate what's in our environment earned Richard Axel and Linda Buck a Nobel Prize in 2004.
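
For readers who like to see an idea in miniature, here is a rough Python sketch of that combinatorial logic. Everything in it is our own invention for illustration: the four receptor names and the molecules each one responds to are made up, not real OR specificities. The point is only that when each receptor responds to a different set of molecules, the combination of responding receptors can identify the odor.

```python
# Hypothetical receptors and the (made-up) molecules each one responds to.
# These names and pairings are illustrative assumptions, not real OR data.
RECEPTOR_LIGANDS = {
    "OR_A": {"vanillin", "eugenol"},
    "OR_B": {"eugenol", "limonene"},
    "OR_C": {"limonene", "vanillin", "geraniol"},
    "OR_D": {"geraniol"},
}


def response_pattern(odorant):
    """Return the set of receptors that respond to one odorant molecule."""
    return frozenset(
        name for name, ligands in RECEPTOR_LIGANDS.items() if odorant in ligands
    )


odorants = ["vanillin", "eugenol", "limonene", "geraniol"]
patterns = {o: response_pattern(o) for o in odorants}

# If every odorant triggers a different combination, the pattern identifies it.
all_distinct = len(set(patterns.values())) == len(odorants)

# Number of possible non-empty combinations of responding receptors.
max_patterns = 2 ** len(RECEPTOR_LIGANDS) - 1

for odorant, pattern in patterns.items():
    print(odorant, sorted(pattern))
print("all patterns distinct:", all_distinct)
print("possible combinations with 4 receptors:", max_patterns)
```

No single receptor in this toy identifies an odor by itself; it is the combination that does. With n receptors there are 2**n - 1 possible non-empty combinations, which is why even a few hundred receptor types could in principle distinguish an enormous range of odors.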

The catch is that this only works because each nasal olfactory cell expresses only a single OR gene. How the others are shut off in that cell, while each of them is turned on in other olfactory cells, is interesting but not really understood.  At least, this elaborate system evolved for olfactory discrimination....didn't it?  After all, the genes are named for that!

Well, not so fast.  A recent paper by Flegel et al. in Frontiers in Molecular Biosciences has looked for OR expression in individual mammal sperm cells.  It has concluded that the receptors coded by these genes, on the surface of sperm cells, enable the sperm to find and fertilize eggs.  As described by the authors, sperm cells locate egg cells in the female reproductive tract by various chemosensory/receptor means, in a process not fully understood. Various studies have found OR genes expressed on the surface of sperm cells, where they have been said to be involved in the movement mechanisms of sperm.  These authors checked all known OR genes for expression in human sperm cells (they looked for their RNA transcripts).  91 OR genes were detected as being expressed in this way.  They showed their presence in various sub-cellular compartments in the sperm cells, which may be suggestive of specific functions.

Interestingly, the authors claim they've been leaders in detecting 'ectopically' expressed OR gene transcripts (but they aren't the only people documenting such 'ectopic' expression; see this post from 2012).  Whether this is just transcriptional noise or really functional, the very term 'ectopic' suggests the problem with gene naming.  If they're in sperm cells, they aren't properly named as 'olfactory' receptors.  These authors detected varying numbers of OR genes in different samples.  Some of this can be experimental error, but if it is highly controlled variable expression, serious questions arise.  Many of the transcripts were from the antisense (opposite) DNA strand to the one that actually codes for a protein sequence.

The authors found some systematic locations of specific OR genes in the sperm cells, as shown here:

Localization in sperm of specific Olfactory Receptor genes.  Source: Flegel et al., see text.

These results are quite strange.  It is no surprise whatever to find that genes are used in multiple contexts.  But in this particular case, repeatable findings could mean sloppy transcription, so that actually important genes are near the OR genes and the latter are just transcribed and/or translated without real function.  Of course, the authors suggest there must be some function because, essentially, of the apparent orderliness of the findings.  Yet this is very hard to understand.

OR genes vary presumably because each variant responds to differing odorant molecules.  With a repertoire of hundreds of genes, and only one expressed per olfactory neuron, we can distinguish, and remember, odors we have experienced.   For similar reasons, the genes are highly mutable--again, that keeps the detectability repertoire up.  Your brain needs to recognize which receptor cells a given odor triggers, in case of future exposure.  But the combination of reporting cells, that is, their specific ORs, shouldn't generally matter so long as the brain remembers.

That eggy aroma!
There is a huge burden of proof here.  Again it is not the multiple expression, but the suggested functions, that seem strange.  If the findings actually have to do with fertilization, what is the role of this apparently random binding-specificity, the basic purportedly olfactory repertoire strategy, of these genes on the sperm cells' surface?  How can a female present molecules that are specifically recognized by this highly individualistic OR repertoire in the male?  How can her egg cell or genital tract or whatever, present detectable molecules for the sperm to recognize? What is it that guides or attracts them, whose specificity is retained even though the OR genes themselves are so highly variable?

And of course one has to be exceedingly skeptical about antisense OR-specific RNAs having any function, if that is what is proposed.  It is more than hard to imagine what that might be, how it would work, or most importantly, how it would have evolved.  Is this a report of really striking new genetic mechanisms and evolution....or findings not yet clear between function and noise?

The mechanism is totally unclear at present, and the burden of proof a major one.  Given that others have reported that OR genes are expressed in other cells, the evidence suggests that such expression is clearly believable, whatever the reason. Indeed, years ago it was speculated that they might serve to identify body cells with unique OR-based 'zip codes' for various internal use as we recognize which cell is in which tissue and the like.

Sperm- and/or testis-specific expression of at least some OR genes has also been observed before, as these authors note, but with less extensive characterization.  Is it functional, or just sloppy genome usage?  Time will tell.  The sperm cells are programmed to know the delicious smell of freshly prepared eggs.  Now, perhaps the next check should be to see whether the same sperm cells are also looking for (or lured by) the aroma of freshly fried bacon!

But if it is another use of a specific cell-identification system, of which olfactory discrimination is but one use, then it will be consistent with the well-known opportunistic nature of evolution.  There are countless precedents.  How this one evolved will be interesting to know and, perhaps especially, to learn whether olfaction was its initial use, or one adopted after some earlier--perhaps fertilization-related--function had already evolved.

But for our purposes today, the clear lesson, at least, should be the problem of coining gene names inaptly assigned because of their first-discovered function (or, in our view, because some whimsical geneticist liked a particular movie or cartoon character, like Sonic Hedgehog).

Wednesday, January 27, 2016

"The Blizzard of 2016" and predictability: Part III: When is a health prediction 'precise' enough?

We've discussed the use of data and models to predict the weather in the last few days (here and here).  We've lauded the successes, which are many, and noted the problems, including people not heeding advice. Sometimes that's due, as a commenter on our first post in this series noted, to previous predictions that did not pan out, leading people to ignore predictions in the future.  It is the tendency of some weather forecasters, like all media these days, to exaggerate or dramatize things, a normal part of our society's way of getting attention (and resources).

We also noted the genuine challenges to prediction that meteorologists face.  Theirs is a science that is based on very sound physics principles and theory that, as a meteorologist friend put it, constrain what can and might happen, and make good forecasting possible.  In that sense the challenge for accuracy is in the complexity of global weather dynamics and inevitably imperfect data, which may defy perfect analysis even by fast computers.  There are essentially random or unmeasured movements of molecules and so on, leading to 'chaotic' properties of weather, which is indeed the iconic example of chaos, popularly known as the 'butterfly effect': if a butterfly flaps its wings, the initially tiny and unseen perturbation can proliferate through the atmosphere, leading to unpredicted, indeed, wildly unpredictable changes in what happens.
The Butterfly Effect, far-reaching effects of initial conditions; Wikipedia, source
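
To make that sensitivity to initial conditions concrete, here is a small Python sketch. It does not model weather at all; as a stand-in we use the logistic map, a standard textbook example of a chaotic system, and the starting value and the size of the 'butterfly' perturbation are arbitrary choices of ours.

```python
def logistic_trajectory(x0, r=4.0, steps=60):
    """Iterate the logistic map x -> r * x * (1 - x), a toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion
# (the 'butterfly'). Both values are arbitrary illustrative choices.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Absolute gap between the two runs at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(step, gaps[step])
```

Early on the two runs are indistinguishable, but the gap grows roughly exponentially until the trajectories bear no resemblance to each other. That is the sense in which a forecast can be good for a few steps ahead and useless further out, even when the underlying rule is known exactly.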

Reducing such effects is largely a matter of needing more data.  Radar and satellite data are more or less continuous, but many other key observations are only made many miles apart, both on the surface and into the air, so that meteorologists must try to connect them with smooth gradients, or estimates of change, between the observations.  Hence the limited number of future days (a few days to a week or so) for which forecasts are generally accurate.

Meteorologists' experience, given their resources, provides instructive parallels as well as differences with biomedical sciences, which aim for precise prediction, often of things decades in the future, such as disease risk based on genotype at birth or lifestyle exposures.  We should pay attention to those parallels and differences.

When is the population average the best forecast?
Open physical systems, like the atmosphere, change but don't age.  Physical continuity means that today is a reflection of yesterday, but the atmosphere doesn't accumulate 'damage' the way people do, at least not in a way that makes a difference to weather prediction.  It can move, change, and refresh, with a continuing influx and loss of energy, evaporation and condensation, and circulating movement, and so on. By contrast, we are each on a one-way track, and a population continually has to start over with its continual influx of new births and loss to death. In that sense, a given set of atmospheric conditions today has essentially the same future risk profile as such conditions had a year or century or millennium ago. In a way, that is what it means to have a general atmospheric theory. People aren't like that.

By far, most individual genetic and even environmental risk factors identified by recent Big Data studies only alter lifetime risk by a small fraction.  That is why the advice changes so frequently and inconsistently.  Shouldn't it be that eggs and coffee either are good or harmful for you?  Shouldn't a given genetic variant definitely either put you at high risk, or not? 

The answer is typically no, and the fault is in the reporting of data, not the data themselves. This is for several very good reasons.  There is measurement error.  From everything we know, the kinds of outcomes we are struggling to understand are affected by a very large number of separate causally relevant factors.  Each individual is exposed to a different set or level of those factors, which may be continually changing.  The impact of risk factors also changes cumulatively with exposure time--because we age.  And we are trying to make lifetime predictions, that is, ones of open-ended duration, often decades into the future.  We don't ask "Will I get cancer by Saturday?", but "Will I ever get cancer?"  That's a very different sort of question.

Each person is unique, like each storm, but we rarely have the kind of replicable sampling of the entire 'space' of potentially risk-affecting genetic variants--and we never will, because many genetic or even environmental factors are very rare and/or their combinations essentially unique, and because they interact and they come and go.  More importantly, we simply do not have the kind of rigorous theoretical basis that meteorology does. That means we may not even know what sort of data we need to collect to get a deeper understanding or more accurate predictive methods.

Unique contributions of combinations of a multiplicity of risk factors for a given outcome means the effect of each factor is generally very small and even in individuals their mix is continually changing.  Lifetime risks for a trait are also necessarily averaged across all other traits--for example, all other competing causes of death or disease.  A fatal early heart attack is the best preventive against cancer!  There are exceptions of course, but generally, forecasts are weak to begin with and in many ways over longer predictive time periods they will simply approximate the population--public health--average.  In a way that is a kind of analogy with weather forecasts that, beyond a few days into the future, move towards the climate average.
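
The point about competing causes can be put in a few lines of arithmetic. The Python sketch below assumes, purely for illustration, that the disease of interest and all other causes of death each strike at a constant yearly rate (hazard); real hazards change with age, and the numbers here are made up. Under that assumption, the chance of ever getting the disease within a given number of years is the disease's share of the total hazard times the chance that anything happens at all.

```python
import math


def lifetime_risk(h_disease, h_other, years):
    """Chance of getting the disease before dying of something else.

    Assumes constant yearly hazards (a simplifying assumption), so the
    cumulative incidence is h_d / (h_d + h_o) * (1 - exp(-(h_d + h_o) * T)).
    """
    total = h_disease + h_other
    return (h_disease / total) * (1.0 - math.exp(-total * years))


# Same disease hazard in both cases; only the competing hazard differs.
# All rates are hypothetical values chosen for illustration.
low_competition = lifetime_risk(h_disease=0.005, h_other=0.005, years=50)
high_competition = lifetime_risk(h_disease=0.005, h_other=0.02, years=50)

print("lifetime risk, few competing causes: ", round(low_competition, 3))
print("lifetime risk, many competing causes:", round(high_competition, 3))
```

Nothing about the disease itself differs between the two cases, yet its lifetime risk is lower when other causes are more likely to get there first. A lifetime risk estimate is therefore a statement about everything else that can happen to a person, not just about the factor being studied.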

Disease forecasts change people's behavior (we stop eating eggs or forego our morning coffee, say), each person doing so, or not, to his/her own extent.  That is, feedback from the forecast affects the very risk process itself, changing the risks themselves and in unknown ways.  By contrast, weather forecasts can change behavior as well (we bring our umbrella with us) but the change doesn't affect the weather itself.

Parisians in the rain with umbrellas, by Louis-Léopold Boilly (1803)

Of course, there are many genes in which variants have very strong effects.  For those, forecasts are not perfect but the details aren't worth worrying about: if there are treatments, you take them.  Many of these are due to single genes and the trait may be present at birth. The mechanism can be studied because the problem is focused.  As a rule we don't need Big Data to discover and deal with them.  

The epidemiological and biomedical problem is with attempts to forecast complex traits, in which most every instance is causally unique.  Well, every weather situation is unique in its details, too--but those details can all be related to a single unifying theory that is very precise in principle.  Again, that's what we don't yet have in biology, and there is no really sound scientific justification for collecting reams of new data, which may refine predictions somewhat, but may not go much farther.  We need to develop a better theory, or perhaps even to ask whether there is such a formal basis to be had--or whether the complexity we see is just what there is.

Meteorology has ways to check its 'precision' within days, whereas biomedical sciences have to wait decades for our rewards and punishments.  In the absence of tight rules and ways to adjust errors, constraints on biomedical business as usual are weak.  We think a key reason for this is that we must rely not on externally applied theory, but internal comparisons, like cases vs controls.  We can test for statistical differences in risk, but there is no reason these will be the same in other samples, or the future.  Even when a gene or dietary factor is identified by such studies, its effects are usually not very strong even if the mechanism by which it affects risk can be discovered.  We see this repeatedly, even for risk factors that seemed to be obvious.

We are constrained not just to use internal comparisons but to extrapolate the past to the future.  Our comparisons, say between cases and controls, are retrospective and almost wholly empirical rather than resting on adequate theory.  The 'precision' predictions we are being promised are basically just applications of those retrospective findings to the future.  It's typically little more than extrapolation, and because risk factors are complex and each person is unique, the extrapolation largely assumes additivity: that we just add up the risk estimates for various factors that we measured on existing samples, and use that sum as our estimate of future risk.  
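
In code, the additivity assumption amounts to something like the Python sketch below. The baseline risk, the factor names, and the per-factor increments are all hypothetical numbers of ours, standing in for estimates from some past case-control comparison; no real study is being quoted.

```python
# Hypothetical estimates from a past (retrospective) sample.
# Every value here is made up for illustration.
BASELINE_RISK = 0.10
ESTIMATED_INCREMENTS = {
    "variant_X": 0.02,      # carrying some genetic variant
    "daily_coffee": -0.01,  # a factor estimated as slightly protective
    "smoking": 0.08,
}


def additive_risk(exposures):
    """Add up the per-factor estimates and treat the sum as future risk."""
    risk = BASELINE_RISK + sum(ESTIMATED_INCREMENTS[f] for f in exposures)
    # Keep the result a valid probability.
    return min(max(risk, 0.0), 1.0)


person = ["variant_X", "smoking"]
print("predicted risk:", round(additive_risk(person), 2))
```

Notice what the sum takes for granted: that each factor's effect is the same in every person, that factors do not interact, and that the increments estimated on yesterday's sample still hold for tomorrow's. None of that is guaranteed, which is why such a prediction is an extrapolation rather than an application of theory.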

Thus, while for meteorology, Big Data makes sense because there is strong underlying theory, in many aspects of biomedical and evolutionary sciences, this is simply not the case, at least not yet.  Unlike meteorology, biomedical and genetic sciences are really the harder ones!  We are arguably just as likely to progress in our understanding by accumulating results from carefully focused questions, where we're tracing some real causal signal (e.g., traits with specific, known strong risk factors), as by just feeding the incessant demands of the Big Data worldview.  But this of course is a point we've written (ranted?) about many times.

You bet your life, or at least your lifestyle!
If you venture out on the highway despite a forecast snowstorm, you are taking your life in your hands.  You are also imposing dangers on others (because accidents often involve multiple vehicles). In the case of disease, if you are led by scientists or the media to take their 'precision' predictions too seriously, you are doing something similar, though most likely mainly affecting yourself.  

Actually, that's not entirely true.  If you smoke or hog up on MegaBurgers, you certainly put yourself at risk, but you risk others, too. That's because those instances of disease that truly are strongly and even mappably genetic (which seems true of subsets of even most 'complex' diseases) are masked by the majority of cases that are due to easily avoidable lifestyle factors; the causal 'noise' that risky lifestyles generate makes genetic causation harder to tease out.

Of course, taking minor risks too seriously also has known potentially serious consequences, such as intervening on something that was only weakly problematic to begin with.  Operating on a slow-growing prostate or colon cancer in older people may lead to more damage than the cancer will. There are countless other examples.

Life as a Garden Party
The need is to understand weak predictability, and to learn to live with it. That's not easy.

I'm reminded of a time when I was a weather officer stationed at an Air Force fighter base in the eastern UK.  One summer, on a Tuesday morning, the base commander called me over to HQ.  It wasn't for the usual morning weather briefing.....

"Captain, I have a question for you," said the Colonel.

"Yes, sir?"

"My wife wants to hold a garden party on Saturday.  What will the weather be?"

"It might rain, sir," I replied.

The Colonel was not very pleased with my non-specific answer, but this was England, after all!

And if I do say so myself, I think that was the proper, and accurate, forecast.**

Plus ça change..  Rain drenches royal garden party, 2013; The Guardian

**(It did rain.  The wife was not happy! But I'd told the truth.)