Thursday, November 25, 2010
The Sardinian samples from the HGDP are always, as far as I know, classified as entirely of West Eurasian origin via clustering algorithms like ADMIXTURE and STRUCTURE. In other words, these Sardinians completely fit into clusters that peak north of the Sahara and west of Central Asia. So it would appear that gene flow to Sardinia from neighbouring Africa has been minimal, or even non-existent.
But that's not what I found when I took a closer look at their genomes, as well as those of over 250 other Europeans with apparently no extra-European ancestry, as shown by my own ADMIXTURE analyses. I picked one of my favorite "local admixture" programs for the job, called RHHcounter (see here for more details), setting the rare genotype detection level at 0.01%.
Quite a few of the individuals showed tiny clusters of 3-4 genotypes that were only common outside of Europe, usually in Africa, East Asia or the Americas. These were often too small to investigate further. However, I spotted two segments that were large and clear enough to warrant more detailed analyses. Surprisingly, these belonged to two of the HGDP Sardinians - HGDP00672 and HGDP00673. Below are their Chromosome Mosaics, courtesy of RHHmapper, along with MDS plots based on all the SNPs from the aforementioned segments (marked by arrows). The MDS plots include samples from Europe, North Africa and Sub-Saharan Africa.
As per above, the MDS plots were produced using all the genotypes contained within the relevant segments (over 300 and 2000 SNPs respectively), and not just those that were detected by RHHcounter in the analysis. Obviously, what this shows is that only a fraction of the extra-European genotypes were flagged, while the rest nearby remained undetected at this threshold.
I can't see any explanation for these results other than relatively recent gene flow from Sub-Saharan Africa to Sardinia. What this means, of course, is that there must be a reason why model-based algorithms can't pick up such admixtures in certain samples. As suggested by the authors of RHHcounter, perhaps the segments are too small and/or contain too few SNPs to have an impact on overall ancestry estimation? However, I also suspect that because Sardinia is something of a Southern European genetic isolate, the Sardinians are too easily classified as Europeans by ADMIXTURE, STRUCTURE etc., which might mask at least some of their minority admixtures.
Wednesday, November 24, 2010
I thought I'd reiterate in more detail how it's possible to get the most out of the RHHcounter/RHHmapper data I sent out this week. First of all, please read carefully the journal article I linked to, which has a thorough description of how the authors found previously unreported Sub-Saharan African admixture in two European American HapMap samples from Utah.
Secondly, it's very important to realize that these kinds of tests rely on more than just isolated matches to show something meaningful. In other words, one SNP hit with one, two or three other project members doesn't mean much in this context. You need multiple SNPs that show very similar genotype frequencies in tens or hundreds of samples. As mentioned previously, the SPSmart website is a very good resource for that sort of thing.
Tick the "CEPH U. Stanford HGDP" box and press "metasearch" > Tick all the continental boxes (i.e. Africa, America, etc.) > Paste your SNP (or SNPs) into the "SEARCH BY SNPS" text box > Press next > Press search
That should give you a set of pie charts and figures showing the frequency of the alleles for that SNP in 7 biogeographic zones. Like this...
And you can double check these results at the dbSNP website by clicking on the SNP in question @ SPSmart...
OK, so if you're GG or GT for that particular SNP, it's pretty obvious this could be a sign of African ancestry. But, as per above, that doesn't mean much by itself. Check whether there are other SNPs being flagged in the same area with similarly African specific results, and indeed whether there's a sign of a possible segment of African origin there. The more SNPs, and the clearer the segment, the more reliable the result.
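To make the "look for a segment, not isolated hits" idea concrete, here's a minimal Python sketch that groups flagged SNPs into candidate segments. It is not part of RHHcounter or RHHmapper. The function name, the 1 Mb gap limit and the four-SNP minimum are all assumptions picked for illustration, and the positions in the usage note are made up.

```python
from typing import List, Tuple

# A flagged SNP is represented as (chromosome, base-pair position).
Hit = Tuple[str, int]
# A candidate segment is (chromosome, start_bp, end_bp, number_of_flagged_snps).
Segment = Tuple[str, int, int, int]


def group_hits(hits: List[Hit],
               max_gap_bp: int = 1_000_000,   # assumed gap limit, not from the tool
               min_snps: int = 4) -> List[Segment]:
    """Group flagged SNPs that sit close together on the same chromosome."""
    segments: List[Segment] = []
    current: List[Hit] = []

    def close_run() -> None:
        # Keep the run only if it contains enough flagged SNPs.
        if len(current) >= min_snps:
            segments.append((current[0][0], current[0][1],
                             current[-1][1], len(current)))

    for chrom, pos in sorted(hits):
        # Start a new run on a chromosome change or when the gap is too large.
        if current and (chrom != current[-1][0]
                        or pos - current[-1][1] > max_gap_bp):
            close_run()
            current = []
        current.append((chrom, pos))
    close_run()
    return segments
```

Calling `group_hits([("7", 1000000), ("7", 1200000), ("7", 1500000), ("7", 1900000), ("13", 5000000)])` would report a single candidate segment on chromosome 7 and ignore the lone hit on chromosome 13. Any segment found this way still needs its SNPs checked at SPSmart and dbSNP.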
It's also possible to get an idea of the percentage of your genome covered by this segment by checking its start and end points in base pairs, and then working out its size in Mb (FYI, the human genome is about 3000 Mb in size). But that would just be a rough guide, because the size of the segment might change with a different threshold for rare genotype detection. For example, if I specify that all genotypes with an incidence of less than 0.005% are flagged, then that might show up a smaller segment than if I relax the threshold to 0.05%. Oh, and it might also pay to check the recombination rate for that area of the genome, because if it's in a so-called cold spot, then that segment could be old...really old.
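The arithmetic for that rough percentage can be sketched in a few lines of Python. The function name and the example coordinates are hypothetical, and the 3000 Mb genome size is the approximate figure quoted above.

```python
GENOME_SIZE_MB = 3000.0  # approximate human genome size used in this post


def segment_share(start_bp: int, end_bp: int,
                  genome_mb: float = GENOME_SIZE_MB) -> tuple:
    """Return (segment size in Mb, rough percentage of the genome)."""
    if end_bp < start_bp:
        raise ValueError("end_bp must not be smaller than start_bp")
    size_mb = (end_bp - start_bp) / 1_000_000
    percent = 100.0 * size_mb / genome_mb
    return size_mb, percent


# Hypothetical segment running from 10,000,000 bp to 13,000,000 bp.
size_mb, percent = segment_share(10_000_000, 13_000_000)
print(f"{size_mb:.1f} Mb, roughly {percent:.2f}% of the genome")
```

As noted, the result is only a guide, because the segment boundaries move with the detection threshold.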
Monday, November 22, 2010
Model-based genetic ancestry estimation algorithms, like ADMIXTURE and STRUCTURE, have some limitations. As I see it, their main problems are that they a) can't tell the difference between more recent admixture and very ancient clines in genetic diversity, b) can't pick up admixture that comes in the form of one or two very small segments, and c) don't show where in the genome the admixture is found.
So today I'd like to offer a solution: a simple and lightweight, but very clever program combo called RHHcounter/RHHmapper (see bottom of the post for details).
Imagine, for example, a white American carrying a couple of tiny segments of West African origin, from an ancestor who lived 250 years ago, and an eastern Finn with no Asian ancestors in the last 4000 years or more. If we run an inter-continental ADMIXTURE analysis with these two, it's very likely the American will score 100% European, while the eastern Finn will probably come out around 9% North and East Asian due to really old Uralic influence.
That sort of thing isn't an issue when comparing the genetic structure of populations, including their ancient admixtures. Eastern Finns are indeed genetically closer to North Asians than white Americans are, and that's basically what ADMIXTURE is picking up on. However, if the focus is on the individual, this is likely to be a problem. Our hypothetical American might be aware of that African ancestor, with solid paperwork backing up their genealogical connection, but he's pulling his hair out because nothing's showing up via genetic tests.
So let's take a look at a real life example of how RHHcounter can pick up segments of potentially recent Sub-Saharan African origin...
I put together a data set of over 350 samples that showed less than 2% non-West Eurasian influence in various ADMIXTURE analyses, and clustered in or very near Europe on MDS plots. I then let RHHcounter search these samples for genotypes with less than 0.005% frequency amongst them. The samples originating from North of the Alps and Carpathians scored 5-15 heterozygote hits each, usually widely dispersed around the genome. However, in a few Americans apparently of North European descent, the heterozygotes took the form of small segments.
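For anyone curious about the basic idea of flagging rare genotypes within a sample set, here's a minimal Python sketch. It is not the RHHcounter implementation, and the program's real input format and options aren't shown here. The data layout, the function name and the threshold (a fraction, so 0.01 means 1%) are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List

# Assumed layout: sample ID -> {SNP ID -> genotype string such as "AG"}.
Genotypes = Dict[str, Dict[str, str]]


def rare_genotype_hits(genotypes: Genotypes,
                       threshold: float = 0.01) -> Dict[str, List[str]]:
    """For each sample, list the SNPs where its genotype is rare in the set."""
    # Count how often each genotype occurs at each SNP across all samples.
    counts: Dict[str, Counter] = {}
    for calls in genotypes.values():
        for snp, gt in calls.items():
            if gt:  # skip missing calls
                # Sort alleles so that "GA" and "AG" count as the same genotype.
                counts.setdefault(snp, Counter())["".join(sorted(gt))] += 1

    hits: Dict[str, List[str]] = {sample: [] for sample in genotypes}
    for sample, calls in genotypes.items():
        for snp, gt in calls.items():
            if not gt:
                continue
            total = sum(counts[snp].values())
            freq = counts[snp]["".join(sorted(gt))] / total
            if freq < threshold:
                hits[sample].append(snp)
    return hits
```

With 200 samples, a genotype carried by a single person has a frequency of 0.005 and would be flagged at a threshold of 0.01. The hits only become interesting once several of them line up in the same region.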
Don Conrad of Genomes Unzipped, whose raw data I recently co-opted into my project, also showed a couple of such tiny segments, despite coming out 100% European in an ADMIXTURE run. These were located on chromosomes 7 and 13, and marked by just four SNPs each. You can see them on his RHHmapper Chromosome Mosaic below.
Admittedly, these don't seem like much, but in the context of my analysis, with the particular samples and thresholds I used, they do look relatively unusual for someone of Northern European origin. Indeed, in Don's case, the SNPs they contain also show heterozygote genotypes that appear distinctly Sub-Saharan African. I checked their characteristics via the very handy tools at the SPSmart and dbSNP websites.
Just for comparison, below is another Genomes Unzipped Chromosome Mosaic. This one belongs to Daniel MacArthur, and it also shows a few hits. But these heterozygotes are spread around the genome in a fairly random way, and don't appear to form segments. So as far as I can tell, it's much harder to make a case for relatively recent non-European admixture in Daniel's case.
Some of my European project members have already received their Chromosome Mosaics, and I plan to send out many more in the coming weeks. I recommend that everyone checks out their results carefully, without jumping to conclusions. Look for hits that form sizeable segments, and indeed, much more sizeable than Don's. If you do find any, study the genotypes within these at SPSmart and dbSNP, as per above, to possibly characterize their biogeographic origins. Also, keep in mind that genetic diversity might be a factor - for instance, Southern Europeans have a lot more genetic diversity than Northern Europeans, so they might show more hits on their Chromosome Mosaics.
We describe a novel approach for evaluating SNP genotypes of a genome-wide association scan to identify “ethnic outlier” subjects whose ethnicity is different or admixed compared to most other subjects in the genotyped sample set. Each ethnic outlier is detected by counting a genomic excess of “rare” heterozygotes and/or homozygotes whose frequencies are low (less than 1%) within genotypes of the sample set being evaluated. This method also enables simple and striking visualization of non-Caucasian chromosomal DNA segments interspersed within the chromosomes of ethnically admixed individuals. We show that this visualization of the mosaic structure of admixed human chromosomes gives results similar to another visualization method (SABER) but with much less computational time and burden. We also show that other methods for detecting ethnic outliers are enhanced by evaluating only genomic regions of visualized admixture rather than diluting outlier ancestry by evaluating the entire genome considered in aggregate. We have validated our method in the Wellcome Trust Case Control Consortium (WTCCC) study of 17,000 subjects as well as in HapMap subjects and simulated outliers of known ethnicity and admixture. The method's ability to precisely delineate chromosomal segments of non-Caucasian ethnicity has enabled us to demonstrate previously unreported non-Caucasian admixture in two HapMap Caucasian parents and in a number of WTCCC subjects. Its sensitive detection of ethnic outliers and simple visual discrimination of discrete chromosomal segments of different ethnicity implies that this method of rare heterozygotes and homozygotes (RHH) is likely to have diverse and important applications in humans and other species.
Ralph E. McGinnis, Visualizing Chromosome Mosaicism and Detecting Ethnic Outliers by the Method of “Rare” Heterozygotes and Homozygotes (RHH), Human Molecular Genetics, 2010, Vol. 19, No. 13, 2539–2553, doi:10.1093/hmg/ddq102