Monday, November 22, 2010

Locating and visualizing minority non-European admixtures across our genomes

Model-based genetic ancestry estimation algorithms, like ADMIXTURE and STRUCTURE, have some limitations. As I see it, their main problems are that they a) can't tell the difference between more recent admixture and very ancient clines in genetic diversity, b) can't pick up admixture that comes in the form of one or two very small segments, and c) don't show where in the genome the admixture is found.

So today I'd like to offer a solution: a simple and lightweight, but very clever, program combo called RHHcounter/RHHmapper (see the bottom of the post for details).

Imagine, for example, a white American carrying a couple of tiny segments of West African origin, from an ancestor who lived 250 years ago, and an eastern Finn with no Asian ancestors in the last 4000 years or more. If we run an inter-continental ADMIXTURE analysis with these two, it's very likely the American will score 100% European, while the eastern Finn will probably come out around 9% North and East Asian due to really old Uralic influence.

That sort of thing isn't an issue when comparing the genetic structure of populations, including their ancient admixtures. Eastern Finns are indeed genetically closer to North Asians than white Americans are, and that's basically what ADMIXTURE is picking up on. However, if the focus is on the individual, this is likely to be a problem. Our hypothetical American might be aware of that African ancestor, with solid paperwork backing up their genealogical connection, but he's pulling his hair out because nothing's showing up via genetic tests.

So let's take a look at a real life example of how RHHcounter can pick up segments of potentially recent Sub-Saharan African origin...

I put together a data set of over 350 samples that showed less than 2% non-West Eurasian influence in various ADMIXTURE analyses, and clustered in or very near Europe on MDS plots. I then let RHHcounter search these samples for genotypes with less than 0.005% frequency amongst them. The samples originating from North of the Alps and Carpathians scored 5-15 heterozygote hits each, usually widely dispersed around the genome. However, in a few Americans apparently of North European descent, the heterozygotes took the form of small segments.
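To make the counting step concrete, here's a rough Python sketch of the general idea: flag, for each sample, the heterozygous genotypes that are rare within the sample set itself. This is not the RHHcounter code, just an illustration. The function name, the data layout, the toy genotypes and the 0.25 cutoff are all made up for the example, and a real run would use a far larger sample set and a much lower threshold.

```python
from collections import Counter

MISSING = {"--", "00", ""}  # assumed codes for no-calls


def rare_het_hits(genotypes, max_freq):
    """Return {sample: [snp, ...]} for rare heterozygous genotypes.

    genotypes: {sample: {snp: two-letter genotype such as "AG"}}
    max_freq:  a genotype is "rare" if its frequency in the sample
               set is strictly below this fraction
    """
    counts = {}          # snp -> Counter of genotypes seen at that SNP
    totals = Counter()   # snp -> number of called genotypes
    for calls in genotypes.values():
        for snp, gt in calls.items():
            if gt in MISSING:
                continue
            gt = "".join(sorted(gt))  # treat "GA" and "AG" as the same
            counts.setdefault(snp, Counter())[gt] += 1
            totals[snp] += 1

    hits = {}
    for sample, calls in genotypes.items():
        found = []
        for snp, gt in calls.items():
            if gt in MISSING:
                continue
            gt = "".join(sorted(gt))
            is_het = gt[0] != gt[1]
            if is_het and counts[snp][gt] / totals[snp] < max_freq:
                found.append(snp)
        hits[sample] = sorted(found)
    return hits


# Toy data: five samples, two hypothetical SNPs.
toy = {
    "s1": {"snp1": "AG", "snp2": "CT"},
    "s2": {"snp1": "AA", "snp2": "TC"},
    "s3": {"snp1": "AA", "snp2": "CC"},
    "s4": {"snp1": "AA", "snp2": "CC"},
    "s5": {"snp1": "AA", "snp2": "CC"},
}
hits = rare_het_hits(toy, max_freq=0.25)
print(hits)
```

In the toy set only s1 is flagged, at snp1, because the AG genotype is seen once in five samples, while the CT genotype at snp2 is too common to count as rare.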

Don Conrad of Genomes Unzipped, whose raw data I recently co-opted into my project, also showed a couple of such tiny segments, despite coming out 100% European in an ADMIXTURE run. These were located on chromosomes 7 and 13, and marked by just four SNPs each. You can see them on his RHHmapper Chromosome Mosaic below.

Admittedly, these don't seem like much, but in the context of my analysis, with the particular samples and thresholds I used, they do look relatively unusual for someone of Northern European origin. Indeed, in Don's case, the SNPs they contain also show heterozygote genotypes that appear distinctly Sub-Saharan African. I checked their characteristics via the very handy tools at the SPSmart and dbSNP websites.

Just for comparison, below is another Genomes Unzipped Chromosome Mosaic. This one belongs to Daniel MacArthur, and it also shows a few hits. But these heterozygotes are spread around the genome in a fairly random way, and don't appear to form segments. So as far as I can tell, it's much harder to make a case for relatively recent non-European admixture in Daniel's case.

Some of my European project members have already received their Chromosome Mosaics, and I plan to send out many more in the coming weeks. I recommend that everyone checks out their results carefully, without jumping to conclusions. Look for hits that form sizeable segments, and indeed, much more sizeable than Don's. If you do find any, study the genotypes within these at SPSmart and dbSNP, as per above, to possibly characterize their biogeographic origins. Also, keep in mind that genetic diversity might play a role. For instance, Southern Europeans have a lot more genetic diversity than Northern Europeans, so they might show more hits on their Chromosome Mosaics.


We describe a novel approach for evaluating SNP genotypes of a genome-wide association scan to identify “ethnic outlier” subjects whose ethnicity is different or admixed compared to most other subjects in the genotyped sample set. Each ethnic outlier is detected by counting a genomic excess of “rare” heterozygotes and/or homozygotes whose frequencies are low (less than 1%) within genotypes of the sample set being evaluated. This method also enables simple and striking visualization of non-Caucasian chromosomal DNA segments interspersed within the chromosomes of ethnically admixed individuals. We show that this visualization of the mosaic structure of admixed human chromosomes gives results similar to another visualization method (SABER) but with much less computational time and burden. We also show that other methods for detecting ethnic outliers are enhanced by evaluating only genomic regions of visualized admixture rather than diluting outlier ancestry by evaluating the entire genome considered in aggregate. We have validated our method in the Wellcome Trust Case Control Consortium (WTCCC) study of 17,000 subjects as well as in HapMap subjects and simulated outliers of known ethnicity and admixture. The method's ability to precisely delineate chromosomal segments of non-Caucasian ethnicity has enabled us to demonstrate previously unreported non-Caucasian admixture in two HapMap Caucasian parents and in a number of WTCCC subjects. Its sensitive detection of ethnic outliers and simple visual discrimination of discrete chromosomal segments of different ethnicity implies that this method of rare heterozygotes and homozygotes (RHH) is likely to have diverse and important applications in humans and other species.

Ralph E. McGinnis, Visualizing Chromosome Mosaicism and Detecting Ethnic Outliers by the Method of “Rare” Heterozygotes and Homozygotes (RHH), Human Molecular Genetics, 2010, Vol. 19, No. 13, 2539–2553, doi:10.1093/hmg/ddq102


Don said...

Very interesting - thanks for sharing this!

I'm not aware of anything in my family history that would explain this, but you would have to go back quite a way to find evidence, I think. Using the crude analysis that 12/550K = 2.2 x 10^-5 "recent" African ancestry, and assuming that there was one person with 100% African ancestry in my pedigree, that admixture event must have occurred 15 or 16 generations ago. My records don't go back that far on either side of the family.
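The back-of-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the same crude model: it assumes a single fully African ancestor, and that an ancestor n generations back contributes (1/2)^n of the genome on average. The variable names are just for illustration.

```python
import math

flagged_snps = 12        # SNPs counted as "recent" African
total_snps = 550_000     # approximate number of SNPs on the chip

fraction = flagged_snps / total_snps   # about 2.2e-5

# Solve (1/2)**n = fraction for n, the number of generations back.
generations = math.log2(1 / fraction)

print(f"fraction    = {fraction:.2e}")
print(f"generations = {generations:.1f}")
```

The result falls between 15 and 16 generations. Keep in mind that DNA is inherited in chunks, so the real contribution from any one distant ancestor can be well above or below the average, or zero.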

However, I do have my parents' SNP data and would be keen to see if all the ancestry is inherited from one side of the family or the other - it would be more compelling from my viewpoint. Can you tell me the SNP IDs for the variants involved in these stretches?

Davidski said...

Hi Don, these are the SNPs that were flagged during two different runs...

rs17022588 3 29094697 CT
rs1404828 7 144239436 GT
rs1917664 7 144311616 AG
rs7778508 7 145534795 AG
rs6464744 7 145604377 CT
rs1886324 13 42782680 AG
rs17064703 13 43019084 CT
rs12017585 13 44202470 AG
rs7139821 13 44207025 GT

So there appear to be two very small segments of African origin there.
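For anyone who wants to check how flagged SNPs like these fall into segments, here's a small Python sketch that groups them by chromosome and position. The grouping rule is an assumption made for this example, not something RHHmapper is known to use: neighbouring hits are kept together if they're no more than 2 Mb apart, and runs with fewer than two SNPs are dropped. Different cutoffs will give different segments.

```python
from itertools import groupby

# (SNP ID, chromosome, position) as listed above.
flagged = [
    ("rs17022588", 3, 29094697),
    ("rs1404828", 7, 144239436),
    ("rs1917664", 7, 144311616),
    ("rs7778508", 7, 145534795),
    ("rs6464744", 7, 145604377),
    ("rs1886324", 13, 42782680),
    ("rs17064703", 13, 43019084),
    ("rs12017585", 13, 44202470),
    ("rs7139821", 13, 44207025),
]


def group_into_segments(snps, max_gap_bp, min_snps):
    """Return (chrom, start, end, n_snps) for each run of nearby hits."""
    segments = []

    def close(chrom, run):
        # Keep a run only if it has enough SNPs to count as a segment.
        if len(run) >= min_snps:
            segments.append((chrom, run[0][2], run[-1][2], len(run)))

    ordered = sorted(snps, key=lambda s: (s[1], s[2]))
    for chrom, group in groupby(ordered, key=lambda s: s[1]):
        run = []
        for snp in group:
            # Start a new run when the gap to the previous hit is too big.
            if run and snp[2] - run[-1][2] > max_gap_bp:
                close(chrom, run)
                run = []
            run.append(snp)
        close(chrom, run)
    return segments


segments = group_into_segments(flagged, max_gap_bp=2_000_000, min_snps=2)
for chrom, start, end, n in segments:
    print(f"chr{chrom}: {start}-{end} ({n} SNPs, {(end - start) / 1e6:.2f} Mb)")
```

With these particular cutoffs the lone hit on chromosome 3 is dropped, and the hits on chromosomes 7 and 13 each form one segment of four SNPs.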

If you mail me your parents' zip files I can put them in the next run. That could be an interesting experiment.


pconroy said...

Absolutely fascinating stuff!

David, did you get a chance to look at my data, or my parents? My father is showing an Assyrian relative on HIR Search - as am I - and he also has a bunch of Russian Jews, and a Chuvash segment, so I'd be very interested in knowing more...

Mike B said...

You can maybe try to use the software HAPMIX to "c) show where in the genome the admixture is found."

Davidski said...


Yes, I mailed you the Chromosome Mosaics and relevant SNPs of your parents. But I'll be running the tests again using different thresholds, so if you didn't get the first round results look out for the next set.

Mike B,

HAPMIX isn't as sensitive as RHHcounter, from what I've seen myself anyway.

Larry said...

Re: your latest release of Phased Haploblock Analysis.

1. In your range of SNPs data column labeled "START SNP inclusive" you have 0 values, but "rs0" does not exist. Should we really add 1 to this number to find the location of the start SNP in the list?

2. In your range of SNPs data column labeled "END SNP, exclusive", does this mean that the SNP is NOT in the match, that the actual SNP is located one position behind the position indicated, and that we should deduct 1 from the listed position to find the last SNP in the match?

Davidski said...

Hi Larry,

0 is the first SNP on the SNP list. So for Chromosome 1 that is rs3934834.

Also, the start SNP tags are always inclusive. What this means is that if a start SNP tag says 0, it's referring to the first SNP on the SNP list, which is rs3934834.

On the other hand, the end SNP tags are always exclusive. What this means is that when an end SNP tag for Chromosome 1 says, for example, 500, it's actually saying that the haplotype ends at SNP tag 499.

In other words, for Chromosome 1, if start = 0 and end = 500, then the haplotype runs from SNP tag 0 to SNP tag 499, or rs3934834 to rs4654492.

That's just how the output is presented, and I can't really change it. But I might number the SNPs next time to make things clearer.
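The inclusive start and exclusive end convention described above works the same way as list slicing in Python, so a short sketch may help. Only rs3934834 is a real ID here. The rest of the list is filled with made-up placeholder names, since the full SNP list isn't reproduced in this thread.

```python
# Placeholder SNP list for Chromosome 1: the first entry is the real
# first SNP, the others are invented names used only for illustration.
snp_list = ["rs3934834"] + [f"placeholder_{i}" for i in range(1, 600)]

start, end = 0, 500            # tags as they appear in the output

# Start is inclusive and end is exclusive, exactly like a Python slice.
haplotype = snp_list[start:end]

print(haplotype[0])            # SNP at tag 0
print(len(haplotype))          # number of SNPs in the haplotype
print(haplotype[-1] == snp_list[end - 1])  # last SNP sits at tag end - 1
```

So there's no need to add 1 to the start tag. To find the last SNP in the match, just subtract 1 from the end tag.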

Larry said...

Thanks for the explanation. That is what I first thought but was not sure.
I have posted my results as a frequency plot of the matches in two plots. First for all the matches and second for just the matches of 100 or less. You might be interested in the results. See my posted results here.

Larry said...


I am still analyzing your latest IBD data. The question I have is: what is the significance of the entries with a 53 at the end of the ID? Are they some kind of composite for the type?
I am finding all of them at the top of my matching list by counting matches in common.

Here are the results ranked by number of matches to me.
LT53 7

Larry said...

I found this article which may be of use to your project.

Using the same methodologies with which researchers constructed the HGDP-CEPH Human Genome Diversity Cell Line Panel, Pemberton and his colleagues assembled two subsets of individuals — dubbed HAP1161 and HAP1117 — that contain no known pairs of individuals with a first-degree relationship and no known pairs of individuals with a relationship closer than that of first cousins, respectively. Pemberton is now using these subsets in his investigations of genome-wide homozygosity across human populations worldwide.