Wednesday, July 22, 2015

Marker overlap and test accuracy

A few people have been asking me about the effects of marker overlap, or genotype rate, on test accuracy. Intuition says that the better the overlap, the more accurate the results, but this isn't strictly true. Here's what I've learned over the years:

- accuracy doesn't necessarily improve with higher marker overlap; it improves (up to a certain point) with more markers

- you will still see accurate results using as few as 25,000 SNPs, as long as the test doesn't suffer from any serious problems

- poorly designed tests, such as those based on fewer than 1,000 reference samples, always produce garbage results no matter what the marker overlap

In other words, a well designed test based on 200,000 SNPs will produce very accurate results for a genotype file with a marker overlap of 50%. On the other hand, another well designed test, based on just 50,000 SNPs, is likely to produce less accurate results for a genotype file with a marker overlap of 100%.
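The arithmetic behind that comparison can be sketched in a few lines of R. This is only an illustration: it assumes that the number of usable markers is simply the size of the test's SNP set multiplied by the overlap, and the function name `usable_markers` is made up for this example.

```r
# Usable markers = SNPs in the test x fraction of them present in the genotype file.
# (Assumed simplification; usable_markers is a hypothetical helper name.)
usable_markers <- function(test_snps, overlap) {
  stopifnot(overlap >= 0, overlap <= 1)
  test_snps * overlap
}

large_test <- usable_markers(200000, 0.50)  # 200,000-SNP test, 50% overlap
small_test <- usable_markers(50000, 1.00)   # 50,000-SNP test, 100% overlap

# format() avoids scientific notation such as 1e+05 in the printed output.
cat("Large test:", format(large_test, big.mark = ",", scientific = FALSE), "usable SNPs\n")
cat("Small test:", format(small_test, big.mark = ",", scientific = FALSE), "usable SNPs\n")
```

So the larger test still works with 100,000 markers at 50% overlap, twice as many as the smaller test has at 100% overlap.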

So how can you tell a well designed test from a poorly designed one? It's easy: just have a look at the results it produces for people with less complex ancestry. For instance, ask a Lithuanian, Swede or Pole what they're seeing at the top of their oracles. Is the Swede seeing Swedish or, say, German? If the answer is German rather than Swedish, or at least some type of Scandinavian, then the test is garbage and best ignored.

By the way, the recent Allentoft et al. paper on the ancient genomics of Eurasia includes a useful discussion on the effects of missing markers on the accuracy of both ADMIXTURE and PCA results. Refer to section 6.2 in the freely available supplementary info PDF here.

Tuesday, May 12, 2015

4mix: four-way mixture modeling in R

Thanks to Eurogenes project member DESEUK1, a zip file with the R script, instructions and a couple of data sheets is available here.

So let's model Poles as a mixture of ancient genomes from Central and Eastern Europe, using output from my K8 analysis.

Copy & Paste: source('4mix.r')


Copy & Paste: getMix('K8avg.csv', 'target.txt', 'HungaryGamba_EN', 'HungaryGamba_HG', 'Karelia_HG', 'Corded_Ware_LN')


After a few seconds you should see the results...

Target = 19% HungaryGamba_EN + 14% HungaryGamba_HG + 2% Karelia_HG + 65% Corded_Ware_LN @ D = 0.0062

Obviously the script can use ancestry proportions and/or population averages from any test, provided they're formatted properly. The accuracy of the modeling will depend on the quality of the input.
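For anyone curious how a four-way fit of this kind can be computed, here is a minimal sketch in R. It is not the 4mix script. It assumes a brute-force search over whole-percentage mixtures and assumes that D is the Euclidean distance between the modeled and observed proportions. The function name `fit_four_way`, the population labels and all the numbers are invented for illustration and don't come from K8avg.csv.

```r
# Toy reference sheet: rows are source populations, columns are ancestry
# components. All values are invented for illustration.
refs <- rbind(
  SourceA = c(0.70, 0.10, 0.10, 0.05, 0.05),
  SourceB = c(0.10, 0.70, 0.10, 0.05, 0.05),
  SourceC = c(0.05, 0.10, 0.70, 0.10, 0.05),
  SourceD = c(0.10, 0.10, 0.10, 0.60, 0.10)
)

# Build a target that is exactly 20% A + 10% B + 5% C + 65% D,
# so we know what the search should recover.
target <- as.vector(c(0.20, 0.10, 0.05, 0.65) %*% refs)

# Hypothetical helper: try every whole-percentage four-way mixture and keep
# the one closest to the target (Euclidean distance, assumed here).
fit_four_way <- function(target, refs, step = 1) {
  stopifnot(nrow(refs) == 4, length(target) == ncol(refs))
  pct <- seq(0, 100, by = step)
  grid <- expand.grid(p1 = pct, p2 = pct, p3 = pct)
  grid <- grid[rowSums(grid) <= 100, ]      # first three shares can't exceed 100%
  grid$p4 <- 100 - rowSums(grid)            # fourth share takes the remainder
  pct_matrix <- as.matrix(grid)
  mixes <- (pct_matrix / 100) %*% refs      # modeled proportions for each mixture
  distances <- sqrt(rowSums(sweep(mixes, 2, target)^2))
  best <- unname(which.min(distances))
  percent <- pct_matrix[best, ]
  names(percent) <- rownames(refs)
  list(percent = percent, distance = unname(distances[best]))
}

fit <- fit_four_way(target, refs)
cat(sprintf("Target = %s @ D = %.4f\n",
            paste0(fit$percent, "% ", names(fit$percent), collapse = " + "),
            fit$distance))
```

Because the toy target was built from a known 20/10/5/65 split, the search recovers that split with a distance of essentially zero. With real data the distance will be larger, and the sketch only gives a unique answer when there are more ancestry components than source populations.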

Update 19/05/2015: A new version of the 4mix script that can run multiple targets is available here, courtesy of Open Genomes.