Tuesday, March 22, 2011

Reconstructing the Ancestral North Indian (ANI) genome


Back in 2009, Reich et al. theorized that the current South Asian gene pool was basically made up of two founding genetic components: Ancestral North Indian (ANI) and Ancestral South Indian (ASI). The distilled ANI, they noted, was more similar to the genomes of modern Northwest Europeans than to those of the Adygei from the Caucasus. This is obviously out of whack with geography, but it does make sense based on what I've seen in my experiments on the Pakistani samples from the HGDP. Many of them, especially the Pathans, carry numerous segments, or haploblocks, that basically look North European. This gave me the idea to try to reconstruct the ANI genome from such fragments. The first chromosome of my composite sample, which I call the "ANI composite", is available for download here. It's a PLINK PED file in Illumina AB format with 19,261 SNPs.

Here are five MDS plots featuring the putative "ANI composite" (dimensions 1&2, 1&3, 1&4, 1&5 and 1&6), obviously NOT INCLUDING the HGDP samples used to make it (see below). Overall, it seems to most closely resemble my reference samples from Hungary, Belarus, Russia and Lithuania. I have to admit that I was very pleased to see it behaving like a set of genotypes from a real human subject across many dimensions of genetic variation. MDS analyses are very sensitive to anomalies, such as unusually long runs of homozygosity, so the fact that my composite can pass for a normal sample on these plots is fantastic.












So how did I do this? Well, it wasn't very difficult, but a bit tedious, so I need a break before continuing. I used information from my earlier experiments with ADMIXMAP, HAPMIX and RHH Counter to locate and delineate North European-like segments in phased Pakistani HGDP samples. I phased the data myself with BEAGLE, in a pool of South Asian and Middle Eastern samples, so as not to bias the results of phasing and imputation towards Northern Europe. In order to keep the alleles in phase when loaded into PLINK, I duplicated the haplotypes, basically producing completely homozygous individuals out of each one. Then I created an ANI composite dummy with 100% no calls, and loaded the haplotypes into this sample with a Python script. The first to load were the Pathan haplotypes, followed by the Burusho. I chose individuals from these two groups to make up the backbone of the putative ANI genome because they always seem to come out most "North European" in my ADMIXTURE and MDS runs compared to other South Asians. The empty spaces were filled with haplotypes from the Brahui and Balochi. Below is a list of all the samples used:

Pathan HGDP00213
Pathan HGDP00214
Pathan HGDP00218
Pathan HGDP00224
Pathan HGDP00241
Pathan HGDP00243
Pathan HGDP00254
Pathan HGDP00258
Pathan HGDP00259
Pathan HGDP00262
Pathan HGDP00264

Burusho HGDP00338
Burusho HGDP00356
Burusho HGDP00364
Burusho HGDP00382
Burusho HGDP00392
Burusho HGDP00412
Burusho HGDP00417
Burusho HGDP00423
Burusho HGDP00428
Burusho HGDP00433

Brahui HGDP00007
Brahui HGDP00009
Brahui HGDP00017
Brahui HGDP00041
Brahui HGDP00047

Balochi HGDP00054
Balochi HGDP00058
Balochi HGDP00062
Balochi HGDP00072
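
The loading procedure described above can be sketched roughly as follows. This is a minimal illustration and not the script actually used for the composite: the function names, the half-open segment ranges, and the rule that a donor segment is only placed on a composite strand that is still completely empty over that range are all assumptions made for the example. Alleles are in Illumina AB coding, with "0" as the no-call code.

```python
from typing import List, Sequence, Tuple

MISSING = "0"  # no-call allele code used in PLINK PED files

# Hypothetical representation: a segment is a half-open range of SNP indices.
Segment = Tuple[int, int]


def duplicate_haplotype(haplotype: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn one phased haplotype into a fully homozygous genotype row,
    so the alleles stay in phase when the data is loaded into PLINK."""
    return [(allele, allele) for allele in haplotype]


def empty_composite(n_snps: int) -> List[List[str]]:
    """Create a dummy diploid sample with 100% no-calls (two empty strands)."""
    return [[MISSING] * n_snps, [MISSING] * n_snps]


def load_segment(composite: List[List[str]],
                 haplotype: Sequence[str],
                 segment: Segment) -> bool:
    """Copy a donor segment onto the first strand that is still empty
    over the whole range (an assumed rule). Return True if it was placed."""
    start, end = segment
    for strand in composite:
        if all(allele == MISSING for allele in strand[start:end]):
            strand[start:end] = list(haplotype[start:end])
            return True
    return False


def to_ped_line(sample_id: str, composite: List[List[str]]) -> str:
    """Format the composite as one PED line. A genotype with only one
    allele filled is written as a full no-call, to avoid half-missing calls."""
    calls = []
    for a, b in zip(composite[0], composite[1]):
        if MISSING in (a, b):
            calls.append(f"{MISSING} {MISSING}")
        else:
            calls.append(f"{a} {b}")
    # Columns: family ID, individual ID, father, mother, sex, phenotype
    return f"{sample_id} {sample_id} 0 0 0 -9 " + " ".join(calls)
```

Calling `load_segment` in priority order (Pathan segments first, then Burusho, then Brahui and Balochi for whatever is still empty) mirrors the order described above, since later donors can only fill ranges the earlier ones left untouched.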


The phased data and the "ANI" haplotypes used in this experiment are available on request from eurogenesblog [at] hotmail [dot] com. I welcome feedback and suggestions on how to improve my methodology. Admittedly, this was a test run, so it's unlikely to be perfect.


Wednesday, March 2, 2011

Fine scale analysis of Eurogenes' Anatolian & European Turks


This is a supervised run with the new ADMIXTURE 1.1, featuring project members from all over Turkey. What does "supervised" mean? It means that the origins of the reference samples have already been flagged by me, and the program then works to match the test samples to these references.

I didn't use a Turkish reference set on purpose, because I already knew that the test samples were Turks, and wanted to see what kind of near and far populations potentially contributed to the modern Anatolian gene pool. But I did include reference Uzbeks and Uygurs for comparison, and it's fascinating to note the high "Central-Eastern European" membership in these samples. These results were obtained with Hungarians and Romanians as proxies for that component, but other samples from the region, like Poles, work just as well.





Key: Red = Armenian + Georgian (Anatolian & Caucasus), Orange = Cypriot + Greek (Southeastern European), Light Green = North Kannadi + Sakilli + Selected Gujarati (South Asian), Green = Hungarian + Romanian (Central-Eastern European), Aqua = Han Chinese (East Asian), Blue = Iranian, Dark Blue = Mandenka + Bantu (Sub-Saharan African), Purple = Palestinian + Saudi Arabian (Arabian), Pink = Evenk + Nganassan + Yakut (Siberian). See spreadsheet for details.
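
For readers who want to set up a similar supervised run, here is a minimal sketch of how the reference flags can be prepared. To my understanding of the ADMIXTURE manual, supervised mode reads a `.pop` file with the same base name as the input, holding one line per individual in the same order as the `.fam` file: a population label for each reference sample and `-` for each test sample. The function name and the sample IDs below (`ARM01`, `HAN01`) are hypothetical, and the label spellings are made up for the example.

```python
from typing import Dict, Iterable, List


def build_pop_lines(fam_lines: Iterable[str],
                    reference_labels: Dict[str, str]) -> List[str]:
    """Return one .pop entry per individual, in .fam order.

    reference_labels maps an individual ID (second .fam column) to its
    cluster label. Individuals not in the map are treated as test
    samples and flagged with '-'.
    """
    pop_lines = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        individual_id = fields[1]
        pop_lines.append(reference_labels.get(individual_id, "-"))
    return pop_lines


# Hypothetical example: two reference samples and one test sample.
example_fam = [
    "F1 ARM01 0 0 0 -9",
    "F2 TR3 0 0 0 -9",
    "F3 HAN01 0 0 1 -9",
]
example_labels = {
    "ARM01": "Armenian_Georgian",
    "HAN01": "Han_Chinese",
}
example_pop = build_pop_lines(example_fam, example_labels)
# Writing "\n".join(example_pop) to data.pop next to data.bed, the run
# would then be started with something like:
#   admixture --supervised data.bed 9
# where 9 is the number of reference clusters listed in the key.
```

The labels themselves are arbitrary, as long as every sample in the same reference cluster carries the same one.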

Update: I'm getting a lot of comments in a forum discussion about the variable spread of some of these components in the test samples. It's really difficult to explain why one Turkish sample might show 0% membership in a cluster, while another shows more than 20%. But I just ran an MDS of various sets to see whether some of these patterns could be reproduced visually, and yes, they can. Note that TR3, who comes out 100% red (Armenian + Georgian), clusters with Armenians and Georgians on these plots. On the other hand, TR10 shows no blue (Iranian) in the ADMIXTURE analysis, and backs that up with her behavior on the MDS plots, by pulling away from the Iranians towards the Uzbeks on one plot, and Pathans on another.

MDS 1&2
MDS 1&3
MDS 1&4