Eurogenes Genetic Ancestry Project: 2018

Saturday, August 25, 2018

Global25 workshop 3: genes vs geography in Northern Europe

To produce the intra-North European Principal Components Analysis (PCA) plot below, download this datasheet, plug it into the PAST program, which is freely available here, then select all of the columns by clicking on the empty tab above the labels, and choose Multivariate > Ordination > Principal Components or Discriminant Analysis. This is what you should end up with...

I'd say that the result more or less resembles a geographic map of Northern Europe. Of course, if you're in the possession of your own personal Global25 coordinates, you can add yourself to this plot to check whether your position matches your geographic origin.

Please keep in mind, however, that the vast majority (>90%) of your ancestry must be from north of the Alps, Balkans and Pyrenees to obtain a sensible outcome. Also please ensure that all of the columns in the datasheet are filled out correctly, including the group column, otherwise your position on the plot will be skewed.

See also...

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 2: intra-European variation

Global25 workshop 3: genes vs geography in Northern Europe

Global25 workshop 4: a neighbour joining tree

Modeling genetic ancestry with Davidski: step by step

Getting the most out of the Global25

Genetic ancestry online store (to be updated regularly)

Sunday, August 12, 2018

Global25 workshop 2: intra-European variation

Even though the Global25 focuses on world-wide human genetic diversity, it can also reveal a lot of information about genetic substructures within continental regions.

Several of the dimensions, for instance, reflect Balto-Slavic-specific genetic drift. I ensured that this would be the case by running a lot of Slavic groups in the analysis. A useful by-product of this strategy is that the Global25 is very good at exposing relatively recent intra-European genetic variation.

To see this for yourself, download the datasheet below and plug it into the PAST program, which is freely available here. Then select all of the columns by clicking on the empty tab above the labels, and choose Multivariate > Ordination > Principal Components.

G25_Europe_scaled.dat

You should end up with the plot below. Note that to see the group labels and outlines, you need to tick the appropriate boxes in the panel to the right of the image. To improve the experience, it might also be useful to color-code different parts of Europe, and you can do that by choosing Edit > Row colors/symbols. Of course, if you have Global25 coordinates you can add yourself to the datasheet to see where you plot.

Components 1 and 2 pack the most information and, more or less, recapitulate the geographic structure of Europe. However, many details can only be seen by plotting the less significant components. For instance, a plot of components 1 and 3 almost perfectly separates Northeastern Europe into two distinct clusters made up of the speakers of Indo-European and Finno-Ugric languages.

This plot might also be useful for exploring potential Jewish ancestry, because Ashkenazi, Italian and Sephardi Jews appear to be relatively distinct in this space. Thus, people with significant European Jewish ancestry will "pull" towards the lower left corner of the plot. For example, someone who is half Ashkenazi and half German will probably land in the empty space between the Northwest Europeans and Jews.

See also...

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 3: genes vs geography in Northern Europe

Global25 workshop 4: a neighbour joining tree

Modeling genetic ancestry with Davidski: step by step

Getting the most out of the Global25

Genetic ancestry online store (to be updated regularly)

Global25 workshop 1: that classic West Eurasian plot

In this Global25 workshop I'm going to show you how to reproduce the classic plot of West Eurasian genetic diversity seen regularly in ancient DNA papers and at this blog (for instance, here). To do this you'll need the datasheet below, which I'll be updating regularly, and the PAST program, which is freely available here.

G25_West_Eurasia_scaled.dat

Download the datasheet, plug it into PAST, select all of the columns by clicking on the empty cell above the labels, and go to Multivariate > Ordination > Principal Components. Here's a screen cap of me doing it:

This is what you should end up with. Please note that I also ticked the "convex hulls" box to define the populations from the "group" column in the datasheet.

Here I also ticked the "group labels" box. It's generally a useful feature, even though it makes a mess of the plot in this case due to the large number of populations.

Monday, March 19, 2018

If you're using my tools to find Jewish ancestry please read this

It's come to my attention that many people are still using the Jtest and taking the results very seriously. Indeed, perhaps too seriously.

Also, some users are doing weird stuff with the Jtest output in an attempt to estimate their supposedly "true" Ashkenazi ancestry proportions, like multiplying their Ashkenazi coefficient by three, because Ashkenazi Jews "only" score around 30% Ashkenazi in this test. Ouch! Please don't do that!

Let me reiterate that this test was only supposed to be a fun experiment. It was never meant to be the definitive online Ashkenazi ancestry test. And even as fun experiments with ADMIXTURE go, it's now horribly outdated, and probably useless for anyone with less than 15-20% Ashkenazi ancestry.

So it might be time to move on. If you really want to confirm your Jewish ancestry, either or both Ashkenazi and Sephardi, then you need to look at much more powerful and sophisticated options. One of these options is the Global25 analysis (see HERE), which can pick up minor Jewish ancestry of just a few per cent. But it's not free (USD $12), and it's a DIY test that requires a bit of time and effort to get the most out of it. Also, you'd need to send me your autosomal file so that I can estimate your Global25 coordinates. But I can help you get started and even quickly check if you have any hope at all of confirming Jewish ancestry.

If, for whatever reason, you'd rather not take advantage of the Global25 offer, because, say, you don't want to share your data with me, then it might be an idea to join the Anthrogenica discussion board and ask the experienced members there about other options [LINK].

In any case, whatever you choose to do, please remember the following points, and feel free to share them with others who are still using the Jtest:

- do not multiply your Jtest Ashkenazi score by 3 in an attempt to find your "true" Ashkenazi ancestry proportion, because this won't work for the vast majority of users

- but do compare your Jtest Ashkenazi score to those of other people of the same or very similar ancestry to yours to get a rough idea whether you might have any Ashkenazi ancestry (the Jtest population averages will be useful for this, see here)

- if you're still not sure what your Jtest results mean, then just focus on your Jtest Oracle-4 output at GEDmatch, and if you don't see AJ at the top of the oracle list, then this is a strong signal that you don't have substantial Ashkenazi ancestry

Thursday, February 15, 2018

Modeling genetic ancestry with Davidski: step by step

There are many different ways to model your genetic ancestry. I prefer the Global25/nMonte method (see here). This is a step by step guide to modeling ancient ancestry proportions with this simple but powerful method using my own genome.

As far as I know, the vast majority of my recent ancestors came from the northern half of Europe. This may or may not be correct, but it gives me somewhere to start, so that I can come up with a coherent model. If you don't have this sort of information, because, perhaps, you were adopted, then just look in the mirror, and work from there. Like I say, it's not imperative that you know anything whatsoever about your ancestry, because your genetic data will do the talking, but you do need a model when modeling.

In scientific literature nowadays Northern Europeans are often described as a three-way mixture between Yamnaya-related pastoralists, Anatolian-derived early farmers, and Western European Hunter-Gatherers (WHG). So let's see if this model works for me. Obviously, if it does, then it'll confirm the information that I have about my origins, but it might also reveal finer details that I'm not aware of. The datasheet that I'm using for this model is available here.

[1] distance%=6.9025 / distance=0.069025

Davidski

Yamnaya_Samara 53.9
Barcin_N 30.75
Rochedane 15.35
Tepecik_Ciftlik_N 0

Yep, the model does work, with a fairly reasonable distance of almost 7%. The ancestry proportions more or less match those from scientific literature and the plethora of analyses that I've featured at this blog on the topic. Please note that I've kept things very simple, using only four reference populations and individuals as proxies for four distinct streams of ancestry. But I've put my own twist on this Neolithic/Bronze Age model by including two populations from Neolithic Anatolia (Barcin_N and Tepecik_Ciftlik_N), just to see what would happen. The WHG proxy is Rochedane.

Admittedly, though, my Yamnaya cut of ancestry appears somewhat bloated at over 53%, and the model's distance is a little higher than what I normally see for really strong models. So let's check if I can get a better fitting and more sensible result by adding a slightly more easterly forager proxy than Rochedane: Narva_Lithuania.

[1] distance%=5.9331 / distance=0.059331

Davidski

Yamnaya_Samara 45.75
Barcin_N 31.45
Narva_Lithuania 22.8
Rochedane 0
Tepecik_Ciftlik_N 0

The statistical fit does improve, and when given a choice between Rochedane and Narva_Lithuania, the algorithm picks the latter as the only source of extra forager input in my genome.

What could this mean? It might mean that a large part of my ancestry derives from the Baltic region. Actually, I know for a fact that this is true. But even if I had no idea about my genealogy, this result would be a very strong hint about my genetic origins. Indeed, let's follow this trail and try to further improve the fit of the model by adding a more relevant Yamnaya-related proxy, such as early Baltic Corded Ware (CWC_Baltic_early).

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Holy shit! To be honest, I wasn't expecting this sort of resolution and accuracy, and I can't promise that everyone using the Global25/nMonte method will see such incredibly nuanced outcomes, but this isn't a fluke. It can't be, because it gels so well with everything that I know about my ancestry. Please note also that I belong to Y-chromosome haplogroup R1a-M417, which is a lineage intimately associated with the Corded Ware expansion across Northern Europe (for instance, see here).

But of course, the Baltic and nearby regions haven't been isolated from migrations and invasions since the Corded Ware times. For instance, at some point, probably during the Bronze Age, Uralic-speaking groups moved west across the forest zone of Northeastern Europe and into the East Baltic and northern Scandinavia. It's generally accepted that they brought Siberian admixture with them (see here). Moreover, from the Iron Age to the Middle Ages, East Central Europe was under intense pressure from a wide range of nomadic steppe groups with complex ancestry, such as the Sarmatians, Avars, Huns, and Mongolians. Did any of these peoples leave their mark on my genome? At the risk of overfitting the model, let's explore this possibility by adding a few more reference populations.

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Han 0
Mongolian 0
Nganassan 0
Rochedane 0
Sarmatian_Pokrovka 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Nothing changes when I add the Han Chinese, Mongolians, Nganassans (a Uralic group from Siberia), and Sarmatians to the model. But what about if I throw in the only ancient Slav in my datasheet?

[1] distance%=2.9904 / distance=0.029904

Davidski

Slav_Bohemia 85.9
CWC_Baltic_early 7.7
Narva_Lithuania 6.4
Barcin_N 0
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Considering that the vast majority of my recent ancestors were Poles, thus a Slavic-speaking people from near the Baltic, this outcome makes perfect sense. And check out the new distance! But the problem now is that I'm overfitting the model by using two very similar and probably very closely related references, CWC_Baltic_early and Slav_Bohemia. And overfitting should be avoided at all costs. So it might be useful to break up this effort into two models: one focusing on the Neolithic and Bronze Age, and the other on the Iron Age and Middle Ages. I'll do that soon, but not just yet, because there are still too few Iron Age and Medieval samples available from the Baltic region and surrounds for meaningful analyses of this type.

See also...

Global25 workshop 1: that classic West Eurasian plot

Global25 workshop 2: intra-European variation

Global25 workshop 3: genes vs geography in Northern Europe

Getting the most out of the Global25

Genetic ancestry online store (to be updated regularly)

Search This Blog