## Tuesday, January 16, 2007

### The scientific basis for race

Steve Hsu:

Suppose that the human genome has 30,000 distinct genes, which we will label as i = 1, 2, ..., N, where N = 30k. Next, suppose that there are n_i variants or alleles (mutations) of the i-th gene. Then each human's genetic information can be described as a point on a lattice of size n_1 x n_2 x n_3 x ... x n_N, or equivalently as an N-tuple of integers whose i-th entry ranges from 1 to n_i. For the simplified case where there are exactly 10 variants of each gene, the number of points in this N-dimensional space is 10^N or 10^{30k}, one for each 30k-digit string. It's a space of very high dimension, but this doesn't stop us from defining a metric, that is, a notion of distance between any two points in the space. (For simplicity we ignore restrictions on this space which might result from incompatibility of certain combinations, etc.)

Note that the genomes of all of the humans who have ever lived occupy only a small subset of this space -- most possible variations have never been realized. For this reason, the surprise expressed by biologists that humans have so few genes (not many more than a worm, and far fewer than the 100k of earlier estimates) is no cause for concern: the number of possible organisms that might result from 30k genes is enormous, far more than the number of molecules in the visible universe.
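To get a feel for the size of this space, here is a minimal Python sketch of the counting argument under the simplified assumption of exactly 10 variants per gene. The function names are hypothetical, and the figure of roughly 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate that is assumed here, not taken from the text above.

```python
import math

def lattice_size(allele_counts):
    """Exact number of lattice points: n_1 * n_2 * ... * n_N."""
    return math.prod(allele_counts)

def log10_lattice_size(allele_counts):
    """Base-10 logarithm of the lattice size, computed as a sum of logs
    so that the result stays a manageable floating-point number."""
    return sum(math.log10(n) for n in allele_counts)

N = 30_000            # number of genes, as in the text
counts = [10] * N     # simplified case: exactly 10 variants per gene

# Exponent of the lattice size, i.e. the "30k" in 10^{30k}.
print(log10_lattice_size(counts))

# Assumption: ~10^80 atoms in the observable universe (rough estimate).
ATOMS_EXPONENT = 80
print(lattice_size(counts) > 10 ** ATOMS_EXPONENT)
```

Working with the logarithm avoids having to print a 30,001-digit integer, while the exact integer comparison still works because Python integers have arbitrary precision.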

To define a metric, we need a notion of how far apart two different alleles are. We can do this by counting base pair differences -- most mutations only alter a few base pairs in the genetic code. We can define the distance between two alleles as the number of base pair changes between them (this is a positive number for any two distinct alleles, and zero for identical ones). Then we can define the distance between two genomes as the sum, over i = 1, 2, ..., N, of the individual gene distances. It is natural, although perhaps not always possible, to choose the labeling of the alleles of each gene to reflect relative distances, so that variants 1 and 2 are close together and both are very far from variant 10.
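The construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: each allele is represented as a string of bases, the two alleles of a given gene are assumed to have equal length (substitutions only, no insertions or deletions), and the function names and toy sequences are hypothetical.

```python
def allele_distance(a: str, b: str) -> int:
    """Number of base-pair positions at which two alleles differ
    (Hamming distance). Assumes equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length alleles")
    return sum(x != y for x, y in zip(a, b))

def genome_distance(g1: list[str], g2: list[str]) -> int:
    """Sum of the per-gene allele distances over all N genes."""
    if len(g1) != len(g2):
        raise ValueError("genomes must list the same genes in the same order")
    return sum(allele_distance(a, b) for a, b in zip(g1, g2))

# Toy genomes with N = 3 genes of 4 bases each (made-up data).
genome_a = ["ACGT", "GGCA", "TTAG"]
genome_b = ["ACGA", "GGCA", "TAAG"]
genome_c = ["TCGA", "GGTT", "TAAC"]

print(genome_distance(genome_a, genome_b))  # 2
print(genome_distance(genome_b, genome_c))  # 4
print(genome_distance(genome_a, genome_c))  # 6
```

Because each per-gene term is a Hamming distance, the sum is zero only for identical genomes, is symmetric, and satisfies the triangle inequality, so it is a metric in the usual sense. In the toy data the distance from a to c (6) does not exceed the distance from a to b plus the distance from b to c (2 + 4).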

The exact definition of the metric and the allele labeling are somewhat arbitrary, but you can see it is easy to define a meaningful measure of how far apart any two individuals are in genome space.

Now plot the genome of each human as a point on our lattice. Not surprisingly, there are readily identifiable clusters of points, corresponding to traditional continental ethnic groups: Europeans, Africans, Asians, Native Americans, etc. (See, for example, Risch et al., Am. J. Hum. Genet. 76:268–275, 2005.) Of course, we can get into endless arguments about how we define European or Asian, and of course there is substructure within the clusters, but it is rather obvious that there are identifiable groupings, and as the Risch study shows, they correspond very well to self-identified notions of race.

Race: the current consensus