G-indexing in VCF files

PUBLISHED ON JUL 30, 2018 — BIOINFORMATICS, GENETICS, VCF

The gnomAD VCF files report the observed genotype counts, e.g., “GC_EAS” gives the genotype counts for individuals of East Asian descent as a comma-separated string. The order of the counts is determined by genotype-indexing. The above link gives an excellent discussion of computing the genotype index in the general case, for any ploidy. Below is a brief discussion of the haploid and diploid indices.

Genotype-indexing works by first indexing the reference and alternative alleles, starting with the reference allele at 0. The order of the alternative alleles is the order in which they are listed in the ALT field of the VCF. Consider the following variant: A/B,C,D, where A is the reference allele and B, C, and D are the ordered alternative alleles. The allele indices are then 0, 1, 2, and 3, respectively.
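As a minimal sketch of the allele-indexing step, assuming the REF and ALT values have already been read from the VCF record (the variable names below are illustrative and not part of any VCF library):

```r
# Hypothetical REF and ALT values for the example variant A/B,C,D
ref <- "A"
alt_field <- "B,C,D"  # ALT column as it appears in the VCF

# Split ALT on commas, keeping the listed order
alt <- strsplit(alt_field, ",", fixed = TRUE)[[1]]

# The reference allele gets index 0; ALT alleles follow as 1, 2, ...
alleles <- c(ref, alt)
allele_index <- setNames(seq_along(alleles) - 1L, alleles)
allele_index
## A B C D
## 0 1 2 3
```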

For the haploid case, the genotype index equals the allele index, because there is only one allele per individual: G(a_i) = a_i, where a_i is the allele index. For the diploid case, the equation is G(a_i, a_j) = a_i + choose(a_j + 1, 2), where a_i and a_j are the allele indices, a_i <= a_j, and choose(n, k) is the binomial coefficient.
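One way to see where the choose term comes from, sketched as a short derivation (the summation index k is introduced here only for illustration): the genotypes whose larger allele index is k are 0/k, 1/k, ..., k/k, which makes k + 1 genotypes. Every genotype whose larger index is below a_j comes first, and a_i is the offset within the a_j block:

```latex
\begin{aligned}
G(a_i, a_j) &= a_i + \sum_{k=0}^{a_j - 1} (k + 1) \\
            &= a_i + \frac{a_j (a_j + 1)}{2} \\
            &= a_i + \binom{a_j + 1}{2}, \qquad a_i \le a_j
\end{aligned}
```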

A simple R function to calculate the diploid G-index based on the 0-based allele indices:

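# Diploid genotype index (G-index) from 0-based allele indices, i <= j
G <- function(i, j) {
  # both indices must be whole numbers, with i no larger than j
  stopifnot(i <= j && i %% 1 == 0 && j %% 1 == 0)
  i + choose(j + 1, 2)
}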
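
Genotypes for the example variant A/B,C,D (rows give the first allele, columns the second):

     A    B    C    D
A  A/A  A/B  A/C  A/D
B       B/B  B/C  B/D
C            C/C  C/D
D                 D/D

The corresponding G-indices:

    A   B   C   D
A  00  01  03  06
B      02  04  07
C          05  08
D              09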

Reading the indices in increasing order gives the final genotype ordering in the VCF: [A/A, A/B, B/B, A/C, B/C, C/C, A/D, B/D, C/D, D/D]. Because each additional allele only appends genotypes to the end of the list, the index of an existing genotype stays the same regardless of the number of alleles.
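The ordering, and the fact that it does not shift as alleles are added, can be checked with a short sketch. The helper names `g_index` and `genotype_order` are hypothetical, introduced only for this illustration:

```r
# Diploid G-index from 0-based allele indices, with i <= j
g_index <- function(i, j) i + choose(j + 1, 2)

# List all diploid genotypes for a vector of alleles, sorted by G-index
genotype_order <- function(alleles) {
  n <- length(alleles)
  # all (i, j) pairs of 0-based allele indices with i <= j
  pairs <- expand.grid(i = seq_len(n) - 1, j = seq_len(n) - 1)
  pairs <- pairs[pairs$i <= pairs$j, ]
  pairs <- pairs[order(g_index(pairs$i, pairs$j)), ]
  paste(alleles[pairs$i + 1], alleles[pairs$j + 1], sep = "/")
}

four <- genotype_order(c("A", "B", "C", "D"))
three <- genotype_order(c("A", "B", "C"))

four
##  [1] "A/A" "A/B" "B/B" "A/C" "B/C" "C/C" "A/D" "B/D" "C/D" "D/D"

# the three-allele ordering is a prefix of the four-allele ordering
identical(three, four[seq_along(three)])
## [1] TRUE
```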

This matches the G function given above. Consider the B/D genotype (allele indices 1 and 3), which has a G-index of 7:

G(1, 3)
## [1] 7
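
To tie this back to the genotype counts, here is a rough sketch of labelling a genotype-count string with its genotypes. The GC_EAS value and the alleles below are made up for illustration and do not come from a real gnomAD record:

```r
# Hypothetical GC_EAS value for a site with one REF and two ALT alleles
gc_eas <- "100,20,3,5,1,0"
alleles <- c("A", "B", "C")  # REF first, then ALT in listed order

# Parse the comma-separated counts
counts <- as.integer(strsplit(gc_eas, ",", fixed = TRUE)[[1]])

# Build genotype labels in G-index order: for each j, all i from 0 to j
n <- length(alleles)
j <- rep(seq_len(n) - 1L, times = seq_len(n))  # 0, 1, 1, 2, 2, 2
i <- sequence(seq_len(n)) - 1L                 # 0, 0, 1, 0, 1, 2
labels <- paste(alleles[i + 1L], alleles[j + 1L], sep = "/")

# Sanity check: the genotype at 0-based position k has G-index k
stopifnot(all(i + choose(j + 1, 2) == seq_along(counts) - 1))

names(counts) <- labels
counts
## A/A A/B B/B A/C B/C C/C
## 100  20   3   5   1   0
```

Generating the labels block by block (all i for j = 0, then j = 1, and so on) avoids sorting, since that is exactly the order the G-index produces.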