The zalpha package contains statistics for identifying areas of the genome that have undergone a selective sweep. The idea behind these statistics is to find areas of the genome that are highly correlated, as this can be a sign that a sweep has occurred recently in the vicinity. For more information on the statistics, please see the paper by Jacobs et al. (2016)[1] referenced below.
Here we have a dataset containing five humans with a single pair of chromosomes each. The chromosome has 20 SNPs, at base pair 100, 200, 300, …, 2000.
Person 1 | Chromosome A | AGGAAGGGATACGGTTATAC |
Chromosome B | CGGAACTGATCAGCTCAGGG | |
Person 2 | Chromosome A | CGGTACTGTCACGGGCATGG |
Chromosome B | ATGTAGGGTCCAGCTCTGAC | |
Person 3 | Chromosome A | ATGTCGGCATCCAGGCAGAC |
Chromosome B | ATGAACGCATCAACTTTGAG | |
Person 4 | Chromosome A | CGGTCGGCTCCAGCTTTTGG |
Chromosome B | CTGTCCGCTCCCGGGTTTGC | |
Person 5 | Chromosome A | CGCAACGGACACGCGCATGC |
Chromosome B | CTGACGTCACCCAGTTTGAG |
For this simple example all that is needed is:
The vector of SNP locations
The matrix of SNP values. This could be in ACGT format as above, or in 0 and 1 notation, or any other notation as long as SNPs are biallelic. Data extracted from a PLINK .tped file is in the ideal format for this analysis. Note the genetic data is phased.
The snps dataset is a data frame that comes with the zalpha package. It is identical to the simple example above, but with 0s and 1s instead of ACGTs.
Realistically, the dataset would be much bigger. It is highly recommended to only use SNPs with a minor allele frequency of over 5%, as it is hard to find correlations between rare alleles. Any missing values should be coded as NA.
The snps dataset can be loaded using the code:
library(zalpha)
data(snps)
## This is what the dataset looks like:
snps
#> bp_positions cM_distances chrom_1 chrom_2 chrom_3 chrom_4 chrom_5 chrom_6
#> 1 100 0.0001 1 0 0 1 1 1
#> 2 200 0.0002 1 1 1 0 0 0
#> 3 300 0.0005 1 1 1 1 1 1
#> 4 400 0.0008 0 0 1 1 1 0
#> 5 500 0.0010 1 1 1 1 0 1
#> 6 600 0.0015 1 0 0 1 1 0
#> 7 700 0.0017 1 0 0 1 1 1
#> 8 800 0.0019 0 0 0 0 1 1
#> 9 900 0.0023 1 1 0 0 1 1
#> 10 1000 0.0024 0 0 1 1 0 0
#> 11 1100 0.0026 1 0 1 0 0 0
#> 12 1200 0.0027 1 0 1 0 1 0
#> 13 1300 0.0032 1 1 1 1 0 0
#> 14 1400 0.0037 1 0 1 0 1 0
#> 15 1500 0.0039 0 0 1 0 1 0
#> 16 1600 0.0040 1 0 0 0 0 1
#> 17 1700 0.0041 0 0 0 1 0 1
#> 18 1800 0.0045 1 0 1 0 0 0
#> 19 1900 0.0048 1 0 0 1 1 1
#> 20 2000 0.0049 0 1 1 0 0 1
#> chrom_7 chrom_8 chrom_9 chrom_10
#> 1 0 0 0 0
#> 2 1 0 1 0
#> 3 1 1 0 1
#> 4 1 1 0 0
#> 5 0 0 1 0
#> 6 1 0 0 1
#> 7 1 1 1 0
#> 8 1 1 0 1
#> 9 0 0 1 1
#> 10 1 1 1 1
#> 11 0 0 1 0
#> 12 0 1 1 1
#> 13 1 1 1 0
#> 14 0 1 0 1
#> 15 0 1 1 0
#> 16 1 1 0 1
#> 17 1 1 0 1
#> 18 1 1 1 0
#> 19 0 0 0 1
#> 20 1 0 0 1
This data set contains information about each of the SNPs. The first column gives the physical location of the SNP along the chromosome, in whatever units is useful to the user (usually bp or Kb). In this example, the positions are in base pairs (bp).
The next column is the genetic distance of the SNP from the start of the chromosome. Ignore this column for now.
The final columns are the SNP alleles for each of the chromosomes in the population. Each SNP must be biallelic, but can contain any value, for example 0s and 1s, or ACGTs. The data can contain missing values, however it is recommended that the cut off is 10% missing at most. Missing values should be coded as NA. It is also recommended to use a minor allele frequency of 5% or higher.
Note: There is no requirement to put data into a data frame - all that is required is a vector of SNP positions and a matrix of SNP values.
To test for selection, the user can use the Zalpha function. This function assigns the first SNP in the dataset as the “target locus”, calculates the Zα value, then moves on to the next SNP making that the target locus, until every SNP in the dataset has been considered. It works by calculating correlations between alleles on each side of the target locus and averaging them. To do this, the function needs three inputs:
pos A vector of the physical locations of each of the SNPs. For this example, we will use the first column from the snps dataset: snps$bp_positions.
ws The window size. This is set to 3000 bp for this small example but for human analysis realistically a window size of around 200 Kb is appropriate. The window is centred on the target locus and considers SNPs that are within ws/2 to the left and ws/2 to the right of the target SNP. ws should always use the same units as pos i.e. if pos is in bp, ws should be in bp.
x A matrix of the SNP alleles across each chromosome in the sample. The number of rows should be equal to the number of SNPs, and the columns are each of the chromosomes. For this example we extract the SNP values from the snps dataset found in columns 3 to 12, and convert into a matrix: as.matrix(snps[,3:12]).
results<-Zalpha(snps$bp_positions,3000,as.matrix(snps[,3:12]))
results
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha
#> [1] NA NA NA NA 0.09893585 0.10944106
#> [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168 NA NA
#> [19] NA NA
plot(results$position,results$Zalpha)
The output is in the form of a list and shows the positions of each of the SNPs and the Zα value calculated for each SNP. The NAs are because there were not enough SNPs on one side of the target locus for an accurate Zα value to be calculated. This is controlled by the parameters minRandL and minRL, which have defaults 4 and 25 respectively. minRandL specifies the minimum number of SNPs that must be to the left and right of the target SNP within the window for Zα to be calculated. minRL is the product of these numbers.
The graph shows a sharp increase in Zα values in the centre of this region, which could indicate the presence of a sweep. The user should compare the values across the whole genome to find outliers.
Say the user is only interested in the output of Zalpha for a particular region of the chromosome; this is achieved by setting the “X” parameter to the lower and upper bounds of the region.
Zalpha(snps$bp_positions,3000,as.matrix(snps[,3:12]),X=c(500,1000))
#> $position
#> [1] 500 600 700 800 900 1000
#>
#> $Zalpha
#> [1] 0.09893585 0.10944106 0.11602964 0.11215782 0.12419556 0.12962040
That concludes the simple example of the Zalpha function!
It is recommended that the user uses the Zalpha_all function, as this function will calculate all the statistics in the zalpha package in one go, rather than running all of the statistics separately. More information on the Zalpha_all function can be found further down this vignette. Read on for information on the other statistics in the package and what they require.
There are many reasons apart from selection that pairs of SNPs could be more correlated than the rest of the genome, including regions of low recombination and genetic drift. This package allows the user to correct for expected correlations between SNPs. There are multiple functions included in this package that adjust for expected correlations, all of which have an example below. First however, the new inputs will be described. The extra inputs required are:
dist A vector containing the genetic distances between SNPs
An LD profile
Returning to the snps example dataset, we can now consider the second column of the dataset “cM_distances”.
snps$cM_distances
#> [1] 0.0001 0.0002 0.0005 0.0008 0.0010 0.0015 0.0017 0.0019 0.0023 0.0024
#> [11] 0.0026 0.0027 0.0032 0.0037 0.0039 0.0040 0.0041 0.0045 0.0048 0.0049
Each value is the genetic distance of the SNP from the start of the chromosome. This could be in centimorgans (cM), linkage disequilibrium units (LDU) or any other way of measuring genetic distance, as long as it is additive (i.e. the distance between SNP A and SNP C is equal to the distance between SNP A and SNP B plus SNP B and SNP C).
There are many ways of calculating the genetic distances between SNPs. Some software that could be used include LDhat[2], pyrho[3], FastEPRR[4], and LDJump[5].
Using an LD (linkage disequilibrium) profile allows the user to adjust for variable recombination rates along the chromosome. An LD profile is a basic look-up table. It tells the user what the expected correlation between two SNPs is, given the genetic distance between them. Here is the example LD profile provided with the zalpha R package:
data(LDprofile)
LDprofile
#> bin rsq sd Beta_a Beta_b
#> 1 0.0000 0.12214545 0.2275545 0.2764342 1.8116558
#> 2 0.0001 0.09335562 0.2197797 0.2708984 2.0267642
#> 3 0.0002 0.12218997 0.2370013 0.2698494 1.6785461
#> 4 0.0003 0.09042594 0.2005840 0.2639042 2.2719719
#> 5 0.0004 0.09846861 0.2415291 0.2498579 1.5550545
#> 6 0.0005 0.05234892 0.1645840 0.2734530 3.5077770
#> 7 0.0006 0.09849803 0.2171552 0.2805981 2.0719602
#> 8 0.0007 0.09234729 0.2185910 0.2729321 1.8642312
#> 9 0.0008 0.05612510 0.1463234 0.3270302 4.5627156
#> 10 0.0009 0.05799673 0.1569512 0.2875387 3.8010222
#> 11 0.0010 0.06451333 0.1803580 0.2695071 2.7318059
#> 12 0.0011 0.07202593 0.1980737 0.2575224 2.2259595
#> 13 0.0012 0.10457653 0.2452326 0.2527777 1.5974515
#> 14 0.0013 0.05750545 0.1681630 0.2854224 3.3140481
#> 15 0.0014 0.09774452 0.2388637 0.1924497 0.6386457
#> 16 0.0015 0.06229074 0.1752179 0.1791416 0.6853134
#> 17 0.0016 0.08488753 0.1917311 0.3031776 2.7157927
#> 18 0.0017 0.08160874 0.2037504 0.2639457 2.1465406
#> 19 0.0018 0.08745139 0.2037958 0.2704832 2.2651764
#> 20 0.0019 0.07123330 0.1840094 0.2810918 2.8468435
#> 21 0.0020 0.09485261 0.2191376 0.2723791 1.9259026
#> 22 0.0021 0.05804671 0.1458408 0.3207629 4.4316165
#> 23 0.0022 0.06447594 0.1502141 0.3368500 4.2842122
#> 24 0.0023 0.08147056 0.2012367 0.2690378 2.2633659
#> 25 0.0024 0.09434107 0.2250389 0.2557325 1.8280552
#> 26 0.0025 0.07046571 0.1831713 0.2857604 2.8830237
#> 27 0.0026 0.08397797 0.1926024 0.2900191 2.4698181
#> 28 0.0027 0.05834056 0.1662758 0.3033592 3.2711990
#> 29 0.0028 0.06702507 0.1662532 0.2933794 3.2574873
#> 30 0.0029 0.05820796 0.1723149 0.2682267 2.8738054
#> 31 0.0030 0.09507890 0.2085120 0.2716050 2.1487500
#> 32 0.0031 0.04551497 0.1239696 0.3296394 5.8488772
#> 33 0.0032 0.04241460 0.1352978 0.3349360 5.5737334
#> 34 0.0033 0.10255730 0.2259408 0.2718446 1.9065634
#> 35 0.0034 0.05181171 0.1523572 0.2978946 4.0344464
#> 36 0.0035 0.06537539 0.1633711 0.2937305 3.4617464
#> 37 0.0036 0.07133690 0.1728133 0.2845848 3.0456476
#> 38 0.0037 0.09034151 0.2365844 0.2438361 1.6296772
#> 39 0.0038 0.06661469 0.1718022 0.2842834 3.1988517
#> 40 0.0039 0.11322964 0.2325874 0.2031107 0.6843425
#> 41 0.0040 0.10068963 0.2067405 0.2918418 2.1743057
#> 42 0.0041 0.09924437 0.2321804 0.2601480 1.7887389
#> 43 0.0042 0.07122869 0.1813322 0.2701038 2.7201233
#> 44 0.0043 0.11335340 0.2081721 0.2869211 2.0950029
#> 45 0.0044 0.11286222 0.2357003 0.2679826 1.6981865
#> 46 0.0045 0.05429260 0.1622758 0.2940277 3.7098450
#> 47 0.0046 0.07505181 0.1992234 0.2679613 2.3091843
#> 48 0.0047 0.08873456 0.1996523 0.2754293 2.3102141
#> 49 0.0048 0.06982331 0.1902250 0.2603511 2.4806350
#> 50 0.0049 0.07837201 0.1944029 0.2745399 2.5207091
The LD profile contains data about the expected correlation between SNPs given the genetic distance between them. The columns in the example are:
bin This is the lower bound of the bin. In this example, row 1 would include any SNPs greater than or equal to 0 but less than 0.0001 centimorgans apart.
rsq The expected r2 value for pairs of SNPs whose genetic distance between them falls within the bin.
sd The standard deviation of r2 for the bin.
Beta_a The first shape of the Beta distribution fitted to this bin.
Beta_b The second shape of the Beta distribution.
If we know two SNPs are 0.00017 cM apart, this LD profile tells us that we expect the r2 value to be 0.093, with a standard deviation of 0.22, and that the expected distribution of r2 values for SNPs this far apart is Beta(0.27,2.03).
The package contains a function for creating an LD profile. This is explained lower down this vignette. The vignette continues by using the example LDprofile dataset supplied.
The expected Zα value (denoted ZαE[r2]) can be calculated for a chromosome given an LD profile and the genetic distances between each SNP in the chromosome. Instead of calculating the r2 values between SNPs, the function uses the expected correlations. It does this by working out the genetic distance between each pair of SNPs and uses the r2 values given in the LD profile for SNPs that far apart.
Zalpha_expected(snps$bp_positions, 3000, snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_expected
#> [1] NA NA NA NA 0.08633502 0.08169169
#> [7] 0.07981817 0.08104324 0.08134506 0.08015626 0.07984135 0.08082988
#> [13] 0.08263229 0.08040794 0.07822704 0.08289555 NA NA
#> [19] NA NA
Note that this statistic does not use the SNP value data. Once ZαE[r2] has been calculated, it can be combined with the Zα results to adjust for recombination, for example by computing Zα/ZαE[r2]. Outliers in this new combined statistic could be potential selection candidates.
Other functions that take into account variable recombination rates are Zalpha_rsq_over_expected, Zalpha_log_rsq_over_expected, Zalpha_Zscore, and Zalpha_BetaCDF. These statistics all use the actual r2 values from the data combined with the expected r2 values from the LD profile in various ways.
Examples of these statistics are here:
Zalpha_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_rsq_over_expected
#> [1] NA NA NA NA 1.229709 1.455017 1.557973 1.473750
#> [9] 1.663279 1.748810 1.655937 1.560111 1.348384 1.540937 1.403390 1.302492
#> [17] NA NA NA NA
Zalpha_log_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_log_rsq_over_expected
#> [1] NA NA NA NA -2.80109183 -2.52953641
#> [7] -2.89862430 -0.06997296 -0.04968498 -0.05720747 -0.09096136 -0.14235949
#> [13] -0.18846597 -0.11839984 -0.18570016 -3.28861224 NA NA
#> [19] NA NA
Zalpha_Zscore(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$rsq, LDprofile$sd)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_Zscore
#> [1] NA NA NA NA 0.08320337 0.17092137
#> [7] 0.21026515 0.17941815 0.24933136 0.28017067 0.25141352 0.21744923
#> [13] 0.13918111 0.21735113 0.16446541 0.12300862 NA NA
#> [19] NA NA
Zalpha_BetaCDF(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances, LDprofile$bin, LDprofile$Beta_a, LDprofile$Beta_b)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_BetaCDF
#> [1] NA NA NA NA 0.5553324 0.6013140 0.6135663
#> [8] 0.6114601 0.6153488 0.6073154 0.5895785 0.6060782 0.5871007 0.6161310
#> [15] 0.5889128 0.5730982 NA NA NA NA
Note that not all the statistics need all the columns from the LD profile.
The Zbeta function works in the same way as the Zalpha function but evaluates correlations between pairs of SNPs where one is to the left of the target locus and the other is to the right. It is useful to use the Zβ statistic in conjunction with the Zα statistic, as they behave differently depending on how close to fixation the sweep is. For example, while a sweep is in progress both Zα and Zβ would be higher than other areas of the chromosome without a sweep present. However, when a sweep reaches near-fixation, Zβ would decrease whereas Zα would remain high. Combining Zα and Zβ into new statistics such as Zα/Zβ is one way of analysing this.
The Zbeta function requires the exact same inputs as the Zalpha function. Here is an example:
results<-Zbeta(snps$bp_positions,3000,as.matrix(snps[,3:12]))
results
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta
#> [1] NA NA NA NA 0.1280042 0.1298619 0.1219965
#> [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328 NA NA NA NA
plot(results$position,results$Zbeta)
Comparing this to the Zα graph in the earlier example, we can see that the value of Zβ decreases where Zα increases. This could indicate that, if there is a sweep at this locus, it is near-fixation.
There is an equivalent Zbeta function for each of the Zalpha variations. Here is an example for each of them:
Zbeta_expected(snps$bp_positions, 3000, snps$cM_distances,
LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta_expected
#> [1] NA NA NA NA 0.07935940 0.07840546
#> [7] 0.07845889 0.07726474 0.07734390 0.07801402 0.07743300 0.07714027
#> [13] 0.07798668 0.07880907 0.08001374 0.08138610 NA NA
#> [19] NA NA
Zbeta_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances,
LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta_rsq_over_expected
#> [1] NA NA NA NA 1.662441 1.753478 1.628663 1.463948
#> [9] 1.543292 1.559820 1.468623 1.689309 1.687198 1.759164 1.737935 1.934010
#> [17] NA NA NA NA
Zbeta_log_rsq_over_expected(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances,
LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta_log_rsq_over_expected
#> [1] NA NA NA NA -5.81626185 -6.65751023
#> [7] -7.49094344 -6.58013806 -6.69181952 -6.54908550 -7.83646879 -6.18038047
#> [13] -7.39743178 -6.45648923 -6.00894964 0.09510038 NA NA
#> [19] NA NA
Zbeta_Zscore(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances,
LDprofile$bin, LDprofile$rsq, LDprofile$sd)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta_Zscore
#> [1] NA NA NA NA 0.2644152 0.2959959 0.2473133
#> [8] 0.1761311 0.2115564 0.2148552 0.1755580 0.2596216 0.2590911 0.2916302
#> [15] 0.2848765 0.3617791 NA NA NA NA
Zbeta_BetaCDF(snps$bp_positions, 3000, as.matrix(snps[,3:12]), snps$cM_distances,
LDprofile$bin, LDprofile$Beta_a, LDprofile$Beta_b)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zbeta_BetaCDF
#> [1] NA NA NA NA 0.5904780 0.5810107 0.5627583
#> [8] 0.5507821 0.5595338 0.5637204 0.5466796 0.5865750 0.5672179 0.5854016
#> [15] 0.5899818 0.6315494 NA NA NA NA
These statistics show the diversity around the target locus. LR calculates the number of SNPs to the left of the target locus multiplied by the number of SNPs to the right. L_plus_R is the total number of pairs of SNPs on the left and the right of the target locus. The idea behind these statistics is that if the diversity is low, there might be a sweep in this region.
Care should be taken when interpreting these statistics if diversity has been altered by filtering and, when using the Zalpha_all function below, the use of minRL and minRandL parameters.
Zalpha_all is the recommended function for using this package. It will run all the statistics included in the package (Zα and Zβ variations), so the user does not have to run multiple functions to calculate all the statistics they want. The function will only calculate the statistics it has been given the appropriate inputs for, so it is flexible.
For example, this code will only run Zalpha, Zbeta and the two diversity statistics LR and L_plus_R, as an LD profile was not supplied:
Zalpha_all(snps$bp_positions,3000,as.matrix(snps[,3:12]))
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $LR
#> [1] 0 15 30 45 60 70 78 84 88 90 90 88 84 78 70 60 45 30 15 0
#>
#> $L_plus_R
#> [1] 105 105 106 108 111 101 93 87 83 81 81 83 87 93 101 111 108 106 105
#> [20] 105
#>
#> $Zalpha
#> [1] NA NA NA NA 0.09893585 0.10944106
#> [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168 NA NA
#> [19] NA NA
#>
#> $Zbeta
#> [1] NA NA NA NA 0.1280042 0.1298619 0.1219965
#> [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328 NA NA NA NA
Supplying an LD profile and genetic distances for each SNP will result in more of the statistics being calculated.
There are many ways that the resulting statistics can be combined to give new insights into the data, see Jacobs et al. (2016)[1].
To find candidate regions for selection, first calculate the statistics across the chromosome, including any combined statistics that may be of interest. It is then suggested to find the maximum value for windows of around 200 Kb for each statistic (minimum values for the diversity statistics). Any regions that are outliers compared to the rest of the chromosome could be considered candidates and can be investigated further.
An LD profile is required to adjust for expected correlations. It is a basic look-up table that tells the user what the expected correlation is between a pair of SNPs, given the genetic distance between them. To create a simple LD profile, the user just needs two things:
dist a vector of genetic distances
x a matrix of SNP values
The user also needs to tell the function what bin_size to use, and optionally if they want to calculate Beta distribution parameters (this requires the fitdistrplus package to be installed).
The function considers all pairs of SNPs. It separates the pairs of SNPs into bins based on the genetic distance between them. The correlation between each pair of SNPs is calculated. In each bin, the average and the standard deviation of these correlations is calculated. If beta_params=TRUE, then a beta distribution will be fitted to the correlations in each bin too.
We can use the snps dataset as an example of this:
create_LDprofile(snps$cM_distances,as.matrix(snps[,3:12]),bin_size = 0.001,beta_params = TRUE)
#> bin rsq sd Beta_a Beta_b n
#> 1 0.000 0.10226664 0.1318628 0.8232427 6.435184 59
#> 2 0.001 0.14412346 0.1745857 0.7323277 3.960080 51
#> 3 0.002 0.09538328 0.1122312 0.9962786 8.322913 41
#> 4 0.003 0.11193736 0.1303776 1.0432924 7.113517 28
#> 5 0.004 0.19393939 0.1963924 1.4733393 4.979831 11
This code has created an LD profile with 6 columns. These are:
bin This is the lower bound of the bin, e.g. row one shows information for genetic distances that are between 0 and 0.001 cM.
rsq This is the expected r2 for genetic distances who fall in the given bin. For example, in row one the expected r2 value for SNPs which are 0-0.001 cM apart is 0.1023.
sd This is the standard deviation for the r2 values.
Beta_a This is the first shape parameter for the Beta distribution fitted to this bin
Beta_b This is the second shape parameter for the Beta distribution fitted to this bin
n This is the number of pairs of SNPs with a genetic distance falling within this bin, whose correlations were used to calculate the statistics.
There is one more optional input parameter - max_dist - which sets the maximum distance SNPs can be apart for calculating for the LD profile. For real world data, Jacobs et al. (2016)[1] recommend using distances up to 2 cM assigned to bins of size 0.0001 cM. Without this parameter, the code will generate bins up to the maximum distance between pairs of SNPs, which is likely to be inefficient as most distances will not be used. max_dist should be big enough to cover the genetic distances between pairs of SNPs within the window size given when the Zα statistics are run. Any pairs with genetic distances bigger than max_dist will be assigned the values in the maximum bin of the LD profile.
Ideally, we would want to generate an LD profile based on genetic data without selection but exactly matching the other population parameters for our data. This could be done using simulated data (using software such as MSMS[6] or SLiM[7]). We could use another genetic dataset containing a similar population. Alternatively, we could generate an LD profile using the same dataset that we are analysing for selection. Care should be taken that bins are big enough to have a lot of data in so expected r2 values are not overly affected by outliers.
Realistically, the user will not have just one chromosome of data for creating the LD profile, but will likely have a whole genome. So far, we have used a vector of genetic distances and a SNP value matrix in our example. However, with multiple chromosomes there will be a vector of genetic distances and a SNP value matrix for each chromosome, and it would be good to use all that information to create the LD profile. Therefore, the function has been written to accept multiple vectors of genetic distances and multiple SNP value matrices via lists.
The dist parameter will accept a vector or a list of vectors.
The x parameter will accept a matrix or a list of matrices.
For example, if we use the snps dataset but this time pretend it is three different chromosomes.
## Generate three chromosomes of data - cM distances and SNP values
chrom1_cM_distances<-snps$cM_distances
chrom1_snp_values<-as.matrix(snps[,3:12])
chrom2_cM_distances<-snps$cM_distances
chrom2_snp_values<-as.matrix(snps[,3:12])
chrom3_cM_distances<-snps$cM_distances
chrom3_snp_values<-as.matrix(snps[,3:12])
## create a list of the cM distances
cM_distances_list<-list(chrom1_cM_distances,chrom2_cM_distances,chrom3_cM_distances)
## create a list of SNP value matrices
snp_values_list<-list(chrom1_snp_values,chrom2_snp_values,chrom3_snp_values)
## create the LD profile using the lists as the dist and x parameters
create_LDprofile(cM_distances_list,snp_values_list,bin_size = 0.001,beta_params = TRUE)
#> bin rsq sd Beta_a Beta_b n
#> 1 0.000 0.10226664 0.1311114 0.6691595 5.517397 177
#> 2 0.001 0.14412346 0.1734333 0.5921975 3.371973 153
#> 3 0.002 0.09538328 0.1113074 0.7402206 6.697932 123
#> 4 0.003 0.11193736 0.1287972 0.7396069 5.557230 84
#> 5 0.004 0.19393939 0.1901561 1.0279480 3.889066 33
Care should be taken that the chromosomes stay in the same order in each list.
Congratulations! You should now be able to create your own LD profile and use the zalpha package.
[1] Jacobs, G.S., Sluckin, T.J., and Kivisild, T. Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps. Genetics, 2016. 203(4): p. 1807
[2] McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. The Fine-Scale Structure of Recombination Rate Variation in the Human Genome. Science, 2004. 304(5670): 581-584.
[3] Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, 2019. 5(10): eaaw9206.
[4] Gao, F., Ming, C., Hu, W. J., and Li, H. P. New Software for the Fast Estimation of Population Recombination Rates (FastEPRR) in the Genomic Era. G3-Genes Genomes Genetics, 2016. 6(6): 1563-1571.
[5] Hermann, P., Heissl, A., Tiemann-Boege, I., and Futschik, A. LDJump: Estimating variable recombination rates from population genetic data. Molecular Ecology Resources, 2019. 19(3): 623-638.
[6] Ewing, G. and Hermisson, J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 2010. 26(16):2064-2065.
[7] Haller, B.C. and Messer, P.W. SLiM 3: Forward Genetic Simulations Beyond the Wright–Fisher Model. Molecular Biology and Evolution, 2019. 36(3):632-637.