______________________________________________________________________________________
INTERSNP
–
Genome-wide Interaction Analysis
_________________________________________________________________________________
Christine Herold
Tim Becker
Dmitriy Drichel
André Lacour
Tatsiana Vaitsiakhovich
Deutsches Zentrum für Neurodegenerative Erkrankungen
(DZNE)
German Center for Neurodegenerative Diseases
Bonn
Institute for Medical Biometrics, Informatics and
Epidemiology (IMBIE)
University of Bonn
Bonn, May 3, 2013
NOTE: Single-marker analysis is always carried out by default.
INTERSNP is written in C/C++ and can be operated from the command line.
Compile: g++ intersnp.cpp -o intersnp -O3.
Compile parallel version: g++ intersnp.cpp -o intersnpParallel -O3 -fopenmp.
There are two main loops which can be parallelized in INTERSNP. To compile a parallel version of
INTERSNP it’s necessary to modify the source code and to change either "#define PARALLELN 0" or
"#define PARALLELA 0". In order to parallelize the n-loop used for MC-simulations, in particular pathway
association analysis, change "#define PARALLELN 0" to "#define PARALLELN 1". To do a full
Genome-wide interaction analysis (GWIA) it’s recommended to parallelize the A-loop, change "#define
PARALLELA 0" to "#define PARALLELA 1".
Run: ./intersnp selectionfile.txt
Run parallel version: ./intersnpParallel selectionfile.txt
Various analysis options (statistical test, priorities, ...) can be specified by the user in a selection file. The
file consists of three columns: keyword, parameter, and optionally comments, and is separated by
whitespaces (tab, blank). Further columns are ignored.
To get familiar with the selection file, first of all a short example is introduced which will be described in detail.
TPED is the path to the .tped input file and TFAM is the path to the .tfam file. In this example we also use
an annotation file (ANNOTATIONFILE) to get the genetic information. After the input files we define some
quality control (QC) criteria to delete individuals and SNPs not satisfying the QC-criteria of HWE_P_CASE
and HWE_P_CONTROL, MRDIFF (missing rate) and MAF (minor allele frequency). Every analysis starts with a
single-marker test which can be specified by SINGLEMARKER. In our example we use Armitage’s trend
test (SINGLEMARKER 1). Furthermore, we would like to do a two-marker analysis, so we select
TWO_MARKER 1 and choose TEST 5, which is a test for additive interaction (see section 4.1). To
reduce the number of tests we use the statistic and genetic criteria. SINGLETOP 50000 and
M_WITH_SINGLETOP 2 means that we analyze only pairs of SNPs which are among the top
50000 single-marker results. In addition, these pairs should also be in a coding region of a
gene (GENETIC_IMPACT 2, M_WITH_GENETIC_IMPACT 3). Finally, we specify the output name
(OUTPUTNAME).
All the input files and options of the complete selection file below will be described in the following sections.
Possible data input formats are the PLINK (Purcell et al., 2007) formats (.ped/.map; .tped/.tfam;
.bed/.bim/.fam).
For instance the .tped file (TPED) contains SNP and genotype information, where each row represents a
SNP and two columns represent genotypes per individual. The first four columns are chromosome, marker
ID, genetic distance and base pair position.
NOTE: To improve running time it it recommended to sort the file by chromosome and
position.
The .tfam file (TFAM) contains family information, each row represents a person. To create a .tfam file use
the first six columns of a .ped file: Family ID, Individual ID, Father, Mother, Sex and Affection status.
For more details on the data formats visit the PLINK website http://pngu.mgh.harvard.edu/ purcell/plink/data.shtml.
To work with genetic criteria a SNP annotation file (ANNOTATIONFILE) is needed. We use the same format as in the Illumina Human610 chip annotation file. The first columns represent rs number, chromosome, base pair position and genome_build number; all the other columns show detailed information. For our different categories the following columns are important:
The file can be downloaded from our homepage (http://intersnp.meb.uni-bonn.de/ Downloads).
NOTE: GENETIC_IMPACT and M_WITH_GENETIC_IMPACT have to be selected.
NOTE: To improve running time it is recommended to sort the file by chromosome and position.
To include pathway information a pathway file (PATHWAYFILE) is needed. A file containing KEGG (Kanehisa, et al.), (O’Dushlaine et al., 2009) pathways1 can be downloaded from our homepage (Downloads). We use the KEGG_2_SNP_229.txt file with genes annotated by us. The columns present pathway name, rs number and gene.
NOTE: PATHWAY has to be selected (1).
NOTE: To improve running time it is recommended to sort the file by pathway name.
To include own covariates into the analysis use the covariate file (COVARIATEFILE). The first two columns represent Family ID and Person ID and then up to 10 covariates are allowed.
NOTE: Covariates are selected with the COVARIATES keyword. If all covariates shall be used, write 1-10;
(semicolon is important)
NOTE: Missing data value is "–" or "x".
For user-defined logistic/linear regression models that are not included in the list of tests, the model file (MODELFILE) is needed. The first column presents the name of the parameter followed by 0/1 indicators for the first and the second likelihood. The indicators define which of the compared likelihoods L1 and L2 are used.
NOTE: TEST has to be set to M.
To analyse only user-specified SNP pairs or triples a combi file (COMBILIST) is required. Each row represents one combination.
NOTE: COMBILIST has to be set to 1.
To analyse all combinations of user-specified SNPs from the .tped file a SNP file (SNPFILE) is required. Each row represents one SNP.
NOTE: SNPLIST has to be set to 1.
|
|
|||
|
|
|||
|
Keyword |
Example |
Default |
Description |
|
|
|||
|
|
|||
|
FILE |
|
|
Path to SNP/Pedigree/Genotype files. The input format is the standard PLINK format (.ped/.map) (section 2.2.1) |
|
|
|||
|
|
|||
|
TFILE |
|
|
Path to SNP/Pedigree/Genotype files. The input format is the transposed PLINK format (.tped/.tfam) (section 2.2.1) |
|
|
|||
|
|
|||
|
BFILE |
|
|
Path to SNP/Pedigree/Genotype files. The input format is the binary PLINK (format (.bed/.bim/.fam) section 2.2.1) |
|
|
|||
|
|
|||
|
FAM / TFAM / BFAM |
./test.fam |
|
Path to the .fam/.tfam file |
|
|
|||
|
|
|||
|
MAP |
./test.map |
|
Path to the .map file |
|
|
|||
|
|
|||
|
BMAP / BIM |
./test.bim |
|
Path to the .bim file |
|
|
|||
|
|
|||
|
PED |
./test.ped |
|
Path to the .ped file |
|
|
|||
|
|
|||
|
TPED |
./test.tped |
|
Path to the .tped file |
|
|
|||
|
|
|||
|
BPED / BED |
./test.bed |
|
Path to the .bed file |
|
|
|||
|
|
|||
|
ANNOTATIONFILE |
./annotationfile.txt |
|
Path to the annotation file (section 2.2.2). Only needed if genetic criteria is used |
|
|
|
|
NOTE: GENETIC_IMPACT and M_WITH_GENETIC_IMPACT have to be selected |
|
|
|
|
NOTE: To improve running time it is recommended to sort the file by chromosome and position |
|
|
|||
|
|
|||
|
PATHWAYFILE |
./pathwayfile.txt |
|
Path to the pathway file (section 2.2.3) |
|
|
|
|
NOTE: PATHWAY has to be switched on 1 |
|
|
|
|
NOTE: To improve running time it is recommended to sort the file by pathway name |
|
|
|||
|
|
|||
|
COVARIATEFILE |
./covariatefile.txt |
|
Path to the covariate file (section 2.2.4). Only needed if covariates are included into the analysis |
|
|
|
|
NOTE: Covariates are selected with the COVARIATES keyword (see below). If all covariates are to be used, write 1-10; |
|
|
|
|
NOTE: Missing data value is "–" or "x" |
|
|
|||
|
|
|||
|
MODELFILE |
./modelfile.txt |
|
Path to the model file (sections 2.2.5, 4.2.3). Only needed for user-defined logistic regression models that are not included in the list of tests |
|
|
|
|
NOTE: TEST has to be switched on M |
|
|
|||
|
|
|||
|
SNPFILE |
./snpfile.txt |
|
Path to the SNP file (section 2.2.7) |
|
|
|
|
NOTE: SNPLIST has to be put on 1 |
|
|
|||
|
|
|||
|
COMBIFILE |
./combifile.txt |
|
Path to the combi file (section 2.2.6) |
|
|
|
|
NOTE: COMBILIST has to be put on 1 |
|
|
|||
|
|
|||
|
SNPLIST |
|
0 |
Only SNPs from the SNP file will be analyzed: 1=yes, 0=no |
|
|
|||
|
|
|||
|
COMBILIST |
|
0 |
Only SNPs pairs/triples form the combi file will be analyzed: 1=yes, 0=no |
|
|
|||
|
|
|||
|
ONLY_MALE |
|
0 |
Only male persons will be analyzed: 1=yes, 0=no |
|
|
|||
|
|
|||
|
ONLY_FEMALE |
|
0 |
Only female persons will be analyzed: 1=yes, 0=no |
|
|
|||
|
|
|||
|
POSCHOICE |
chr4; chr12,123000-160000; chr24; |
|
Selection of certain regions (ONLY the stated regions will be analyzed) |
|
|
|
|
NOTE: Use the notation shown in the example |
|
|
|||
|
|
|||
|
NEGCHOICE |
chr4; chr12,123000-160000; chr24; |
|
Exclusion of certain regions (The stated regions will NOT be analyzed) |
|
|
|
|
NOTE: Use the notation shown in the example |
|
|
|||
|
|
|||
|
QT |
|
0 |
Use the analysis of quantitative traits: 1=yes, 0=no (case-control) |
|
|
|||
|
|
|||
|
MISSING_PHENO |
-9 |
-99/0 |
Specify the missing phenotype value |
|
|
|
|
By default it is -99 for QT 1, and is 0 for QT 0 |
|
|
|
|
NOTE: When using QT 1, make sure that the missing phenotype value is defined |
|
|
|||
|
|
|||
|
LD_DISTANCE |
|
|
Define the LD distance between SNP pairs, in base pairs (bp) |
|
|
|||
|
|
|||
|
HWE_P_CASE |
0.001 |
1 |
QC-threshold for HWE in cases (p-value cutoff) |
|
|
|||
|
|
|||
|
HWE_P_CONTROL |
0.01 |
1 |
QC-threshold for HWE in controls (p-value cutoff) |
|
|
|||
|
|
|||
|
|
|||
|
MRDIFF |
0.02 |
1 |
QC-threshold missing rate (mr): individuals and SNPs worse than the average mr + MRDIFF will be deleted (iterative algorithm) |
|
|
|||
|
|
|||
|
MAF |
|
0 |
SNPs with lower MAF will be deleted |
|
|
|||
|
|
|||
|
SINGLE_MARKER |
|
|
1=Armitage’s trend test, 2=genotype test with 2 d.f., 3=logistic regression with 1 d.f., 4=logistic regression with 2 d.f. |
|
|
|||
|
|
|||
|
TWO_MARKER |
|
0 |
Change parameter from 0 to 1 to perform two-marker analysis |
|
|
|
|
NOTE: It’s not possible to select two-marker and three-marker analysis together |
|
|
|||
|
|
|||
|
THREE_MARKER |
|
0 |
Change parameter from 0 to 1 to perform three-marker analysis |
|
|
|
|
NOTE: It’s not possible to select three-marker and two-marker analysis together |
|
|
|||
|
|
|||
|
PRETEST |
|
1 |
Pre-test is selected: 1=yes, 0=no |
|
|
|||
|
|
|||
|
PRETEST_CUTOFF |
|
0.05 |
Pre-test threshold. If Pre-test yields p-value <= cutoff, TEST is conducted |
|
|
|||
|
|
|||
|
TEST |
|
|
Multi-marker tests: 1=chi-square test on contingency table (8 d.f.), 2=log-linear model (4 d.f., test FOR interaction), 3-12=logistic regression models, 15=case-only allelic interaction, 16=case-only SNP-SNP interaction, M=User-defined logistic regression model (MODELFILE). Detailed description are presented in the model table |
|
|
|
|
NOTE: tests 1-2 are possible for two-marker and three-marker analysis, tests 3-8 are used only for two-marker analysis and tests 9-12 are only for three-marker analysis |
|
|
|||
|
|
|||
|
COVARIATES |
1-5;7;8; |
|
Select the covariates which are to be considered, separate by “;”, condense with “-”. Numbers must be put in ascending order. |
|
|
|
|
NOTE: It’s only possible to use this option if the path to the covariate file (keyword: COVARIATEFILE) is declared. |
|
|
|||
|
|
|||
|
SEXCOV |
|
|
Sex as a covariate: 1=yes, 0=no. Can be selected without a covariate file |
|
|
|||
|
|
|||
|
SINGLETOP |
1000 |
|
Based on these single-marker p-values (Armitage’s trend test or genotype test with 2 d.f.), a list of n top SNPs will be computed. The length can be specified with this option |
|
|
|||
|
|
|||
|
M_WITH_SINGLETOP |
2 |
|
Number of SNPs (0,1,2,3) which have to be taken from the single-marker p-values toplist |
|
|
|
|
NOTE: SINGLETOP has to be specified |
|
|
|
|
NOTE: M_WITH_SINGLETOP 3 is only possible when THREE_MARKER 1 |
|
|
|||
|
|
|||
|
GENETIC_IMPACT |
3 |
|
Genetic impact: 0=none (gene desert), 1=close to gene, 2=within gene, 3=within coding region, 4=non-synonymous |
|
|
|
|
NOTE: To use this option the path to the annotation file (keyword: ANNOTATIONFILE) has to be declared |
|
|
|||
|
|
|||
|
|
|||
|
M_WITH_GENETIC_IMPACT |
2 |
|
Number of SNPs (0,1,2,3) involved in the analysis which shall have at least the selected genetic impact |
|
|
|
|
NOTE: GENETIC_IMPACT has to be specified |
|
|
|
|
NOTE: M_WITH_GENETIC_IMPACT 3 is only possible when THREE_MARKER 1 |
|
|
|||
|
|
|||
|
SNP1 |
|
|
Fix first SNP for analysis, specify rs number |
|
|
|||
|
|
|||
|
SNP2 |
|
|
Fix second SNP for analysis, specify rs number |
|
|
|||
|
|
|||
|
SNP3 |
|
|
Fix third SNP for analysis (possible only when THREE_MARKER 1), specify rs number |
|
|
|||
|
|
|||
|
PATHWAY |
|
|
Pathway information is used as a prior for interaction analysis: 1=yes, 0=no |
|
|
|
|
NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared |
|
|
|||
|
|
|||
|
PATHWAYANALYSIS |
|
0 |
Pathway association analysis with one of the six pathway association tests (chapter 6) |
|
|
|
|
NOTE: Choose, for instance, a number of SNP, multiply it by 0.05 (50000×0.05 = 25000 →SINGLETOP 25000). Simulations should be put on at least 999 |
|
|
|
|
NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared |
|
|
|||
|
|
|||
|
PATHWAYTEST |
|
2 |
1=SNP ratio, 2=Fisher score (all significant SNPs), |
|
|
|
|
3=gene ratio (ALIGATOR), 4=Fisher Max (best SNP per gene), |
|
|
|
|
5=Fisher MaxPlus |
|
|
|
|
NOTE: PATHWAYANALYSIS has to be selected. |
|
|
|
|
NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared |
|
|
|||
|
|
|||
|
P_INTERRATIO |
|
|
p-value cutoff for TEST 6 |
|
|
|||
|
|
|||
|
HAPLO |
|
|
Genome-wide Haplotype Analysis (GWHA) for all SNPs pairs/triples less than HAPLO_DIST apart |
|
|
|
|
NOTE: To use two-marker select TWO_MARKER 1, for three-marker select THREE_MARKER 1 |
|
|
|
|
NOTE: To use this option the following options have to be selected: HAPLO 1 and HAPLO_DIST have to be specified |
|
|
|||
|
|
|||
|
|
|||
|
HAPLO_DIST |
50000 |
0 |
To define the haplotype distance. Here all pairs/triples less than 50000 bp apart |
|
|
|||
|
|
|||
|
DOHAPFILE |
0 |
|
Produce proxy-file for YAMAS (section 2.5) |
|
|
|||
|
|
|||
|
SIMULATION |
|
|
Number of MC-simulations (0=no simulations will be conducted) |
|
|
|||
|
|
|||
|
MC_WITH_SM |
|
0 |
1=MC-simulations account for multi-marker AND single-marker tests. 0=MC-simulations account only for multi-marker tests |
|
|
|||
|
|
|||
|
PRINTTOP |
100 |
|
Best n multi-marker p-values are printed |
|
|
|||
|
|
|||
|
PFILTER |
|
|
p-value cutoff for multi-marker analysis |
|
|
|||
|
|
|||
|
OUTPUTNAME |
./myproject/test |
|
Specify the output name |
|
|
|||
|
|
|||
|
ANNOTATE |
|
|
Add annotation information in the output files |
|
|
|||
|
|
|||
|
GENECOL |
|
5 |
Specify column number of gene in the annotation information |
|
|
|||
|
|
|||
|
DOIBS |
|
0 |
Calculate average IBS-value of all possible pairs: 1=yes and stop afterwards, 2=yes and proceed with analysis, 0=no |
|
|
|||
|
|
|||
|
IBS_SD_RELATIVES |
|
4.0 |
Limit in standard deviations for high/low-IBS. Pairs that exceed this limit are considered to be relatives |
|
|
|||
|
|
|||
|
IBS_SD_OUTLIER |
|
4.0 |
Limit in standard deviations. People whose low-IBS-count exceed this limit are considered to be outlier |
|
|
|||
|
|
|||
|
STRATIFY_STRUCT |
|
0 |
PAIR, GROUP or CLUSTER |
|
|
|||
|
|
|||
|
STRATIFY_VALID |
|
VICINITY |
CLUSTER, VICINITY |
|
|
|||
|
|
|||
|
MATCHINGCHOICE |
chr4;chr12,123000-160000;chr24; |
|
Selection of certain regions (ONLY the stated regions will be considered for the pairwise matching) |
|
|
|||
|
|
|||
|
CLUSTER_COVARS |
1 |
0 |
Creates one covariate for each cluster and puts 0/1 for individuals |
|
|
|||
|
|
|||
|
FAMCLUSTER |
1 |
0 |
Creates permution clusters from families, overrides STRATIFY |
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
The following output files are created by the program:
The parameters in the output file depend on the used single-marker-test (Armitage’s trend test, genotype test with 2 d.f., logistic regression with 1 d.f., logistic regression with 2 d.f.). The following list shows all possible parameters.
With INTERSNP it is possible to produce a file with r2-values for all SNPs that lie within a pre-specified distance, for instance from HAPMAP and/or 1000 Genomes data (.tped/.tfam files). In addition to the basic keywords (TPED/TFAM, QC options,...), the following keywords have to be specified in the selectionfile:
In the above example, r2 will be computed for all SNPs with a distance smaller than 10000 bp. The
outputfile (*hapfile.txt) will contain all SNP pairs for which r2 ≥ 0.20.
Example:
A more detailed output is obtained with the choice DOHAPFILE 2.
The first seven columns are the same as before. Columns 8-9 are allele frequencies for SNP 1, columns 10-11 are allele frequencies for SNP 2. Columns 12-15 are the four haplotype frequencies, here in lexicographic order, i.e. ignoring the proxy definition. Column 16 is D′, column 17 is r2, and column 18 is the proxy indicator. The computation of the indicator is based on the allele and haplotype frequencies listed.
For more details visit the YAMAS web-page http://yamas.meb.uni/bonn.de/
All pairs of SNPs with at least one SNP among the top 10 single-marker results
All pairs of SNPs which are among the top 1000 single-marker results
All pairs of SNPs which are among the top 50000 single-marker results and also lie in a coding region of a gene
All pairs of non-synonymous SNPs
All pairs of SNPs which are among the top 50000 single-marker results and also lie in the same pathway
By default, a single-marker p-value is computed for all QC-SNPs. For the autosomes and the pseudoautosomal region of the Y chromosome the user can either choose Armitage’s trend test (Armitage, 1955) or the genotype test with 2 degrees of freedom (d.f.). It is also possible to include covariates in the logistic/linear regression, the allelic test with 1 d.f. and the genotypic test with 2 d.f. Another advantage of the logistic/linear regression is that all chromosomes, including X and Y, can be analyzed. Without logistic/linear regression Y-chromosomal markers are evaluated with the χ2-test for the 2 × 2 table of allele counts in male individuals. For X-chromosomal SNPs, we use the allele-based test with 1 d.f. suggested by D. Clayton (Clayton, 2008). Claytons’s approach guarantees that hemizygote males and homozygote females contribute equally to the test statistic, reflecting the fact that only one of the X-chromosomes is active in females.
For two SNPs, there are 3 × 3 = 9 two-marker genotypes. We consider the respective 3 × 3 × 2
contingency table of counts in cases and controls. Let T be the standard test statistic for contingency
tables, i.e., T = ∑
i,j,k
, where i,j ∈{1,2,3}, k ∈{1,2}, Oi,j,k is the observed number of
counts in cell (i,j,k) and Ei,j,k is the number of counts in cell (i,j,k) expected under the null hypothesis.
T is χ2-distributed with 8 d.f. By construction, TEST 1 does not yield a test for interaction, but a test that
has power if both SNPs interact. The test can become significant through marginal SNP effects
alone.
Following (Bishop et al., 2007) the observations xijk of the 3 × 3 × 2 contingency table are
modeled by fitting a log-linear model with expected cell counts mijk to the cell entries
log mijk = u + u1(i) + u2(j) + u3(k) + u12(ij) + u13(ik) + u23(jk) without the term u123(ijk). Testing the null
hypothesis H0 : u123(ijk) = 0 yields an explicit test for genotypic interaction. With maximum likelihood
estimates mijk for the cell counts we obtain the test statistic T = 2∑
i,j,kxijk log
that is χ2-distributed
with 4 d.f. The test is equivalent to the respective 4 d.f. genotypic test in the logistic regression framework,
but is much faster to compute.
TEST 1 and TEST 2 are quick screening tests.
With logistic regression (Cordell and Clayton, 2002) a more flexible modeling is possible: Test for and under interaction, allelic und genotypic tests and the use of covariates.
We define logit(p) := log(
) = βT x, where β is the vector of coefficients to be estimated and x is a
vector representing genotype information.
We follow the coding suggested by (Cordell and Clayton, 2002). For each SNP i, i ∈{1,2,3}, we model its allelic effect xi by coding the genotypes (1,1), (1,2), and (2,2) as x1 = -1, x2 = 0, x3 = 1. Next, we model dominance effects xi,D, i ∈{1,2,3}, as x1,D = -0.5, x2,D = 0.5, x3,D = -0.5 for the genotypes (1,1), (1,2), and (2,2), respectively. By multiplication, we obtain interaction terms, for instance, x1x2 represents allelic interaction between SNPs 1 and 2, while x1,Dx2,D represents interaction between the dominance terms of SNPs 1 and 2. Note that these interaction terms code interaction on an additive logit scale and, hence, on a multiplicative odds ratio scale.
Let β0 be the intercept parameter that defines the baseline likelihood L0 := logit(p) = β0. Next, the likelihood
L1A := β0 + β1x1 models the allelic effect of SNP 1 and comparison to L0 yields a likelihood-ratio test with
1 degree of freedom. Analogously, comparison of L1G := β0 + β1x1 + β1,Dx1,D to L0 yields a genotypic
test for SNP 1 with 2 d.f. In general, we let L1,2A and Li,jG denote likelihoods containing allelic terms, or,
respectively, allelic and genotypic terms, for SNPs 1 and 2. In addition let, L1,2A,I and L1,2G,I be likelihoods
that also contain interaction terms, for instance, L1,2A,I = β0 + β1x1 + β2x2 + β1,2x1x2 and
L1,2G,I = β0 +β1x1 +β1,Dx1,D +β2x2 +β2,Dx2,D +β1,2x1x2 +β1,2Dx1x2,D +β1D,2x1,Dx2 +β1D,2Dx1,Dx2,D
To analyze quantitative traits we use linear regression. The parameter coding is equal to logistic regression. For linear regression, we define y = βT x, where y is a quantitative trait variable, and the sum of squared errors (SSE) is computed. The errors measure, per person i, the deviation between the observed value yi and its estimate ŷi predicted by the model.
For two SNPs, comparison of L1,2A against L0 yields a full allelic test with 3 d.f. (TEST 3), while
comparison of L1,2G against L0 yields a full genotypic test with 8 d.f. (TEST 4). In order to test for allelic
interaction, one compares L1,2A,I against L1,2A (1 d.f., TEST 5) and in order to test for genotypic
interaction one compares L1,2G,I against L1,2G (4 d.f., TEST 6). Furthermore, it is possible to test for the
additional effect of SNP 2 while controlling for the effect of SNP 1 by comparing L1,2A,I against L1A (2 d.f.,
TEST 7), or, by comparing L1,2G,I against L1G (6 d.f., TEST 8). These tests are summarized in the
model Table 4.1 and serve as INTERSNP standard tests that can called via their test number.
For three SNPs, likelihoods and tests can be formalized analogously. Here, we describe only allelic
tests, but note that INTERSNP allows also explicit model specification, in particular genotypic
3-SNP tests. Let L1,2,3A = β0 + β1x1 + β2x2 + β3x3 be the 3-SNP allelic likelihood, let
L1,2,3A,I2 = β0 +β1x1 +β2x2 +β1,2x1x2 +β3x3 be the 3-SNP allelic likelihood including the interaction term for
SNPs 1 and 2 and let L1,2,3A,I = β0 + β1x1 + β2x2 + β1,2x1x2 + β3x3 + β1,3x1x3 + β2,3x2x3 + β1,2,3x1x2x3
be the allelic likelihood including all pairwise and the three-fold interaction term. Then, testing L1,2,3A,I
against L0 yields a full 3-SNP allelic test with 7 d.f. (TEST 9), while testing L1,2,3A,I against L1,2,3A yields a
test for two- and three-fold allelic interaction (4 d.f., TEST 10). Testing L1,2,3A,I against L1,2,3A,I2 yields a
test for three-fold allelic interaction (1 d.f., TEST 11). Finally, testing L1,2,3A,I against L1,2A,I
yields a test for the additional allelic effect of SNP 3 while controlling for SNPs 1 and 2 (4 d.f.,
TEST12).
For X-chromosomal markers, all dominance and dominance interaction terms are ignored.
All regression tests can be combined with up to 10 covariates (Options: COVARIATES, COVARIATEFILE). In
particular, adjustment for population strata is possible.
Further tests using logistic regression can be specified via a modelfile. Choose M for the option TEST and specify the path to the modelfile (MODELFILE option). Include/exclude likelihood parameters with 1/0, respectively, as shown in the modelfile provided in the download area. Likelihood L1 is tested against likelihood L2.
Linear and logistic regression allow flexible modeling and inclusion of covariates, but their computation can be time-consuming, mostly because of the operations on the relatively large design matrices involved. Hence, when a Genome-wide Interaction Analysis shall be conducted, it makes sense to, first, apply a simple, easy-to-compute, pre-test to all SNP pairs and, second, to apply the more sophisticated test only to those pairs for which the pre-test yields a p-value smaller than PRETEST_CUTOFF. A cutoff of 0.01 is a reasonable choice, since it will reduce the number of computations of the "proper" test by a factor of at least 100 on average, but guarantees that no "real" p-values smaller than 10-5 are missed. Since p-values in the range of 10-12 are required for a complete GWIA (Becker, 2011), the loss of pairs with p-values less significant than 10-5 will in most cases be irrelevant. Even a cutoff of 0.001 or 0.0001 for the pre-test would be acceptable. The only situation in which relevant SNP pairs may be missed occurs when there are covariates with an extreme impact on the p-value, simply because the pre-tests ignore covariates.
When PRETEST 1 is selected in combination with TEST 3 (full allelic test, 3 d.f.) or TEST 5 (test for allelic interaction, 1 d.f.), a log-linear model with 1 d.f. (test for allelic interaction) is applied as a pre-test. We define the log-linear model in analogy to the log-linear model in the genotypic case (our TEST 2, section 4.2.2 ). There, the 3 × 3 × 2 genotype contingency table is considered. In the allelic case, we consider the 2 × 2 × 2 two-SNP-allele table, where the first dimension reflects the two alleles of SNP 1, the second dimension reflects the two alleles of SNP 2, and the last dimension reflects cases/control-status. Double heterozygote cases (controls) contribute 0.5 units to each cell of the 2 × 2 table of cases (controls). This practice is a simple work-around to treat double heterozygotes but renders the accompanying log-linear test conservative when its test statistic is evaluated with the χ2-distribution with 1 d.f. In this sense, the 1 d.f. log-linear test is not a proper test. However, its p-values are strongly correlated with the p-values of the test for allelic interaction in the logistic model, and, therefore, are well suited to serve as a pre-test, when the cutoff is chosen as suggested above.
In the genotypic case, the log-linear test (TEST 2, section 4.2.2) would be the obvious pre-test. Since this is a proper test in its own right, a useful strategy is to conduct a complete GWIA with TEST 2, produce a COMBILIST (section 2.2.6) with the best results, and run another test on the COMBILIST.
When PRETEST 1 is selected in combination with TEST 4 (full genotype test, 8 d.f.) or TEST 6 (test for genotypic interaction, 4 d.f.), an ANOVA with 4 d.f. (test for genotypic interaction) is applied as a pre-test. Note that when the pre-test is combined with TEST 4, pairs with marginal effects in both SNPs but with just modest interaction may be missed.
When PRETEST 1 is selected in combination with TEST 3 (full allelic test, 3 d.f.) or TEST 5 (test for allelic interaction, 1 d.f.), an ANOVA with 1 d.f. (test for allelic interaction) is applied as a pre-test. Double heterozygotes are treated as described for the logistic regression, with the same consequences. Note that when the pre-test is combined with TEST 3, pairs with marginal effects in both SNPs but with just modest interaction may be missed.
Case-only (or case-series, case-control without controls) design is based on a sample consisting of cases only. It is an efficient approach to test for gene-environment or gene-gene interaction in disease etiology, when there is a strong reason to believe that particular assumptions are fulfilled. Concerning a test for gene-environment interaction it is necessary to assume that the genetic type and the exposure occur independently in the population. A test for gene-gene interaction is reasonable to perform assuming that the frequencies of genetic variants are independent in the population or, equivalently, that the genes/SNPs under study are in linkage equilibrium.
Here, we accept that the genetic interaction between two genetic factors (e.g. genes, SNPs, alleles) is present if the joint effect of two factors on a disease differs from the additive combination of independent effects from both of them. The joint effect of two genetic factors is the effect due to the presence of both of them. The independent effect of one of the factors is its effect on a disease in the absence of the other factor. Sometimes the term "epistasis" is used instead of genetic interaction in the literature. The same definition is valid for the gene-environment interaction, where one of the genetic factors is replaced by the exposure.
From now on only the gene-gene interaction is addressed.
If the data satisfy the assumptions discussed above, then the so called case-only test for genetic interaction can be performed by using a χ2- test of independence between genotypes (with four degrees of freedom) or between alleles (with one degree of freedom) as well as logistic regression models. Notably, the case-only analysis leads sometimes to a more valid estimation of interaction than the corresponding case-control analysis, i.e. case-only based tests possess a higher power, see e.g. (Cordell, 2009), (Gatto et al., 2004), (Piegorsch et al., 1994), (Yang et al., 1999).
INTERSNP suggests several tests to check for the gene-gene interaction, where the genetic factors are chosen to be a pair of SNPs (SNP-SNP interaction). Starting from a pair of SNPs, the allelic interaction test can be performed as well.
IMPORTANT: In our study it was observed that the LD distance between two SNPs has to be at least 20 Mb (20000000 bp) in order to satisfy the assumption of linkage equilibrium. The recommended LD distance is even bigger and equals 50 Mb.
TEST 15 in INTERSNP is asymptotically equivalent to the case-only allelic interaction test in whole-genome analysis package PLINK and based on a χ2-test of independence between two alleles with one degree of freedom, see e.g. (Cordell, 2009), (Purcell et al., 2007). For this test the given SNPs data are organized in a 3 × 3 table of genotypic counts, which is in turn reduced to the 2 × 2 table of allelic counts.
TEST 16 in INTERSNP suggests to check for SNP-SNP interaction in the case-only mode. TEST 16 is based on a χ2-test of independence between two SNPs with four degrees of freedom, where the 3 × 3 table of genotypic counts is used. Let xij, i,j ∈{1,2,3} be the observed counts. Denote x+j = ∑ i=13xij, j ∈{1,2,3}, xi+ = ∑ j=13xij,i ∈{1,2,3}, and x++ = ∑ i,j=13xij. Expected counts are determined by


An example of a selection file to perform the case-only TEST 15 with INTERSNP is given below. Analogously, a selection file for the TEST 16 can be composed.
Our paper "Integrated Genome-Wide Pathway Association Analysis with INTERSNP" (Herold et al., 2012) describing the test in more details.
In order to test for association of a pathway with a phenotype the following parameters have to be specified. See also the example selectionfile "selectionfile_S6.txt".
PATHWAYANALYSIS 1 indicates that pathway association analysis shall be carried out. The keyword
PATHWAYFILE has to be followed by the path to the pathwayfile that shall be analyzed, for instance the
KEGG pathwayfile "KEGG_2_snp_b129.txt.ann" that can be found in the download section of the
homepage.
The various pathway scores are constructed from all SNP with a p-value smaller than a pre-specified
p-value p*. <singletop> has to be chosen accordingly. For instance, if p* = 0.05 is the desired threshold for
SNPs to be included into the pathway score and if 500000 QC-SNPs are in the .tped file, then
500000 * 0.05 = 25000 is the appropriate value for <singletop> (SINGLETOP 25000). Of course, the data set
may contain more SNPs with a p-value smaller then 0.05, but fixing the number 25000 instead protects
against inflation of type I error due to residual inflation of the single-marker test statistics. Note that all
single-marker tests can be combined with the pathway analysis. In particular, quantitative traits and
covariates can be handled.
Currently 5 different pathway scores are available, identified by the numbers 1 to 5.
Significance of the individuals’ pathways as well as after adjustment for testing multiple pathways is determined via Monte-Carlo simulations based on genome-wide permutation of case-control status or quantitative trait value. At least 3500 simulations are the minimum requirement to get experiment-wise significant results with the KEGG pathways. Rather then choosing 1000 or 10000 as the number of replicates we recommend 999 or 9999, respectively.
Here is an example of a selection file to perform the PAA with PATHWAYTEST 1. The selection files for PATHWAYTEST 2,3,4,5 are analogous.
Selection file for the interaction-Ratio (PATHWAYTEST 6)
INTERSNP-RARE is an extension build upon INTERSNP’s framework for genome-wide, permutation-based rare variant analysis. Currently, three rare variant testing procedures are implemented: Fisher-Rare - FR. based on the Fisher combination test (Fisher, 1925), COLLapsing test - COLL, (Li and Leal, 2008) and the Cumulative Minor Allele Test - CMAT, (Zawistowski et al., 2010). All tests are extended to their variable-threshold (VT) counterparts (keyword OPTIMAL). Whether a SNP is rare is determined by the parameter to the keyword RARE, which is interpreted as the threshold MAF for defining ”rareness”. MAF is calculated from the whole sample, irregardless of individuals’ affection status. In the case of a zero MAF, the SNP in question is homozygous and is not considered to be rare. In the literature, common values for defining rare SNPs are 0.01 or 0.05, though this may depend on the particular dataset.
Statistical testing of disease association is conducted on rare variants in physical proximity (”bins”). Thus, a physical position is required for each marker to be analyzed. The choice of a particular binning is not straightforward and has to be chosen taking various properties of the dataset into account. In INTERSNP-RARE, extensive support for bin assignment is provided.
Prior to elaborating on the details, it is necessary to fix our terminology.
Bin: a subset of discrete SNPs in physical proximity, taken from the set of all SNPs in a study. Bins can
be created algorithmically from genotype data. Automatic binning can be performed either by number of
SNPs (BINSIZE_SNP), by number of rare SNPs (BINSIZE_RARE) or by size of the region in base pairs
(BINSIZE_DIST).
Interval: a continuous region of a chromosome. Intervals are uniquely defined by a chromosome number, a start and an end position. Intervals may represent features like LD blocks or genes. Intervals are independent of any particular study data and are oblivious to any peculiarities therein. If intervals are provided, bins are created by projection of intervals on a dataset.
Intervals are supplied by an external file (keyword INTERVAL_IN). The maximal size for an interval is a whole chromosome. Pseudoautosomal regions and sex chromosomes are ideally treated as independent chromosomes.
Support for overlapping and non-overlapping bins and intervals is provided.
An overview of keywords for controlling rare variant testing are listed in the next section.
|
|
||
|
|
||
|
Keyword |
Example |
Description |
|
|
||
|
|
||
|
RARE |
0.05 |
Definition of rare MAF |
|
|
||
|
|
||
|
OPTIMAL |
1 |
1 for variable threshold analysis (VT) |
|
|
||
|
|
||
|
MAF_ADJUST |
1 |
1 (default): MAF is calculated from expected, (not observed) number of alleles. No effect if SNP missing rate=0. |
|
|
||
|
|
||
|
CHR_BOUNDARIES |
chromosome_coord_hg19.txt |
File with start and end positions of chromosomes and PARs |
|
|
||
|
|
||
|
BINSIZE_SNP |
4000 |
Number SNPs per bin (rare and common) |
|
|
||
|
|
||
|
BINSIZE_RARE |
50 |
Number rare SNPs per bin |
|
|
||
|
|
||
|
BINSIZE_DIST |
800000 |
Physical distance in bp per bin |
|
|
||
|
|
||
|
RARE_TESTS |
COLL;CMAT;FR; |
Tests to be conducted. Default: all |
|
|
||
|
|
||
|
WEIGHTS |
1/SD | BETA;1;25; | LOGISTIC;0.07;150; |
Weights of variants for non-collapsing tests |
|
|
||
|
|
||
|
LAMBDA_ADJUST |
1 |
Correction for genome-wide FR testing |
|
|
||
|
|
||
|
RARE_PRE_PRETEST |
0.0001 |
Pretest before simulations are conducted: eliminate bins below (conservative) threshold |
|
|
||
|
|
||
|
RARE_PRETEST |
100 0.1 |
Pretest for rare variant analysis. Takes two arguments: number of simulations and p-threshold for passing the pretest |
|
|
||
|
|
||
|
INTERVAL_IN |
Ensembl_Genes69_hg19.txt |
File containing intervals |
|
|
||
|
|
||
|
INTERVAL_FORMAT |
[Semicolon-separated list] |
Formatting of the interval file |
|
|
HEAD=”1”; |
Number of header lines |
|
|
SEPARATOR=”\t”; |
Field separator |
|
|
COLUMNS=”5”; |
Number of columns |
|
|
CHR=”1”; |
Column with chromosome number |
|
|
START=”2”; |
Column with interval start positions |
|
|
END=”3”; |
Column of interval end positions |
|
|
FEATURE=”4”; |
Column containing ”feature” |
|
|
DESCRIPTION=”5”; |
Column containing ”description” |
|
|
||
|
|
||
|
MERGE_INTERVALS |
1 |
1 for merging of overlapping input intervals |
|
|
||
|
|
||
|
FLANKING |
1000 |
Flanking of intervals in bp |
|
|
||
|
|
||
|
CONCATENATE_INTERVALS |
0 |
Number of consecutive intervals to be concatenated |
|
|
||
|
|
||
|
FILL_GAPS |
1 |
For a non-overlapping set of intervals: assign uncovered space to intervals |
|
|
||
|
|
||
|
FILTER_SMALL_BINS |
1 |
Filter bins: minimum required number of rare SNPs |
|
|
||
|
|
||
|
SPLIT_LARGE_BINS |
1 |
Filter bins: maximim allowed number of rare SNPs; bins will be split |
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
The keyword RARE_TESTS allows to choose tests to be conducted from one or more of the following: FR;CMAT;COLL;. The list has to be separated by semicolons. By default, all tests are selected.
The implemented Fisher-Rare test is based on Fisher’s combined probability test. The test statistic is determined by combining the N single-marker p-values of N rare variants in a particular bin

In our implementation, all weights are wj = 1 by default (see 7.4 on custom weights).
For fixed threshold analysis, the p-value is determined from the statistic by permutation of the case-control status (or quantitative trait), while keeping genotype data unchanged.
For independent p-values, the overall p-value may be determined via χ2-distribution with 2N degrees of freedom. Due to linkage disequilibrium (LD), the pi are not independent, which leads to the χ2-distribution being an anticonservative approximation. However, the χ2-distribution is useful for finding the optimal rare MAF by raising the threshold frequency and determining the overall p-value with the new set of rare SNPs, and taking the MAF for which the p-value is minimal. The ”true” p-value is calculated by permutation (see 7.5).
Setting the parameter of the keyword LAMBDA_ADJUST to 1 instructs INTERSNP to normalize singlemarker-pi to be normalized by dividing corresponding test statistics by the inflation factor. Use this option only with a large number of SNPs, i.e. in genome-wide association studies.
CMAT, the cumulative minor allele test (Zawistowski et al., 2010) utilizes the weighted sum (with weights wj) over F minor allele frequencies of rare variants in NA cases and NU controls

where mA, mU, MA and MU are weighted minor and major allele counts in affected and unaffected individuals




Here, Xij and Y ij are the number of minor alleles at the jth variant position for the ith case and control, respectively.
The p-values for the CMAT test are determined by permutation.
The collapsing test COLL (Li and Leal, 2008) dichotomizes individuals by the presence of any rare variant in a given bin

where X is the number of cases carrying a rare variant:

Xi are indicator variables for the ith case:

and analogous definitions of Y , Y i for controls.
Alternatively, the test statistic can be expressed as

The p-values of the COLL test are determined by the χ2-distribution with one degree of freedom in case of a fixed threshold and by permutation in case of a variable threshold.
The weights (default: wi = 1) can be used in non-collapsing tests to weight alleles by a pre-determined scheme that mirrors our estimation of their risk.
Currently, three weighting schemes are supported (see also Wu et al. (2011) and references therein).
1/SD. The weights are reciprocal to the standard deviation of allele counts
![]() |
BETA;a;b;. The weights are proportional to the Beta distribution (default: a = 1, b = 25)
![]() |
LOGISTIC;a;b;. Instead of strongly upweighting rare variants, logistic weights are rather a ”soft threshold” alternative to a fixed MAF threshold (default: a = 0.07, b = 150)
![]() |
The weights of rare variants are approximately 1 and for common variants 0.
The Variable Threshold (VT) analysis is activated using the keyword OPTIMAL. The extension to the variable
threshold model for COLL and CMAT tests is motivated by the principle described in (Price et al., 2010).
COLL, CMAT: with NSIM permutations, calculate Tmaxj for each value of MAFij ≤ MAFT contained for each permutation j. The proportion of those Tmaxj that are larger than the non-permuted Tmax0 is used to determine the corresponding p value:
![]() | (7.1) |
FR: The VT approach requires a modification in case of FR test. Since the test statistic depends on
a different number of p values at different MAFi levels, they can not be directly compared.
TFR,MAFi is also monotonically increasing with the value of MAF. The test statistic is thus not
comparable in variable threshold mode. A multiparametric function is required to map different TFR
on an interval taking the number of variants into account, that allows a comparison operator.
One of the suitable functions that maps TFR and allows comparison is the χ2 distribution with 2m degrees
of freedom. Although due to LD the distribution is, in general, anticonservative, it can be used since the
absolute pEst enter the calculation for being used for comparison (”<” and ”>”), at different MAF levels
only.
TFR′=p (the χ2-distribution with 2Ndf) instead of TFR is used to obtain optimal threshold. The p-value is determined by calculating the fraction of smaller minimal test statistics TFR′ across permutations and MAF levels:
![]() | (7.2) |
The FR method is therefore similar to COLL and CMAT VT-methods, except that maximal test statistics are compared at different MAF levels. Small test statistics are considered to be more improbable, which leads to a change of sign in Eq. (7.2) .
Both variable and fixed threshold analyses of COLL and CMAT return ”odds ratios” based on their respective contingency tables. They are useful for determining the direction of effect.
In a permutation framework, computational time and hardware requirements can be significantly reduced by conducting the full number of permutations only for bins that have a chance to come close to the significance level. This is achieved by stepwise analysis.
The first step (RARE_PRE_PRETEST) takes place before any permutations are conducted. We use the fact that the p-values obtained from test statistics TCOLL,TCMAT and TFR for the real dataset on a given bin are rather anticonservative, either due to LD (FR and CMAT) or due to small cell counts (COLL). Choosing a suitable passing threshold eliminates the majority of tests to be conducted.
To reduce computational time even further, we offer an implementation of a ”rare pretest” using a smaller number of simulations. The keyword RARE_PRETEST takes two parameters: an integer number of simulations, after which the passing condition is checked, and a floating point number equal to the threshold p-value for passing the pretest. After the given number of simulations, each window and each individual test is checked for the passing condition.
If the condition for pretests are not satisfied, the p-value will be marked with an asterisk (*) in the output file. All tests are pre-tested independently.
In addition to algorithmic binning procedures, extensive functionality for generating and modifying bins created from user-supplied intervals is provided. Bins based on functional classes or statistical characteristics can be generated from a multitude of sources: LD intervals, conserved elements, gene annotation files and more. This set of features is also helpful for validating bins in a different dataset.
Intervals are supplied by the user in a file, one interval per line, by keyword INTERVAL_IN. Obviously, INTERVAL_IN and the three algorithmic binning methods are mutually exclusive.
We provide an intervalfile (Ensembl_Genes69_hg19.txt) based on a list of coding genes imported from Ensembl using BioMart (http://www.ensembl.org/biomart/martview). The file was created using attributes ”Chromosome Name”, ”Gene Start (bp)”, ”Gene End (bp)”, ”Associated Gene Name”, ”Description”, and exported in the *.tsv format. The resulting file was sorted with respect to chromosome name and gene start.
The formatting of the interval file is defined in a string INTERVALFILE_FORMAT. Required arguments are COLUMNS, CHR, START and END, which give the overall number of columns, and the columns containing start and end positions as well as chromosome names in the file. Columns are counted starting from one. INTERSNP accepts different chromosome naming conventions, for instance ”CHR26”, ”chrMT”, ”M” will be recognized as chromosome 26.
Two additional fields from an interval file can be specified by the user, first of which is FEATURE. A feature is handled as a character string and serves the purpose of tracking intervals more easily. A feature might be a gene name or name of an LD block. The feature column is carried all the way through the calculation and will be appended to the output file. In case that intervals undergo modification, the feature column is modified as well. For instance, when consecutive fields are merged or concatenated, their respective features are put in one field. The second optional field is DESCRIPTION. As the name suggests, it is a character string describing the feature, for example gene function. In contrast to the feature column, the description will appear in the output file only if intervals are not merged or concatenated.
Other optional arguments to INTERVAL_FORMAT include HEAD for the number of header lines to be
skipped. SEP defines the field separator. Currently, tab ("
), space ("SPACE") and comma (",") are
supported as field separators.
The correct interval file format string for the provided file (Ensembl_Genes69_hg19.txt) is
HEAD=”1”;SEPARATOR=”
’;COLUMNS=”5”;CHR=”1”;START=”2”;END=”3”;FEATURE=”4”;DESCRIPTION=”5”;
It is advised to provide physical limits of chromosomes in a tab-delimited file consisting of chromosome name, start and end position when using an interval file. Positions of pseudoautosomal regions are provided using contig names PAR1_X, PAR2_X, PAR1_Y and PAR2_Y, using X- and Y-chromosomal coordinates, respectively.
This will provide consistency checks for input intervals. Besides, it will guarantee correct handling of pseudoautosomal regions. Some genomic features are known to span pseudo- and non-pseudoautosomal regions of sex chromosomes (XG gene, for example). These features will be split and handled separately. Coordinates of features on the pseudoautosomal region of the Y chromosome will be transformed to X-coordinates and put on chromosome 25. Providing coordinates of pseudoautosomal regions will also ensure separate handling of both PARs.
We provide two coordinate files for build 36 and 37:
chromosome_coord_hg18.txt
chromosome_coord_hg19.txt
The files were manually created using the Ensembl genome browser. Due to complexities, the conventions for beginnings and ends of chromosomes may differ. We choose them as start- end endpoints of the first and last canonical positions of euchromatic chromosomal bands, as referenced in the Ensembl database.
In case that some of the intervals from the input file are overlapping, setting MERGE_INTERVALS to 1 will instruct INTERSNP to merge them. In case that intervals have identical coordinates, the default behaviour is to dismiss all but the first of these intervals.
Providing a parameter to FLANKING will instruct INTERSNP to extend intervals in both directions by that many base pairs. If a CHR_BOUNDARIES file is provided, limits given therein will be respected.
If the intervals do not cover the genome in a continuous way, some of the markers in a dataset may be found in the space between intervals and will be dismissed from rare variant analysis. If the provided intervals are non-overlapping or have been merged, this problem can be countered by setting FILL_GAPS to 1. The space in between intervals will be divided in half and each part will be assigned to the nearest interval.
In case that the provided intervals are found to be too small, one can set CONCATENATE_INTERVALS to the number of consecutive intervals to be concatenated into one interval.
Some bins may cover only few rare variants, especially when intervals are small and data sparse. In that case, INTERSNP will unnecessarily process useless bins. In addition, the number of bins will reduce power due to correction for multiple testing. This can be countered by setting FILTER_SMALL_BINS to the minimum number of rare variants per bin. Bins not satisfying this condition will be simply dismissed.
Some bins may cover hundreds or thousands of rare variants. It is advised to set a maximal number of rare variants per bin using the keyword SPLIT_LARGE_BINS to prevent disproportionately high computational burden, in particular when using variable-threshold models.
Bins that are found to contain more rare SNPs than the given limit will be split in bins containing the same number of rare variants, if possible.
MAF_ADJUST is a keyword that controls a simple recalculation of MAFs in SNPs with a nonzero missing rate. The default value of the parameter is 1.
The correction is intended to counter the following problem: the smallest possible MAF for a
non-monoallelic SNP is 1∕N, where N is the number of observed alleles. If M alleles are missing,
the value of the MAF is somewhat larger. Thus, rare SNPs with same allele counts may have
slightly different MAFs. In particular, the value of the observed minor allele frequency of an
autosomal SNP i with Bi observed minor allele counts and
missing genotypes in a dataset
with

allele counts and
individuals is given by

If Mi = 0, the set of possible MAFs is given by

With Mi≠0, the cardinality of the possible MAFs can be much larger due to different possible values created by the denominator (N - Mi)≠N. To transform the MAF of an observed allele to the MAF of expected minor allele counts one obtains the expected minor allele counts by

where

The adjusted MAFs are obtained from
![]() | (7.3) |
For rare variant analysis of data containing missing genotypes, it is usually useful to reduce the number of possible values of MAFi. This requires two assumptions: first, that genotyping errors do not correlate with the true allele, and that the missing rates MRi and minor allele frequencies MAFi are small enough. In this case, one can adjust the corresponding MAFi as
![]() | (7.4) |
This is the correction used by MAF_ADJUST 1 for (pseudo-)autosomal chromosomes with analogous equations for sex chromosomes and mitochondrial DNA. It can be seen from (7.3) that the condition ’small’ MAFi and MRi can be derived from the condition that the adjusted number of counts Bi′ is greater than Bi by n allele counts:

In other words, if the condition

is satisfied, the expected counts of minor alleles coincide with the number of observed alleles and (7.4) holds, otherwise it is probable that some of the missing genotypes contain minor alleles.
Note that we use the corrected MAF only for the purpose of reducing levels in variable threshold analysis (7.5) and calculation of weights (7.4).
Test results and additional information is written to the *Rare.txt file, which is a fully compatible interval file.
Columns without relevance to the particular analysis are omitted.
The quality-controlled, modified, sorted intervalfile (INTERVAL_IN) is written to a new interval file in the working directory.
Armitage P. (1955) Tests for linear trends in proportions and frequencies. Biometrics 11(3): 375–386.
Becker T., Herold C., Meesters C., Mattheisen M., Baur M.P. (2011) Significance levels in genome-wide interaction analysis (GWIA). Ann. Hum. Genet. 75(1): 29–35.
Bishop Y.M., Fienberg S.E., Holland P.W. (2007) Discrete Multivariate Analysis. Theory and Application. Springer, Berlin.
Clayton D. (2008) Testing for association on the X chromosome. Biostatistics 9(4): 593–600.
Cordell H. and Clayton D. (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am. J. Hum. Genet. 70(1): 124–141.
Cordell H.J. (2009) Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 10(6): 392–404.
Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh). ISBN 0-05-002170-2.
Gatto N.M., Campbell U.B., Rundle A.G., Ahsan H. (2004) Further development of the case-only design for assessing gene-environment interaction: evaluation of and adjustment for bias. Int. J. Epidemiol. 33(5): 1014–1024.
Ge Y., Dudoit S., Speed T.P. (2003) Resampling-based multiple testing for microarray data analysis. Test 12(1): 1–77.
Herold C., Steffens M., Brockschmidt F.F., Baur M.P., Becker T. (2009) INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics 25(24): 3275–3281.
Herold C., Mattheisen M., Lacour A., Vaitsiakhovich T., Angisch M., Drichel D., Becker T. (2012) Integrated genome-wide pathway association analysis with INTERSNP. Hum. Hered. 73(2): 63–72.
Holmans P., Green E.K., Pahwa J.S., Ferreira M.A., Purcell S.M., Sklar P., Wellcome Trust Case-Control Consortium, Owen M.J., O’Donovan M.C., Craddock N. (2009) Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85(1): 13–24.
Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K.F., Itoh M., Kawashima S., Katayama T., Araki M., Hirakawa M. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34: 354–357.
Li B., Leal S.M. (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83(3): 311–321.
O’Dushlaine C., Kenny E., Heron E.A., Segurado R., Gill M., Morris D.W., Corvin A. (2009) The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25(24): 2762–2763.
Piegorsch W.W., Weinberg C.R., Taylor J.A. (1994) Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat. Med. 13(2): 153–162.
Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81(3): 559–575.
Price, AL., Kryukov, GV., de Bakker, PIW., Purcell, SM., Staples, J., Wei, LJ., Sunyaev, SR. (2010) Pooled Association Tests for Rare Variants in Exon-Resequencing Studies Am. J. Hum. Genet. 86(6) 832-838
Wang K., Li M., Bucan M. (2007) Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81(6): 1278–1283.
Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89:82–93.
Yang Q., Khoury M.J., Sun F., Flanders W.D. (1999) Case-only design to measure gene-gene interaction. Epidemiology 10(2): 167–170.
Zawistowski M., Gopalakrishnan S., Ding J., Li Y., Grimm S., Zöllner S. (2010) Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am. J. Hum. Genet. 87(5): 604–617.