______________________________________________________________________________________

INTERSNP

Genome-wide Interaction Analysis
_________________________________________________________________________________

Christine Herold
Tim Becker
Dmitriy Drichel
André Lacour
Tatsiana Vaitsiakhovich
Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE)
German Center for Neurodegenerative Diseases
Bonn

Institute for Medical Biometrics, Informatics and Epidemiology (IMBIE)
University of Bonn

Bonn, May 3, 2013

Contents

1 Introduction
2 Using INTERSNP
 2.1 Usage
  2.1.1 Command line
  2.1.2 Selection file
 2.2 Input files
  2.2.1 SNP/Pedigree/Genotype files
  2.2.2 Annotation file
  2.2.3 Pathway file
  2.2.4 Covariate file
  2.2.5 Model file
  2.2.6 Combi file
  2.2.7 SNP file
 2.3 Options
 2.4 Output files
  2.4.1 Output single marker
 2.5 Produce SNP-Proxy-Files for Meta-Analysis with YAMAS
3 Examples
 3.1 Genome-wide Interaction Analysis with Pre-test
 3.2 Strategy I
 3.3 Strategy II
 3.4 Strategy III
 3.5 Strategy IV
 3.6 Strategy V
 3.7 Genome-wide Haplotype Analysis
4 Tests in INTERSNP
 4.1 Single-marker tests
 4.2 Two- and three-marker analysis
  4.2.1 χ2-test for full genotype contingency table (TEST 1)
  4.2.2 Test for interaction using a log-linear model (TEST 2)
  4.2.3 Logistic regression (TESTS 3-12)
   4.2.3.1 Parameter coding
   4.2.3.2 Likelihoods
  4.2.4 Linear regression
   4.2.4.1 Tests (2 SNPs)
   4.2.4.2 Tests (3 SNPs)
   4.2.4.3 X-chromosome
   4.2.4.4 Covariates
   4.2.4.5 User-defined regression models
  4.2.5 Pre-tests
   4.2.5.1 Pre-test allelic interaction (logistic regression)
   4.2.5.2 Pre-test genotypic interaction (logistic regression)
   4.2.5.3 Pre-test genotypic interaction (linear regression)
   4.2.5.4 Pre-test allelic interaction (linear regression)
5 Case-only analysis
 5.1 Case-only allelic interaction test (TEST 15)
 5.2 Case-only SNP-SNP interaction test (TEST 16)
 5.3 Selection file for case-only analysis
6 Pathway association analysis
 6.1 Pathway scores
 6.2 Simulation
 6.3 Selection file PAA
7 INTERSNP-RARE:
Genome-Wide Rare Variant Analysis

 7.1 Introduction
 7.2 Options
 7.3 Statistical Tests
  7.3.1 FR
  7.3.2 CMAT
  7.3.3 COLL
 7.4 Weights
 7.5 Variable Threshold
 7.6 Odds Ratios
 7.7 Rare Pretests
 7.8 Advanced Binning Methods
  7.8.1 User-Supplied Intervals
  7.8.2 Chromosomal Boundaries
  7.8.3 Interval Modification
   7.8.3.1 Merging of Input Intervals
   7.8.3.2 Flanking
   7.8.3.3 Fill Gaps
   7.8.3.4 Concatenating Intervals
  7.8.4 Size of Bins: Quality Control
   7.8.4.1 Filter Small Bins
   7.8.4.2 Maximum Number of Rare Variants
 7.9 Imputation of Minor Allele Frequencies for Missing Genotypes
 7.10 Output
  7.10.1 Association Testing Results: *Rare.txt
  7.10.2 Modified Interval File: *_modified.txt
Bibliography

Chapter 1
Introduction

INTERSNP (Herold et al.2009) is a software for GWAS multi-marker analysis, especially interaction analysis (GWIA), of case-control SNP data and quantitative traits. SNPs are selected for joint analysis using a priori information. Sources of information to define meaningful strategies can be statistical evidence (single marker association at a moderate level, computed on the fly from the data) and genetic/biologic relevance (genomic location, function class or pathway information). Our software product implements a logistic regression framework as well as log-linear models for joint analysis of multiple SNPs, automatic handling of SNP annotation and pathway information and methods to account for multiple testing, in particular, Monte-Carlo simulations (MC-simulations) to judge genome-wide significance. Additionally, different pathway association tests are provided and it is also possible to perform a genome-wide haplotype analysis (GWHA).



NOTE: Single-marker analysis is always carried out by default.

Chapter 2
Using INTERSNP

2.1 Usage

2.1.1 Command line

INTERSNP is written in C/C++ and can be operated from the command line.

Compile: g++ intersnp.cpp -o intersnp -O3.

Compile parallel version: g++ intersnp.cpp -o intersnpParallel -O3 -fopenmp.

There are two main loops which can be parallelized in INTERSNP. To compile a parallel version of INTERSNP it’s necessary to modify the source code and to change either "#define PARALLELN 0" or "#define PARALLELA 0". In order to parallelize the n-loop used for MC-simulations, in particular pathway association analysis, change "#define PARALLELN 0" to "#define PARALLELN 1". To do a full Genome-wide interaction analysis (GWIA) it’s recommended to parallelize the A-loop, change "#define PARALLELA 0" to "#define PARALLELA 1".

Run: ./intersnp selectionfile.txt

Run parallel version: ./intersnpParallel selectionfile.txt

2.1.2 Selection file

Various analysis options (statistical test, priorities, ...) can be specified by the user in a selection file. The file consists of three columns: keyword, parameter, and optionally comments, and is separated by whitespaces (tab, blank). Further columns are ignored.

To get familiar with the selection file, first of all a short example is introduced which will be described in detail.

Keyword Parameter Comment 
TPED    ./test.tped  // path to the .tped file 
TFAM    ./test.tfam  // path to the .tfam file 
ANNOTATIONFILE./annotation_file.txt  // path to the annotation file 
HWE_P_CASE  0.001  // QC-threshold for HWE in cases 
HWE_P_CONTROL  0.01  // QC-threshold for HWE in controls 
MRDIFF  1  // QC-threshold missing rate (mr): individuals and SNPs 
              worse than the average mr + MRDIFF will be deleted 
MAF        // QC-threshold minor allele frequency 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis 
TEST  5  // logistic regression model: 
            test for additive interaction 
SINGLETOP  50000 // top 50000 single-marker results 
M_WITH_SINGLETOP  2  // all pairs of SNPs which are among 
                        the top 50000 single-marker results 
GENETIC_IMPACT  3  // genetic impact: within coding region 
M_WITH_GENETIC_IMPACT  2  // all pairs of SNPs which lie in a 
                            coding region of a gene 
OUTPUTNAME    ./myproject/test  // specify the output name 
END

TPED is the path to the .tped input file and TFAM is the path to the .tfam file. In this example we also use an annotation file (ANNOTATIONFILE) to get the genetic information. After the input files we define some quality control (QC) criteria to delete individuals and SNPs not satisfying the QC-criteria of HWE_P_CASE and HWE_P_CONTROL, MRDIFF (missing rate) and MAF (minor allele frequency). Every analysis starts with a single-marker test which can be specified by SINGLEMARKER. In our example we use Armitage’s trend test (SINGLEMARKER 1). Furthermore, we would like to do a two-marker analysis, so we select TWO_MARKER 1 and choose TEST 5, which is a test for additive interaction (see section 4.1). To reduce the number of tests we use the statistic and genetic criteria. SINGLETOP 50000 and M_WITH_SINGLETOP 2 means that we analyze only pairs of SNPs which are among the top 50000 single-marker results. In addition, these pairs should also be in a coding region of a gene (GENETIC_IMPACT 2, M_WITH_GENETIC_IMPACT 3). Finally, we specify the output name (OUTPUTNAME).

All the input files and options of the complete selection file below will be described in the following sections.

KeywordParameter Comment 
# data files 
FILE        // Path to SNP/Pedigree/Genotype files (.map/.ped) 
TFILE       // Path to SNP/Pedigree/Genotype files (.tfam/.tped) 
BFILE       // Path to SNP/Pedigree/Genotype files (.fam/.bim/.bed) 
MAP     ./test.map   // Path to the .map file 
PED     ./test.ped   // Path to the .ped file 
FAM     ./test.fam   // Path to the .fam file 
BMAP    ./test.bim   // Path to the .bim file 
BPED    ./test.bed   // Path to the .bed file 
TFAM    ./test.tfam  // Path to the .tfam file 
TPED    ./test.tped  // Path to the .tped file 
ANNOTATIONFILE  ./annotationfile.txt  // Path to the annotation file 
PATHWAYFILE     ./pathwayfile.txt  // Path to the pathway file 
COVARIATEFILE   ./covariatefile.txt  // Path to the covariate file 
MODELFILE       ./modelfile.txt  // Path to the model file 
SNPFILE         ./snpfile.txt  // Path to the SNP file 
COMBIFILE       ./combifile.txt  // Path to the combi file 
# Selection/reduction of data 
SNPLIST     0   // Only SNPs from the SNP file will be analyzed: 
                   1=yes, 0=no 
COMBILIST   0   // Only SNPs pairs/triples from the combi file will be 
                   analyzed: 1=yes, 0=no 
ONLY_MALE   0   // Only male persons will be analyzed: 1=yes, 0=no 
ONLY_FEMALE 0   // Only female persons will be analyzed: 1=yes, 0=no 
POSCHOICE  // Selection of certain regions (ONLY the selected region 
              will be analyzed); example:chr4;chr12,123000-160000;chr24; 
NEGCHOICE  // Exclusion of certain regions (these regions will NOT 
              be analyzed) 
MATCHINGCHOICE  //  Selection of certain regions for pairwise matching 
QT    0  // Use the analysis of quantitative traits: 1=yes, 0=no 
MISSING_PHENO   -9   // Definition of the missing phenotype 
LD_DISTANCE 500000 // Define the LD distance between SNPs pairs, in bp 
# Quality control 
HWE_P_CASE    0.001  // QC-threshold for HWE in cases 
HWE_P_CONTROL  0.01   // QC-threshold for HWE in controls 
MRDIFF    0.1  // QC-threshold missing rate (mr): individuals and SNPs 
                worse than the average mr + MRDIFF will be deleted 
MAF    0   // SNPs with lower MAF are deleted 
IBS_SD_RELATIVES    // SD-limit for relatives 
IBS_SD_OUTLIER      // SD-limit for outlier 
DESTRATIFY    0    // Handling stratification: 0=off, 1=matched pairs 
# Single- and multi-marker analysis 
SINGLE_MARKER 1 // 1=Armitages trend test, 2=genotype test 2 d.f., 
               3=logistic regression 1 d.f., 4=logistic regression 2 d.f. 
TWO_MARKER  1  // Two-marker analysis: 1=yes, 0=no 
THREE_MARKER  0  // Three-marker analysis: 1=yes, 0=no 
PRETEST  0  // Pre-test: 1=yes, 0=no 
PRETEST_CUTOFF   0.05   // Pre-test threshold 
TEST 2 //1=chi-square test, 2=log-linear model, 3-12=logistic regression, 
      15=case-only allelic interaction, 16=case-only SNP-SNP interaction, 
      M=user-defined logistic regression model 
COVARIATES  // Covariates: all covariates 1-10; or, for example, 1;8-10; 
SEXCOV  0   // sex as a covariate: 1=yes, 0=no 
# Priorities 
SINGLETOP   500  // Length n of the single-marker p-values toplist 
M_WITH_SINGLETOP  2  // number of SNPs (0,1,2,3) which shall be selected 
                        from the single-marker p-values toplist 
GENETIC_IMPACT 1 //Genetic impact:0=none (gene desert), 1=close to gene, 
                 2=within gene, 3=within coding region, 4=non-synonymous 
M_WITH_GENETIC_IMPACT  1 // Number of SNPs (0,1,2,3) which shall have 
                            at least the selected genetic impact 
SNP1    // Fix first SNP for analysis; specify rs number 
SNP2    // Fix second SNP for analysis; specify rs number 
SNP3    // Fix third SNP for analysis; specify rs number 
PATHWAY    0   // Include pathway information: 1=yes,  0=no 
# Pathway Association Analysis 
PATHWAYANALYSIS    0  // Pathway association analysis with one of the six 
                      pathway association tests 
PATHWAYTEST 2 // 1=SNP ratio, 2=Fisher score, 3=gene ratio, 
                 4=Fisher Max, 5=Fisher MaxPlus, 6=interaction ratio 
P_INTERRATIO  0.5  // p-value cutoff for TEST 6 
# Genome-wide Haplotype Analysis 
HAPLO  0   // Genome-wide Haplotype Analysis (GWHA): 1=yes, 0=no 
HAPLO_DIST 50000 // Define the haplotype distance 
                    for SNPs pairs/triples, in base pairs (bp) 
DOHAPFILE       // Produce proxy-file for YAMAS 
# MC-Simulation 
SIMULATION 0 // Number of MC-simulations (0=no simulations) 
MC_WITH_SM 0 // 1=MC-simulation for multi-marker AND single-marker test, 
                0=MC-simulation only for multi-marker tests 
# Output 
PRINTTOP    100 // Best n multi-marker p-values are printed 
PFILTER         // p-value cutoff for multi-marker analysis 
OUTPUTNAME ./myproject/test // Specify the output name 
ANNOTATE  // Add annotation information in the output files 
GENECOL   // Specify column number of gene in the annotation information 
END

2.2 Input files

2.2.1 SNP/Pedigree/Genotype files

Possible data input formats are the PLINK (Purcell et al.2007) formats (.ped/.map; .tped/.tfam; .bed/.bim/.fam).

For instance the .tped file (TPED) contains SNP and genotype information, where each row represents a SNP and two columns represent genotypes per individual. The first four columns are chromosome, marker ID, genetic distance and base pair position.

16 rs8466895 0 37354 T T T T T T C C C T 
16 rs216590 0 41263 G A G A A A A A G A 
16 rs216596 0 45320 A G G G G G G G A G 
16 rs2541594 0 45444 C T T T C T C C C C 
16 rs8466 0 49427 T T T T T T C C C T 
16 rs216590 0 52259 G A G A A A A A G A 
...

NOTE: To improve running time it it recommended to sort the file by chromosome and position.
The .tfam file (TFAM) contains family information, each row represents a person. To create a .tfam file use the first six columns of a .ped file: Family ID, Individual ID, Father, Mother, Sex and Affection status.

co1 co1 0 0 1 1 
co2 co2 0 0 2 1 
... 
ca1 ca1 0 0 2 2 
ca2 ca2 0 0 2 2 
...

For more details on the data formats visit the PLINK website http://pngu.mgh.harvard.edu/ purcell/plink/data.shtml.

2.2.2 Annotation file

To work with genetic criteria a SNP annotation file (ANNOTATIONFILE) is needed. We use the same format as in the Illumina Human610 chip annotation file. The first columns represent rs number, chromosome, base pair position and genome_build number; all the other columns show detailed information. For our different categories the following columns are important:

The file can be downloaded from our homepage (http://intersnp.meb.uni-bonn.de/ Downloads).

name;chr;coordinate;genome_build;gene_symbol;gene_id;accession;location; 
location_relative_to_gene;coding_status;amino_acid_change;id_with_mouse; 
phast_conservation 
rs12354060;1;10004;36.2;LOC653635;653635;XR_017611.1;intron;-1762; 
NULL;NULL;NULL;NULL 
rs2691310;1;46844;36.2;LOC642894;642894;XR_016145.1;flanking_5UTR;-672; 
NULL;NULL;NULL;NULL 
rs2531266;1;59415;36.2;OR4F5;79501;NM_001005484.1;coding;[461/456]; 
SYNON;A154A (NP_001005484.1);0.64;0.979 
rs4124251;1;97215;36.2;LOC727901;727901;XR_015157.1;3UTR;[3300/491]; 
NULL;NULL;NULL;NULL 
rs8179466;1;224176;36.2;LOC728481;728481;XR_015292.1;intron;-42;NULL; 
NULL;NULL;NULL 
...

NOTE: GENETIC_IMPACT and M_WITH_GENETIC_IMPACT have to be selected.
NOTE: To improve running time it is recommended to sort the file by chromosome and position.

2.2.3 Pathway file

To include pathway information a pathway file (PATHWAYFILE) is needed. A file containing KEGG (Kanehisaet al.), (O’Dushlaine et al.2009) pathways1 can be downloaded from our homepage (Downloads). We use the KEGG_2_SNP_229.txt file with genes annotated by us. The columns present pathway name, rs number and gene.

Pathway rsNo    Gene 
hsa00010 rs61487361 HK2 
hsa00010 rs2286168  ALDH3B1 
hsa00010 rs41275697 ADH1B 
hsa00010 rs191970 PGM1 
hsa00010 rs7583259 GALM 
...

NOTE: PATHWAY has to be selected (1).
NOTE: To improve running time it is recommended to sort the file by pathway name.

2.2.4 Covariate file

To include own covariates into the analysis use the covariate file (COVARIATEFILE). The first two columns represent Family ID and Person ID and then up to 10 covariates are allowed.

FID PID COV1 COV2 
co1 co1 1.31 5.75 
co2 co2 7.24 5.97 
... 
ca1 ca1 1.85 5.12 
ca2 ca2 2.36 6.42 
...

NOTE: Covariates are selected with the COVARIATES keyword. If all covariates shall be used, write 1-10; (semicolon is important)
NOTE: Missing data value is "–" or "x".

2.2.5 Model file

For user-defined logistic/linear regression models that are not included in the list of tests, the model file (MODELFILE) is needed. The first column presents the name of the parameter followed by 0/1 indicators for the first and the second likelihood. The indicators define which of the compared likelihoods L1 and L2 are used.

PARAMETER L1  L2 
x1 1 0 
x1D 0 0 
x2  0 0 
x2D 0 0 
x1x2  0 0 
x1x2D 0 0 
x1Dx2 0 0 
x1Dx2D  0 0 
x3  0 0 
x3D 0 0 
x1x3  0 0 
x1x3D 0 0 
x1Dx3 0 0 
x1Dx3D  0 0 
x2x3  0 0 
x2x3D 0 0 
x2Dx3 0 0 
x2Dx3D  0 0 
x1x2x3  0 0 
x1x2x3D 0 0 
x1x2Dx3 0 0 
x1x2Dx3D  0 0 
x1Dx2x3 0 0 
x1Dx2x3D  0 0 
x1Dx2Dx3  0 0 
x1Dx2Dx3D 0 0

NOTE: TEST has to be set to M.

2.2.6 Combi file

To analyse only user-specified SNP pairs or triples a combi file (COMBILIST) is required. Each row represents one combination.

rs11248850 rs7190878 
rs7404049 rs4984707 
rs11649498 rs1040499 
...

NOTE: COMBILIST has to be set to 1.

2.2.7 SNP file

To analyse all combinations of user-specified SNPs from the .tped file a SNP file (SNPFILE) is required. Each row represents one SNP.

rs11248850 
rs7190878 
rs7404049 
rs4984707 
rs11649498 
...

NOTE: SNPLIST has to be set to 1.

2.3 Options

Keyword

Example

Default

Description

FILE

Path to SNP/Pedigree/Genotype files. The input format is the standard PLINK format (.ped/.map) (section 2.2.1)

TFILE

Path to SNP/Pedigree/Genotype files. The input format is the transposed PLINK format (.tped/.tfam) (section 2.2.1)

BFILE

Path to SNP/Pedigree/Genotype files. The input format is the binary PLINK (format (.bed/.bim/.fam) section 2.2.1)

FAM / TFAM / BFAM

./test.fam

Path to the .fam/.tfam file

MAP

./test.map

Path to the .map file

BMAP / BIM

./test.bim

Path to the .bim file

PED

./test.ped

Path to the .ped file

TPED

./test.tped

Path to the .tped file

BPED / BED

./test.bed

Path to the .bed file

ANNOTATIONFILE

./annotationfile.txt

Path to the annotation file (section 2.2.2). Only needed if genetic criteria is used

NOTE: GENETIC_IMPACT and M_WITH_GENETIC_IMPACT have to be selected

NOTE: To improve running time it is recommended to sort the file by chromosome and position

PATHWAYFILE

./pathwayfile.txt

Path to the pathway file (section 2.2.3)

NOTE: PATHWAY has to be switched on 1

NOTE: To improve running time it is recommended to sort the file by pathway name

COVARIATEFILE

./covariatefile.txt

Path to the covariate file (section 2.2.4). Only needed if covariates are included into the analysis

NOTE: Covariates are selected with the COVARIATES keyword (see below). If all covariates are to be used, write 1-10;

NOTE: Missing data value is "–" or "x"

MODELFILE

./modelfile.txt

Path to the model file (sections 2.2.5, 4.2.3). Only needed for user-defined logistic regression models that are not included in the list of tests

NOTE: TEST has to be switched on M

SNPFILE

./snpfile.txt

Path to the SNP file (section 2.2.7)

NOTE: SNPLIST has to be put on 1

COMBIFILE

./combifile.txt

Path to the combi file (section 2.2.6)

NOTE: COMBILIST has to be put on 1

SNPLIST

0

Only SNPs from the SNP file will be analyzed: 1=yes, 0=no

COMBILIST

0

Only SNPs pairs/triples form the combi file will be analyzed: 1=yes, 0=no

ONLY_MALE

0

Only male persons will be analyzed: 1=yes, 0=no

ONLY_FEMALE

0

Only female persons will be analyzed: 1=yes, 0=no

POSCHOICE

chr4; chr12,123000-160000; chr24;

Selection of certain regions (ONLY the stated regions will be analyzed)

NOTE: Use the notation shown in the example

NEGCHOICE

chr4; chr12,123000-160000; chr24;

Exclusion of certain regions (The stated regions will NOT be analyzed)

NOTE: Use the notation shown in the example

QT

0

Use the analysis of quantitative traits: 1=yes, 0=no (case-control)

MISSING_PHENO

-9

-99/0

Specify the missing phenotype value

By default it is -99 for QT 1, and is 0 for QT 0

NOTE: When using QT 1, make sure that the missing phenotype value is defined

LD_DISTANCE

Define the LD distance between SNP pairs, in base pairs (bp)

HWE_P_CASE

0.001

1

QC-threshold for HWE in cases (p-value cutoff)

HWE_P_CONTROL

0.01

1

QC-threshold for HWE in controls (p-value cutoff)

MRDIFF

0.02

1

QC-threshold missing rate (mr): individuals and SNPs worse than the average mr + MRDIFF will be deleted (iterative algorithm)

MAF

0

SNPs with lower MAF will be deleted

SINGLE_MARKER

1=Armitage’s trend test, 2=genotype test with 2 d.f., 3=logistic regression with 1 d.f., 4=logistic regression with 2 d.f.

TWO_MARKER

0

Change parameter from 0 to 1 to perform two-marker analysis

NOTE: It’s not possible to select two-marker and three-marker analysis together

THREE_MARKER

0

Change parameter from 0 to 1 to perform three-marker analysis

NOTE: It’s not possible to select three-marker and two-marker analysis together

PRETEST

1

Pre-test is selected: 1=yes, 0=no

PRETEST_CUTOFF

0.05

Pre-test threshold. If Pre-test yields p-value <= cutoff, TEST is conducted

TEST

Multi-marker tests: 1=chi-square test on contingency table (8 d.f.), 2=log-linear model (4 d.f., test FOR interaction), 3-12=logistic regression models, 15=case-only allelic interaction, 16=case-only SNP-SNP interaction, M=User-defined logistic regression model (MODELFILE). Detailed description are presented in the model table

NOTE: tests 1-2 are possible for two-marker and three-marker analysis, tests 3-8 are used only for two-marker analysis and tests 9-12 are only for three-marker analysis

COVARIATES

1-5;7;8;

Select the covariates which are to be considered, separate by “;”, condense with “-”. Numbers must be put in ascending order.

NOTE: It’s only possible to use this option if the path to the covariate file (keyword: COVARIATEFILE) is declared.

SEXCOV

Sex as a covariate: 1=yes, 0=no. Can be selected without a covariate file

SINGLETOP

1000

Based on these single-marker p-values (Armitage’s trend test or genotype test with 2 d.f.), a list of n top SNPs will be computed. The length can be specified with this option

M_WITH_SINGLETOP

2

Number of SNPs (0,1,2,3) which have to be taken from the single-marker p-values toplist

NOTE: SINGLETOP has to be specified

NOTE: M_WITH_SINGLETOP 3 is only possible when THREE_MARKER 1

GENETIC_IMPACT

3

Genetic impact: 0=none (gene desert), 1=close to gene, 2=within gene, 3=within coding region, 4=non-synonymous

NOTE: To use this option the path to the annotation file (keyword: ANNOTATIONFILE) has to be declared

M_WITH_GENETIC_IMPACT

2

Number of SNPs (0,1,2,3) involved in the analysis which shall have at least the selected genetic impact

NOTE: GENETIC_IMPACT has to be specified

NOTE: M_WITH_GENETIC_IMPACT 3 is only possible when THREE_MARKER 1

SNP1

Fix first SNP for analysis, specify rs number

SNP2

Fix second SNP for analysis, specify rs number

SNP3

Fix third SNP for analysis (possible only when THREE_MARKER 1), specify rs number

PATHWAY

Pathway information is used as a prior for interaction analysis: 1=yes, 0=no

NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared

PATHWAYANALYSIS

0

Pathway association analysis with one of the six pathway association tests (chapter 6)

NOTE: Choose, for instance, a number of SNP, multiply it by 0.05 (50000×0.05 = 25000 SINGLETOP 25000). Simulations should be put on at least 999

NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared

PATHWAYTEST

2

1=SNP ratio, 2=Fisher score (all significant SNPs),

3=gene ratio (ALIGATOR), 4=Fisher Max (best SNP per gene),

5=Fisher MaxPlus

NOTE: PATHWAYANALYSIS has to be selected.

NOTE: To use this option the path to the pathway file (keyword: PATHWAYFILE) has to be declared

P_INTERRATIO

p-value cutoff for TEST 6

HAPLO

Genome-wide Haplotype Analysis (GWHA) for all SNPs pairs/triples less than HAPLO_DIST apart

NOTE: To use two-marker select TWO_MARKER 1, for three-marker select THREE_MARKER 1

NOTE: To use this option the following options have to be selected: HAPLO 1 and HAPLO_DIST have to be specified

HAPLO_DIST

50000

0

To define the haplotype distance. Here all pairs/triples less than 50000 bp apart

DOHAPFILE

0

Produce proxy-file for YAMAS (section 2.5)

SIMULATION

Number of MC-simulations (0=no simulations will be conducted)

MC_WITH_SM

0

1=MC-simulations account for multi-marker AND single-marker tests. 0=MC-simulations account only for multi-marker tests

PRINTTOP

100

Best n multi-marker p-values are printed

PFILTER

p-value cutoff for multi-marker analysis

OUTPUTNAME

./myproject/test

Specify the output name

ANNOTATE

Add annotation information in the output files

GENECOL

5

Specify column number of gene in the annotation information

DOIBS

0

Calculate average IBS-value of all possible pairs: 1=yes and stop afterwards, 2=yes and proceed with analysis, 0=no

IBS_SD_RELATIVES

4.0

Limit in standard deviations for high/low-IBS. Pairs that exceed this limit are considered to be relatives

IBS_SD_OUTLIER

4.0

Limit in standard deviations. People whose low-IBS-count exceed this limit are considered to be outlier

STRATIFY_STRUCT

0

PAIR, GROUP or CLUSTER

STRATIFY_VALID

VICINITY

CLUSTER, VICINITY

MATCHINGCHOICE

chr4;chr12,123000-160000;chr24;

Selection of certain regions (ONLY the stated regions will be considered for the pairwise matching)

CLUSTER_COVARS

1

0

Creates one covariate for each cluster and puts 0/1 for individuals                                                                                                                       ´  cluster affiliation. Note: choose a test that considers covariates.

FAMCLUSTER

1

0

Creates permution clusters from families, overrides STRATIFY

2.4 Output files

The following output files are created by the program:

2.4.1 Output single marker

The parameters in the output file depend on the used single-marker-test (Armitage’s trend test, genotype test with 2 d.f., logistic regression with 1 d.f., logistic regression with 2 d.f.). The following list shows all possible parameters.

2.5 Produce SNP-Proxy-Files for Meta-Analysis with YAMAS

With INTERSNP it is possible to produce a file with r2-values for all SNPs that lie within a pre-specified distance, for instance from HAPMAP and/or 1000 Genomes data (.tped/.tfam files). In addition to the basic keywords (TPED/TFAM, QC options,...), the following keywords have to be specified in the selectionfile:

TWO_MARKER 1 
TEST 13 
QT 1 
HAPLO 1 
HAPLO_DIST 10000 
DOHAPFILE 1

In the above example, r2 will be computed for all SNPs with a distance smaller than 10000 bp. The outputfile (*hapfile.txt) will contain all SNP pairs for which r2 0.20.

Example:

16 rs1211375 A C rs3918352 A G 0.338 1000 0 
16 rs3918352 A G rs11642609 C T 0.659 2000 1 
16 rs8051485 T C rs11248914 C T 0.212 15722 1 
...
Column 1)
chromosome
Column 2)
marker id of the first marker, e. g. rs-id of the first SNP
Column 3)
minor allele of the first marker or SNP
Column 4)
major allele of the first marker or SNP
Column 5)
marker id of the second marker, e. g. rs-id of the second SNP
Column 6)
minor allele of the second marker or SNP
Column 7)
major allele of the second marker or SNP
Column 8)
the difference of the markers base pair positions
Column 9)
r2
Column 10)
identifies proxy alleles. In line 1, 0 indicates that the haplotype A-A (composed of columns 3 and 6, for each SNP the first listed allele) is more frequent than expected under linkage disequilibrium. It immediately follows that also the C-G haplotype is more frequent than expected under linkage disequilibrium. Therefore, allele A from SNP 1 and allele A from SNP 2 are mutual proxy alleles, and likewise allele C from SNP 1 and allele G from SNP 2. In line 2, the situation is reverse, as indicated by a 1 in the last column. Here, allele A (column 3) from SNP1 and allele T (second(!) allele of SNP 2, column 7) are the mutual proxy alleles.

A more detailed output is obtained with the choice DOHAPFILE 2.

16 rs1211375 A C rs3918352 A G 0.378 0.622 0.425 0.575 0.300 0.078 
0.125 0.497 0.641 0.338 0 
16 rs1203957 T G rs3918352 A G 0.110 0.890 0.425 0.575 0.110 0.000 
0.315 0.575 1.000 0.167 0 
16 rs3918352 A G rs11642609 C T 0.412 0.588 0.485 0.515 0.000 0.412 
0.485 0.103 1.000 0.659 1

The first seven columns are the same as before. Columns 8-9 are allele frequencies for SNP 1, columns 10-11 are allele frequencies for SNP 2. Columns 12-15 are the four haplotype frequencies, here in lexicographic order, i.e. ignoring the proxy definition. Column 16 is D, column 17 is r2, and column 18 is the proxy indicator. The computation of the indicator is based on the allele and haplotype frequencies listed.

For more details visit the YAMAS web-page http://yamas.meb.uni/bonn.de/

Chapter 3
Examples

Here we present how to conduct a full Genome-wide Interaction Analysis with a pre-test, six GWIA strategies to demonstrate the usage of the software and how to do a Genome-wide Haplotype Analysis. Notice that only the relevant options are presented in the selection file.

3.1 Genome-wide Interaction Analysis with Pre-test

TPED    ./test.tped   // path to the .tped file 
TFAM  ./test.tfam   // path to the .tfam file 
SINGLE_MARKER  1  // Armitages trend test 
PRETEST 1 // pre-test: 1=yes 
PRETEST_CUTOFF 0.01 // pre-test threshold. Here, threshold for 
                       pre-test is 0.01 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  2  // log-linear model, test for genotypic interaction 
OUTPUTNAME  ./myproject/test  // specify the output name 
END

3.2 Strategy I

All pairs of SNPs with at least one SNP among the top 10 single-marker results

TPED    ./test.tped  // path to the .tped file 
TFAM    ./test.tfam  // path to the .tfam file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  2  // log-linear model, test for genotypic interaction 
SINGLETOP   10// top 10 single-marker results 
M_WITH_SINGLETOP  1  // at least one SNP among the top 10 
                        single-marker results 
OUTPUTNAME  ./myproject/test   // specify the output name 
END

3.3 Strategy II

All pairs of SNPs which are among the top 1000 single-marker results

TPED  ./test.tped  // path to the .tped file 
TFAM  ./test.tfam  // path to the .tfam file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  5  // logistic regression model: 
            test for additive interaction 
SINGLETOP  1000  // top 1000 single-marker results 
M_WITH_SINGLETOP    2   // all pairs of SNP which are among the top 1000 
                           single-marker results 
OUTPUTNAME  ./myproject/test   // specify the output name 
END

3.4 Strategy III

All pairs of SNPs which are among the top 50000 single-marker results and also lie in a coding region of a gene

TPED  ./test.tped  // path to the .tped file 
TFAM  ./test.tfam  // path to the .tfam file 
ANNOTATIONFILE./annotationfile.txt  // path to the annotation file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  5  // logistic regression model: 
            test for additive interaction 
SINGLETOP      50000  // top 50000 single-marker results 
M_WITH_SINGLETOP    2   // all pairs of SNPs which are among the 
                           top 50000 single-marker results 
GENETIC_IMPACT3  // genetic impact: within coding region 
M_WITH_GENETIC_IMPACT   2  // all pairs of SNPs which lie in a 
                              coding region of a gene 
OUTPUTNAME  ./myproject/test  // specify the output name 
END

3.5 Strategy IV

All pairs of non-synonymous SNPs

TPED  ./test.tped // path to the .tped file 
TFAM  ./test.tfam    // path to the .tfam file 
ANNOTATIONFILE./annotationfile.txt // path to the annotation file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  5  // logistic regression model: 
            test for additive interaction 
GENETIC_IMPACT  4  // genetic impact: non-synonymous SNPs 
M_WITH_GENETIC_IMPACT  2  // all pairs of non-synonymous SNPs 
OUTPUTNAME  ./myproject/test   // specify the output name 
END

3.6 Strategy V

All pairs of SNPs which are among the top 50000 single-marker results and also lie in the same pathway

TPED  ./test.tped // path to the .tped file 
TFAM  ./test.tfam    // path to the .tfam file 
PATHWAYFILE ./KEGG_2_SNP_229.txt // path to the pathway file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
TEST  2  // log-linear model: test for genotypic interaction 
SINGLETOP  50000  // top 50000 single-marker results 
M_WITH_SINGLETOP  2  // all pairs of SNPs which are among the 
                        top 50000 single-marker results 
PATHWAY  1  // all pairs of SNPs which lie in the same pathway 
OUTPUTNAME  ./myproject/test   // specify the output name 
END

3.7 Genome-wide Haplotype Analysis

TPED    ./test.tped // path to the .tped file 
TFAM    ./test.tfam // path to the .tfam  file 
SINGLE_MARKER  1  // Armitages trend test 
TWO_MARKER  1  // two-marker analysis: 1=yes 
HAPLO 1// haplotype analysis 
HAPLO_DIST  50000  // pathway analysis with Fisher score 
OUTPUTNAME  ./myproject/test  // specify the output name 
END

Chapter 4
Tests in INTERSNP

4.1 Single-marker tests

By default, a single-marker p-value is computed for all QC-SNPs. For the autosomes and the pseudoautosomal region of the Y chromosome the user can either choose Armitage’s trend test (Armitage1955) or the genotype test with 2 degrees of freedom (d.f.). It is also possible to include covariates in the logistic/linear regression, the allelic test with 1 d.f. and the genotypic test with 2 d.f. Another advantage of the logistic/linear regression is that all chromosomes, including X and Y, can be analyzed. Without logistic/linear regression Y-chromosomal markers are evaluated with the χ2-test for the 2 × 2 table of allele counts in male individuals. For X-chromosomal SNPs, we use the allele-based test with 1 d.f. suggested by D. Clayton (Clayton2008). Claytons’s approach guarantees that hemizygote males and homozygote females contribute equally to the test statistic, reflecting the fact that only one of the X-chromosomes is active in females.

4.2 Two- and three-marker analysis

4.2.1 χ2-test for full genotype contingency table (TEST 1)

For two SNPs, there are 3 × 3 = 9 two-marker genotypes. We consider the respective 3 × 3 × 2 contingency table of counts in cases and controls. Let T be the standard test statistic for contingency tables, i.e., T = i,j,k(Oi,j,k-Ei,j,k)2
    Ei,j,k, where i,j ∈{1,2,3}, k ∈{1,2}, Oi,j,k is the observed number of counts in cell (i,j,k) and Ei,j,k is the number of counts in cell (i,j,k) expected under the null hypothesis. T is χ2-distributed with 8 d.f. By construction, TEST 1 does not yield a test for interaction, but a test that has power if both SNPs interact. The test can become significant through marginal SNP effects alone.

4.2.2 Test for interaction using a log-linear model (TEST 2)

Following (Bishop et al.2007) the observations xijk of the 3 × 3 × 2 contingency table are modeled by fitting a log-linear model with expected cell counts mijk to the cell entries log mijk = u + u1(i) + u2(j) + u3(k) + u12(ij) + u13(ik) + u23(jk) without the term u123(ijk). Testing the null hypothesis H0 : u123(ijk) = 0 yields an explicit test for genotypic interaction. With maximum likelihood estimates mijk for the cell counts we obtain the test statistic T = 2 i,j,kxijk log  xijk
mijk that is χ2-distributed with 4 d.f. The test is equivalent to the respective 4 d.f. genotypic test in the logistic regression framework, but is much faster to compute.

4.2.3 Logistic regression (TESTS 3-12)

TEST 1 and TEST 2 are quick screening tests.

With logistic regression (Cordell and Clayton2002) a more flexible modeling is possible: Test for and under interaction, allelic und genotypic tests and the use of covariates.

We define logit(p) := log( p
1--p) = βT x, where β is the vector of coefficients to be estimated and x is a vector representing genotype information.

4.2.3.1 Parameter coding

We follow the coding suggested by (Cordell and Clayton2002). For each SNP i, i ∈{1,2,3}, we model its allelic effect xi by coding the genotypes (1,1), (1,2), and (2,2) as x1 = -1, x2 = 0, x3 = 1. Next, we model dominance effects xi,D, i ∈{1,2,3}, as x1,D = -0.5, x2,D = 0.5, x3,D = -0.5 for the genotypes (1,1), (1,2), and (2,2), respectively. By multiplication, we obtain interaction terms, for instance, x1x2 represents allelic interaction between SNPs 1 and 2, while x1,Dx2,D represents interaction between the dominance terms of SNPs 1 and 2. Note that these interaction terms code interaction on an additive logit scale and, hence, on a multiplicative odds ratio scale.

4.2.3.2 Likelihoods

Let β0 be the intercept parameter that defines the baseline likelihood L0 := logit(p) = β0. Next, the likelihood L1A := β0 + β1x1 models the allelic effect of SNP 1 and comparison to L0 yields a likelihood-ratio test with 1 degree of freedom. Analogously, comparison of L1G := β0 + β1x1 + β1,Dx1,D to L0 yields a genotypic test for SNP 1 with 2 d.f. In general, we let L1,2A and Li,jG denote likelihoods containing allelic terms, or, respectively, allelic and genotypic terms, for SNPs 1 and 2. In addition let, L1,2A,I and L1,2G,I be likelihoods that also contain interaction terms, for instance, L1,2A,I = β0 + β1x1 + β2x2 + β1,2x1x2 and L1,2G,I = β0 +β1x1 +β1,Dx1,D +β2x2 +β2,Dx2,D +β1,2x1x2 +β1,2Dx1x2,D +β1D,2x1,Dx2 +β1D,2Dx1,Dx2,D

Test No Test Formula d.f. Comment
1 chi-square-test 8 Full genotype test (contingency table)
2 log-linear model l1,2G,I vs l1,2G 4 Test for genotypic interaction
3 logistic regression L1,2A,I vs L0 3 Full additive test
4 logistic regression L1,2G,I vs L0 8 Full genotype test
5 logistic regression L1,2A,I vs L1,2A 1 Test for additive interaction
6 logistic regression L1,2G,I vs L1,2G 4 Test for genotypic interaction
7 logistic regression L1,2A,I vs L1A 2 Additional allelic effect of SNP2
8 logistic regression L1,2G,I vs L1G 6 Additional genotypic effect of SNP2
9 logistic regression L1,2,3A,I vs L0 7 Full additive test
10 logistic regression L1,2,3A,I vs L1,2,3A 4 Test for allelic interaction
11 logistic regression L1,2,3A,I vs L1,2,3A,I2 1 Test for 3-fold allelic interaction
12 logistic regression L1,2,3A,I vs L1,2A,I 4 Additional allelic effect of third locus
15 chi-square-test 1 case-only allelic interaction test
16 chi-square-test 4 case-only SNP-SNP interaction test
M logistic regression - - User defined logistic regression (modelfile)
Table 4.1: Model table for case-control data

4.2.4 Linear regression

To analyze quantitative traits we use linear regression. The parameter coding is equal to logistic regression. For linear regression, we define y = βT x, where y is a quantitative trait variable, and the sum of squared errors (SSE) is computed. The errors measure, per person i, the deviation between the observed value yi and its estimate ŷi predicted by the model.

Test No Test Formula d.f. Comment
3 linear regression SSE1,2A,I vs SSE0 3 Full additive test
4 linear regression SSE1,2G,I vs SSE0 8 Full genotype test
5 linear regression SSE1,2A,I vs SSE1,2A 1 Test for additive interaction
6 linear regression SSE1,2G,I vs SSE1,2G 4 Test for genotypic interaction
7 linear regression SSE1,2A,I vs SSE1A 2 Additional allelic effect of SNP2
8 linear regression SSE1,2G,I vs SSE1G 6 Additional genotypic effect of SNP2
9 linear regression SSE1,2,3A,I vs SSE0 7 Full additive test
10 linear regression SSE1,2,3A,I vs SSE1,2,3A 4 Test for allelic interaction
11 linear regression SSE1,2,3A,I vs SSE1,2,3A,I2 1 Test for 3-fold allelic interaction
12 linear regression SSE1,2,3A,I vs SSE1,2A,I 4 Additional allelic effect of third locus
M linear regression - - User defined linear regression (modelfile)
Table 4.2: Model table for quantitative traits
4.2.4.1 Tests (2 SNPs)

For two SNPs, comparison of L1,2A against L0 yields a full allelic test with 3 d.f. (TEST 3), while comparison of L1,2G against L0 yields a full genotypic test with 8 d.f. (TEST 4). In order to test for allelic interaction, one compares L1,2A,I against L1,2A (1 d.f., TEST 5) and in order to test for genotypic interaction one compares L1,2G,I against L1,2G (4 d.f., TEST 6). Furthermore, it is possible to test for the additional effect of SNP 2 while controlling for the effect of SNP 1 by comparing L1,2A,I against L1A (2 d.f., TEST 7), or, by comparing L1,2G,I against L1G (6 d.f., TEST 8). These tests are summarized in the model Table 4.1 and serve as INTERSNP standard tests that can called via their test number.

4.2.4.2 Tests (3 SNPs)

For three SNPs, likelihoods and tests can be formalized analogously. Here, we describe only allelic tests, but note that INTERSNP allows also explicit model specification, in particular genotypic 3-SNP tests. Let L1,2,3A = β0 + β1x1 + β2x2 + β3x3 be the 3-SNP allelic likelihood, let L1,2,3A,I2 = β0 +β1x1 +β2x2 +β1,2x1x2 +β3x3 be the 3-SNP allelic likelihood including the interaction term for SNPs 1 and 2 and let L1,2,3A,I = β0 + β1x1 + β2x2 + β1,2x1x2 + β3x3 + β1,3x1x3 + β2,3x2x3 + β1,2,3x1x2x3 be the allelic likelihood including all pairwise and the three-fold interaction term. Then, testing L1,2,3A,I against L0 yields a full 3-SNP allelic test with 7 d.f. (TEST 9), while testing L1,2,3A,I against L1,2,3A yields a test for two- and three-fold allelic interaction (4 d.f., TEST 10). Testing L1,2,3A,I against L1,2,3A,I2 yields a test for three-fold allelic interaction (1 d.f., TEST 11). Finally, testing L1,2,3A,I against L1,2A,I yields a test for the additional allelic effect of SNP 3 while controlling for SNPs 1 and 2 (4 d.f., TEST12).

4.2.4.3 X-chromosome

For X-chromosomal markers, all dominance and dominance interaction terms are ignored.

4.2.4.4 Covariates

All regression tests can be combined with up to 10 covariates (Options: COVARIATES, COVARIATEFILE). In particular, adjustment for population strata is possible.

4.2.4.5 User-defined regression models

Further tests using logistic regression can be specified via a modelfile. Choose M for the option TEST and specify the path to the modelfile (MODELFILE option). Include/exclude likelihood parameters with 1/0, respectively, as shown in the modelfile provided in the download area. Likelihood L1 is tested against likelihood L2.

4.2.5 Pre-tests

Linear and logistic regression allow flexible modeling and inclusion of covariates, but their computation can be time-consuming, mostly because of the operations on the relatively large design matrices involved. Hence, when a Genome-wide Interaction Analysis shall be conducted, it makes sense to, first, apply a simple, easy-to-compute, pre-test to all SNP pairs and, second, to apply the more sophisticated test only to those pairs for which the pre-test yields a p-value smaller than PRETEST_CUTOFF. A cutoff of 0.01 is a reasonable choice, since it will reduce the number of computations of the "proper" test by a factor of at least 100 on average, but guarantees that no "real" p-values smaller than 10-5 are missed. Since p-values in the range of 10-12 are required for a complete GWIA (Becker2011), the loss of pairs with p-values less significant than 10-5 will in most cases be irrelevant. Even a cutoff of 0.001 or 0.0001 for the pre-test would be acceptable. The only situation in which relevant SNP pairs may be missed occurs when there are covariates with an extreme impact on the p-value, simply because the pre-tests ignore covariates.

4.2.5.1 Pre-test allelic interaction (logistic regression)

When PRETEST 1 is selected in combination with TEST 3 (full allelic test, 3 d.f.) or TEST 5 (test for allelic interaction, 1 d.f.), a log-linear model with 1 d.f. (test for allelic interaction) is applied as a pre-test. We define the log-linear model in analogy to the log-linear model in the genotypic case (our TEST 2, section 4.2.2 ). There, the 3 × 3 × 2 genotype contingency table is considered. In the allelic case, we consider the 2 × 2 × 2 two-SNP-allele table, where the first dimension reflects the two alleles of SNP 1, the second dimension reflects the two alleles of SNP 2, and the last dimension reflects cases/control-status. Double heterozygote cases (controls) contribute 0.5 units to each cell of the 2 × 2 table of cases (controls). This practice is a simple work-around to treat double heterozygotes but renders the accompanying log-linear test conservative when its test statistic is evaluated with the χ2-distribution with 1 d.f. In this sense, the 1 d.f. log-linear test is not a proper test. However, its p-values are strongly correlated with the p-values of the test for allelic interaction in the logistic model, and, therefore, are well suited to serve as a pre-test, when the cutoff is chosen as suggested above.

4.2.5.2 Pre-test genotypic interaction (logistic regression)

In the genotypic case, the log-linear test (TEST 2, section 4.2.2) would be the obvious pre-test. Since this is a proper test in its own right, a useful strategy is to conduct a complete GWIA with TEST 2, produce a COMBILIST (section 2.2.6) with the best results, and run another test on the COMBILIST.

4.2.5.3 Pre-test genotypic interaction (linear regression)

When PRETEST 1 is selected in combination with TEST 4 (full genotype test, 8 d.f.) or TEST 6 (test for genotypic interaction, 4 d.f.), an ANOVA with 4 d.f. (test for genotypic interaction) is applied as a pre-test. Note that when the pre-test is combined with TEST 4, pairs with marginal effects in both SNPs but with just modest interaction may be missed.

4.2.5.4 Pre-test allelic interaction (linear regression)

When PRETEST 1 is selected in combination with TEST 3 (full allelic test, 3 d.f.) or TEST 5 (test for allelic interaction, 1 d.f.), an ANOVA with 1 d.f. (test for allelic interaction) is applied as a pre-test. Double heterozygotes are treated as described for the logistic regression, with the same consequences. Note that when the pre-test is combined with TEST 3, pairs with marginal effects in both SNPs but with just modest interaction may be missed.

Chapter 5
Case-only analysis

Case-only (or case-series, case-control without controls) design is based on a sample consisting of cases only. It is an efficient approach to test for gene-environment or gene-gene interaction in disease etiology, when there is a strong reason to believe that particular assumptions are fulfilled. Concerning a test for gene-environment interaction it is necessary to assume that the genetic type and the exposure occur independently in the population. A test for gene-gene interaction is reasonable to perform assuming that the frequencies of genetic variants are independent in the population or, equivalently, that the genes/SNPs under study are in linkage equilibrium.

Here, we accept that the genetic interaction between two genetic factors (e.g. genes, SNPs, alleles) is present if the joint effect of two factors on a disease differs from the additive combination of independent effects from both of them. The joint effect of two genetic factors is the effect due to the presence of both of them. The independent effect of one of the factors is its effect on a disease in the absence of the other factor. Sometimes the term "epistasis" is used instead of genetic interaction in the literature. The same definition is valid for the gene-environment interaction, where one of the genetic factors is replaced by the exposure.

From now on only the gene-gene interaction is addressed.

If the data satisfy the assumptions discussed above, then the so called case-only test for genetic interaction can be performed by using a χ2- test of independence between genotypes (with four degrees of freedom) or between alleles (with one degree of freedom) as well as logistic regression models. Notably, the case-only analysis leads sometimes to a more valid estimation of interaction than the corresponding case-control analysis, i.e. case-only based tests possess a higher power, see e.g. (Cordell2009), (Gatto et al.2004), (Piegorsch et al.1994), (Yang et al.1999).

INTERSNP suggests several tests to check for the gene-gene interaction, where the genetic factors are chosen to be a pair of SNPs (SNP-SNP interaction). Starting from a pair of SNPs, the allelic interaction test can be performed as well.

IMPORTANT: In our study it was observed that the LD distance between two SNPs has to be at least 20 Mb (20000000 bp) in order to satisfy the assumption of linkage equilibrium. The recommended LD distance is even bigger and equals 50 Mb.

5.1 Case-only allelic interaction test (TEST 15)

TEST 15 in INTERSNP is asymptotically equivalent to the case-only allelic interaction test in whole-genome analysis package PLINK and based on a χ2-test of independence between two alleles with one degree of freedom, see e.g. (Cordell2009), (Purcell et al.2007). For this test the given SNPs data are organized in a 3 × 3 table of genotypic counts, which is in turn reduced to the 2 × 2 table of allelic counts.

5.2 Case-only SNP-SNP interaction test (TEST 16)

TEST 16 in INTERSNP suggests to check for SNP-SNP interaction in the case-only mode. TEST 16 is based on a χ2-test of independence between two SNPs with four degrees of freedom, where the 3 × 3 table of genotypic counts is used. Let xij, i,j ∈{1,2,3} be the observed counts. Denote x+j = i=13xij, j ∈{1,2,3}, xi+ = j=13xij,i ∈{1,2,3}, and x++ = i,j=13xij. Expected counts are determined by

      { #########
        #xi+xx+++j,##ifx++  ⁄= 0,
mij =        0,  ifx++  = 0,
for all i,j ∈{1,2,3}. The test statistic T is defined by
      ∑3
T = 2     xij log xij,
      i,j=1       mij
where only positive counts xij,mij are included. The situation, when some of the observed counts are zero, may course a decrease of the degrees of freedom in the corresponding χ2-test. The same is true for X-chromosomal SNPs in males. TEST 16 uses all cases for analysis.

5.3 Selection file for case-only analysis

An example of a selection file to perform the case-only TEST 15 with INTERSNP is given below. Analogously, a selection file for the TEST 16 can be composed.

TPED  ./test.tped // path to the tped-File 
TFAM  ./test.tfam // path to the tfam-File 
LD_DISTANCE 50000000 // distance between two SNPs, in base pairs (bp) 
SINGLE_MARKER  1      // Armitages trend test 
TWO_MARKER    1      // two-marker analysis: 1=yes 
TEST  15  // case-only allelic interaction test 
SINGLETOP      1000  // top 1000 single-marker results 
PRINTTOP      100    // best 100 multimarker p-values are printed 
OUTPUTNAME  ./myproject/test  // specify the output name 
END

Chapter 6
Pathway association analysis

Pathway association analysis (PAA) tests for an excess of moderately significant SNPs in genes from a common pathway. Here, we present a Monte-Carlo simulation framework that allows to formulate the main ideas of existing PAA approaches:

Our paper "Integrated Genome-Wide Pathway Association Analysis with INTERSNP" (Herold et al.2012) describing the test in more details.

In order to test for association of a pathway with a phenotype the following parameters have to be specified. See also the example selectionfile "selectionfile_S6.txt".

PATHWAYANALYSIS 1 indicates that pathway association analysis shall be carried out. The keyword PATHWAYFILE has to be followed by the path to the pathwayfile that shall be analyzed, for instance the KEGG pathwayfile "KEGG_2_snp_b129.txt.ann" that can be found in the download section of the homepage.

The various pathway scores are constructed from all SNP with a p-value smaller than a pre-specified p-value p*. <singletop> has to be chosen accordingly. For instance, if p* = 0.05 is the desired threshold for SNPs to be included into the pathway score and if 500000 QC-SNPs are in the .tped file, then 500000 * 0.05 = 25000 is the appropriate value for <singletop> (SINGLETOP 25000). Of course, the data set may contain more SNPs with a p-value smaller then 0.05, but fixing the number 25000 instead protects against inflation of type I error due to residual inflation of the single-marker test statistics. Note that all single-marker tests can be combined with the pathway analysis. In particular, quantitative traits and covariates can be handled.

6.1 Pathway scores

Currently 5 different pathway scores are available, identified by the numbers 1 to 5.

6.2 Simulation

Significance of the individuals’ pathways as well as after adjustment for testing multiple pathways is determined via Monte-Carlo simulations based on genome-wide permutation of case-control status or quantitative trait value. At least 3500 simulations are the minimum requirement to get experiment-wise significant results with the KEGG pathways. Rather then choosing 1000 or 10000 as the number of replicates we recommend 999 or 9999, respectively.

6.3 Selection file PAA

Here is an example of a selection file to perform the PAA with PATHWAYTEST 1. The selection files for PATHWAYTEST 2,3,4,5 are analogous.

TPED  ./test.tped // path to the .tped file 
TFAM  ./test.tfam    // path to the .tfam file 
PATHWAYFILE    ./KEGG_2_snp_b129.txt  // path to the pathway file 
SINGLE_MARKER  1 // Armitages trend test 
SINGLETOP   25000 // top 25000 single-marker results 
PATHWAYANALYSIS  1  // pathway association analysis 
PATHWAYTEST  1  // pathway analysis with Fisher score 
SIMULATION  999  // number of permutations 
OUTPUTNAME  ./myproject/test  // specify the output name 
END

Selection file for the interaction-Ratio (PATHWAYTEST 6)

TPED  ./test.tped // path to the .tped file 
TFAM  ./test.tfam    // path to the .tfam file 
PATHWAYFILE    ./KEGG_2_snp_b129.txt  // path to the pathway file 
SINGLE_MARKER  1 // Armitages trend test 
TWO_MARKER    1      // two-marker analysis: 1=yes 
TEST  2  // log-linear model, test for genotypic interaction 
PATHWAYANALYSIS  1  // pathway association analysis 
PATHWAYTEST  6  // pathway analysis with Fisher score 
P_INTERRATIO 0.05  // p-value threshold for the interaction pairs 
SIMULATION  999  // number of permutations 
OUTPUTNAME    ./myproject/test  // specify the output name 
END

Chapter 7
INTERSNP-RARE:
Genome-Wide Rare Variant Analysis

7.1 Introduction

INTERSNP-RARE is an extension build upon INTERSNP’s framework for genome-wide, permutation-based rare variant analysis. Currently, three rare variant testing procedures are implemented: Fisher-Rare - FR. based on the Fisher combination test (Fisher1925), COLLapsing test - COLL, (Li and Leal2008) and the Cumulative Minor Allele Test - CMAT, (Zawistowski et al.2010). All tests are extended to their variable-threshold (VT) counterparts (keyword OPTIMAL). Whether a SNP is rare is determined by the parameter to the keyword RARE, which is interpreted as the threshold MAF for defining ”rareness”. MAF is calculated from the whole sample, irregardless of individuals’ affection status. In the case of a zero MAF, the SNP in question is homozygous and is not considered to be rare. In the literature, common values for defining rare SNPs are 0.01 or 0.05, though this may depend on the particular dataset.

Statistical testing of disease association is conducted on rare variants in physical proximity (”bins”). Thus, a physical position is required for each marker to be analyzed. The choice of a particular binning is not straightforward and has to be chosen taking various properties of the dataset into account. In INTERSNP-RARE, extensive support for bin assignment is provided.

Prior to elaborating on the details, it is necessary to fix our terminology.

Bin: a subset of discrete SNPs in physical proximity, taken from the set of all SNPs in a study. Bins can be created algorithmically from genotype data. Automatic binning can be performed either by number of SNPs (BINSIZE_SNP), by number of rare SNPs (BINSIZE_RARE) or by size of the region in base pairs (BINSIZE_DIST).

Interval: a continuous region of a chromosome. Intervals are uniquely defined by a chromosome number, a start and an end position. Intervals may represent features like LD blocks or genes. Intervals are independent of any particular study data and are oblivious to any peculiarities therein. If intervals are provided, bins are created by projection of intervals on a dataset.

Intervals are supplied by an external file (keyword INTERVAL_IN). The maximal size for an interval is a whole chromosome. Pseudoautosomal regions and sex chromosomes are ideally treated as independent chromosomes.

Support for overlapping and non-overlapping bins and intervals is provided.

An overview of keywords for controlling rare variant testing are listed in the next section.

7.2 Options

Keyword

Example

Description

RARE

0.05

Definition of rare MAF

OPTIMAL

1

1 for variable threshold analysis (VT)

MAF_ADJUST

1

1 (default): MAF is calculated from expected, (not observed) number of alleles. No effect if SNP missing rate=0.

CHR_BOUNDARIES

chromosome_coord_hg19.txt

File with start and end positions of chromosomes and PARs

BINSIZE_SNP

4000

Number SNPs per bin (rare and common)

BINSIZE_RARE

50

Number rare SNPs per bin

BINSIZE_DIST

800000

Physical distance in bp per bin

RARE_TESTS

COLL;CMAT;FR;

Tests to be conducted. Default: all

WEIGHTS

1/SD | BETA;1;25; | LOGISTIC;0.07;150;

Weights of variants for non-collapsing tests

LAMBDA_ADJUST

1

Correction for genome-wide FR testing

RARE_PRE_PRETEST

0.0001

Pretest before simulations are conducted: eliminate bins below (conservative) threshold

RARE_PRETEST

100  0.1

Pretest for rare variant analysis. Takes two arguments: number of simulations and p-threshold for passing the pretest

INTERVAL_IN

Ensembl_Genes69_hg19.txt

File containing intervals

INTERVAL_FORMAT

[Semicolon-separated list]

Formatting of the interval file

HEAD=”1”;

Number of header lines

SEPARATOR=”\t”;

Field separator

COLUMNS=”5”;

Number of columns

CHR=”1”;

Column with chromosome number

START=”2”;

Column with interval start positions

END=”3”;

Column of interval end positions

FEATURE=”4”;

Column containing ”feature”

DESCRIPTION=”5”;

Column containing ”description”

MERGE_INTERVALS

1

1 for merging of overlapping input intervals

FLANKING

1000

Flanking of intervals in bp

CONCATENATE_INTERVALS

0

Number of consecutive intervals to be concatenated

FILL_GAPS

1

For a non-overlapping set of intervals: assign uncovered space to intervals

FILTER_SMALL_BINS

1

Filter bins: minimum required number of rare SNPs

SPLIT_LARGE_BINS

1

Filter bins: maximim allowed number of rare SNPs; bins will be split

7.3 Statistical Tests

The keyword RARE_TESTS allows to choose tests to be conducted from one or more of the following: FR;CMAT;COLL;. The list has to be separated by semicolons. By default, all tests are selected.

7.3.1 FR

The implemented Fisher-Rare test is based on Fisher’s combined probability test. The test statistic is determined by combining the N single-marker p-values of N rare variants in a particular bin

          N
T   =  - 2 ∑ w log(p).
 F R           i    i
          i=1

In our implementation, all weights are wj = 1 by default (see 7.4 on custom weights).

For fixed threshold analysis, the p-value is determined from the statistic by permutation of the case-control status (or quantitative trait), while keeping genotype data unchanged.

For independent p-values, the overall p-value may be determined via χ2-distribution with 2N degrees of freedom. Due to linkage disequilibrium (LD), the pi are not independent, which leads to the χ2-distribution being an anticonservative approximation. However, the χ2-distribution is useful for finding the optimal rare MAF by raising the threshold frequency and determining the overall p-value with the new set of rare SNPs, and taking the MAF for which the p-value is minimal. The ”true” p-value is calculated by permutation (see 7.5).

Setting the parameter of the keyword LAMBDA_ADJUST to 1 instructs INTERSNP to normalize singlemarker-pi to be normalized by dividing corresponding test statistics by the inflation factor. Use this option only with a large number of SNPs, i.e. in genome-wide association studies.

7.3.2 CMAT

CMAT, the cumulative minor allele test (Zawistowski et al.2010) utilizes the weighted sum (with weights wj) over F minor allele frequencies of rare variants in NA cases and NU controls

                                             2
TCMAT  =  --NA-+-N∑U----×  -(mAMU-----mU-MA--)--,
          2NANU    j wj   (mA + mU )(MA  + MU )

where mA, mU, MA and MU are weighted minor and major allele counts in affected and unaffected individuals

      NA  F
m   = ∑  ∑   w X
  A           j  ij
      i=1j=1

      N∑U ∑F
mU  =        wjYij
      i=1 j=1

       N   F
      ∑ A ∑
MA  =        wj(2 - Xij)
       i=1 j=1

      ∑NU ∑F
MU  =        wj(2 - Yij).
      i=1 j=1

Here, Xij and Y ij are the number of minor alleles at the jth variant position for the ith case and control, respectively.

The p-values for the CMAT test are determined by permutation.

7.3.3 COLL

The collapsing test COLL (Li and Leal2008) dichotomizes individuals by the presence of any rare variant in a given bin

           (N   + N  )× (XN   - Y N  )2
TCOLL =  -----A----U--------U-------A-----,
         NANU  (X + Y )(NA  + NU - X  - Y )

where X is the number of cases carrying a rare variant:

     N∑A
X  =    Xi.
     i=1

Xi are indicator variables for the ith case:

     {
Xi =   1 if wjXij > 0 for any 1 ≤ j ≤ F
       0 otherwise,

and analogous definitions of Y , Y i for controls.
Alternatively, the test statistic can be expressed as

               (NA + NU  )(N CNU - N C NA )2
TCOLL  = -------(--C----C-)(A--------U---C-----C)
         NANU    NA  + NU   NA  + NU - N A - N U

The p-values of the COLL test are determined by the χ2-distribution with one degree of freedom in case of a fixed threshold and by permutation in case of a variable threshold.

7.4 Weights

The weights (default: wi = 1) can be used in non-collapsing tests to weight alleles by a pre-determined scheme that mirrors our estimation of their risk.

Currently, three weighting schemes are supported (see also Wu et al. (2011) and references therein).

1/SD. The weights are reciprocal to the standard deviation of allele counts

w =  ∘--------1---------.
 i     M AFi (1 - M AFi )

BETA;a;b;. The weights are proportional to the Beta distribution (default: a = 1, b = 25)

w =  M AF a-1(1 - M AF  )b-1 .
 i        i            i

LOGISTIC;a;b;. Instead of strongly upweighting rare variants, logistic weights are rather a ”soft threshold” alternative to a fixed MAF threshold (default: a = 0.07, b = 150)

            1
wi = -----(MAFi-a)b.
     1 + e

The weights of rare variants are approximately 1 and for common variants 0.

7.5 Variable Threshold

The Variable Threshold (VT) analysis is activated using the keyword OPTIMAL. The extension to the variable threshold model for COLL and CMAT tests is motivated by the principle described in (Price et al.2010).

COLL, CMAT: with NSIM permutations, calculate Tmaxj for each value of MAFij MAFT contained for each permutation j. The proportion of those Tmaxj that are larger than the non-permuted Tmax0 is used to determine the corresponding p value:

              #{T ′jmax > T0  }
pCOLL,CMAT  = -----------max--.
                   NSIM
(7.1)

FR: The VT approach requires a modification in case of FR test. Since the test statistic depends on a different number of p values at different MAFi levels, they can not be directly compared. TFR,MAFi is also monotonically increasing with the value of MAF. The test statistic is thus not comparable in variable threshold mode. A multiparametric function is required to map different TFR on an interval taking the number of variants into account, that allows a comparison operator.
One of the suitable functions that maps TFR and allows comparison is the χ2 distribution with 2m degrees of freedom. Although due to LD the distribution is, in general, anticonservative, it can be used since the absolute pEst enter the calculation for being used for comparison (”<” and ”>”), at different MAF levels only.

TFR=p (the χ2-distribution with 2Ndf) instead of TFR is used to obtain optimal threshold. The p-value is determined by calculating the fraction of smaller minimal test statistics TFRacross permutations and MAF levels:

       #{Tm′jin-<-Tm0in}-
pF R =     NSIM
(7.2)

The FR method is therefore similar to COLL and CMAT VT-methods, except that maximal test statistics are compared at different MAF levels. Small test statistics are considered to be more improbable, which leads to a change of sign in Eq. (7.2) .

7.6 Odds Ratios

Both variable and fixed threshold analyses of COLL and CMAT return ”odds ratios” based on their respective contingency tables. They are useful for determining the direction of effect.

7.7 Rare Pretests

In a permutation framework, computational time and hardware requirements can be significantly reduced by conducting the full number of permutations only for bins that have a chance to come close to the significance level. This is achieved by stepwise analysis.

The first step (RARE_PRE_PRETEST) takes place before any permutations are conducted. We use the fact that the p-values obtained from test statistics TCOLL,TCMAT and TFR for the real dataset on a given bin are rather anticonservative, either due to LD (FR and CMAT) or due to small cell counts (COLL). Choosing a suitable passing threshold eliminates the majority of tests to be conducted.

To reduce computational time even further, we offer an implementation of a ”rare pretest” using a smaller number of simulations. The keyword RARE_PRETEST takes two parameters: an integer number of simulations, after which the passing condition is checked, and a floating point number equal to the threshold p-value for passing the pretest. After the given number of simulations, each window and each individual test is checked for the passing condition.

If the condition for pretests are not satisfied, the p-value will be marked with an asterisk (*) in the output file. All tests are pre-tested independently.

7.8 Advanced Binning Methods

7.8.1 User-Supplied Intervals

In addition to algorithmic binning procedures, extensive functionality for generating and modifying bins created from user-supplied intervals is provided. Bins based on functional classes or statistical characteristics can be generated from a multitude of sources: LD intervals, conserved elements, gene annotation files and more. This set of features is also helpful for validating bins in a different dataset.

Intervals are supplied by the user in a file, one interval per line, by keyword INTERVAL_IN. Obviously, INTERVAL_IN and the three algorithmic binning methods are mutually exclusive.

We provide an intervalfile (Ensembl_Genes69_hg19.txt) based on a list of coding genes imported from Ensembl using BioMart (http://www.ensembl.org/biomart/martview). The file was created using attributes ”Chromosome Name”, ”Gene Start (bp)”, ”Gene End (bp)”, ”Associated Gene Name”, ”Description”, and exported in the *.tsv format. The resulting file was sorted with respect to chromosome name and gene start.

The formatting of the interval file is defined in a string INTERVALFILE_FORMAT. Required arguments are COLUMNS, CHR, START and END, which give the overall number of columns, and the columns containing start and end positions as well as chromosome names in the file. Columns are counted starting from one. INTERSNP accepts different chromosome naming conventions, for instance ”CHR26”, ”chrMT”, ”M” will be recognized as chromosome 26.

Two additional fields from an interval file can be specified by the user, first of which is FEATURE. A feature is handled as a character string and serves the purpose of tracking intervals more easily. A feature might be a gene name or name of an LD block. The feature column is carried all the way through the calculation and will be appended to the output file. In case that intervals undergo modification, the feature column is modified as well. For instance, when consecutive fields are merged or concatenated, their respective features are put in one field. The second optional field is DESCRIPTION. As the name suggests, it is a character string describing the feature, for example gene function. In contrast to the feature column, the description will appear in the output file only if intervals are not merged or concatenated.

Other optional arguments to INTERVAL_FORMAT include HEAD for the number of header lines to be skipped. SEP defines the field separator. Currently, tab ("), space ("SPACE") and comma (",") are supported as field separators.

The correct interval file format string for the provided file (Ensembl_Genes69_hg19.txt) is

HEAD=”1”;SEPARATOR=”’  ’;COLUMNS=”5”;CHR=”1”;START=”2”;END=”3”;FEATURE=”4”;DESCRIPTION=”5”;

7.8.2 Chromosomal Boundaries

It is advised to provide physical limits of chromosomes in a tab-delimited file consisting of chromosome name, start and end position when using an interval file. Positions of pseudoautosomal regions are provided using contig names PAR1_X, PAR2_X, PAR1_Y and PAR2_Y, using X- and Y-chromosomal coordinates, respectively.

This will provide consistency checks for input intervals. Besides, it will guarantee correct handling of pseudoautosomal regions. Some genomic features are known to span pseudo- and non-pseudoautosomal regions of sex chromosomes (XG gene, for example). These features will be split and handled separately. Coordinates of features on the pseudoautosomal region of the Y chromosome will be transformed to X-coordinates and put on chromosome 25. Providing coordinates of pseudoautosomal regions will also ensure separate handling of both PARs.

We provide two coordinate files for build 36 and 37:

chromosome_coord_hg18.txt

chromosome_coord_hg19.txt

The files were manually created using the Ensembl genome browser. Due to complexities, the conventions for beginnings and ends of chromosomes may differ. We choose them as start- end endpoints of the first and last canonical positions of euchromatic chromosomal bands, as referenced in the Ensembl database.

7.8.3 Interval Modification

7.8.3.1 Merging of Input Intervals

In case that some of the intervals from the input file are overlapping, setting MERGE_INTERVALS to 1 will instruct INTERSNP to merge them. In case that intervals have identical coordinates, the default behaviour is to dismiss all but the first of these intervals.

7.8.3.2 Flanking

Providing a parameter to FLANKING will instruct INTERSNP to extend intervals in both directions by that many base pairs. If a CHR_BOUNDARIES file is provided, limits given therein will be respected.

7.8.3.3 Fill Gaps

If the intervals do not cover the genome in a continuous way, some of the markers in a dataset may be found in the space between intervals and will be dismissed from rare variant analysis. If the provided intervals are non-overlapping or have been merged, this problem can be countered by setting FILL_GAPS to 1. The space in between intervals will be divided in half and each part will be assigned to the nearest interval.

7.8.3.4 Concatenating Intervals

In case that the provided intervals are found to be too small, one can set CONCATENATE_INTERVALS to the number of consecutive intervals to be concatenated into one interval.

7.8.4 Size of Bins: Quality Control

7.8.4.1 Filter Small Bins

Some bins may cover only few rare variants, especially when intervals are small and data sparse. In that case, INTERSNP will unnecessarily process useless bins. In addition, the number of bins will reduce power due to correction for multiple testing. This can be countered by setting FILTER_SMALL_BINS to the minimum number of rare variants per bin. Bins not satisfying this condition will be simply dismissed.

7.8.4.2 Maximum Number of Rare Variants

Some bins may cover hundreds or thousands of rare variants. It is advised to set a maximal number of rare variants per bin using the keyword SPLIT_LARGE_BINS to prevent disproportionately high computational burden, in particular when using variable-threshold models.

Bins that are found to contain more rare SNPs than the given limit will be split in bins containing the same number of rare variants, if possible.

7.9 Imputation of Minor Allele Frequencies for Missing Genotypes

MAF_ADJUST is a keyword that controls a simple recalculation of MAFs in SNPs with a nonzero missing rate. The default value of the parameter is 1.

The correction is intended to counter the following problem: the smallest possible MAF for a non-monoallelic SNP is 1∕N, where N is the number of observed alleles. If M alleles are missing, the value of the MAF is somewhat larger. Thus, rare SNPs with same allele counts may have slightly different MAFs. In particular, the value of the observed minor allele frequency of an autosomal SNP i with Bi observed minor allele counts and Mi-
 2 missing genotypes in a dataset with

N  = A  + B  + M
       i    i    i

allele counts and N-
 2 individuals is given by

         ---Bi--   ---Bi---
M  AFi = Ai + Bi = N  - Mi .

If Mi = 0, the set of possible MAFs is given by

M AF   ∈ M = { 1-, 2-,..., BT-|B ≤ M AF  × N }.
     i         N  N     N    T        T

With Mi0, the cardinality of the possible MAFs can be much larger due to different possible values created by the denominator (N - Mi)N. To transform the MAF of an observed allele to the MAF of expected minor allele counts one obtains the expected minor allele counts by

         (        )
B′=  floor  Bi-×-N-- = ⌊-Bi ×-N-⌋,
 i         N - Mi      N  - Mi

where

floor(x) = ⌊x ⌋ = n ∈ ℕ , so that n ≤ x < (n+ 1).

The adjusted MAFs are obtained from

     ′   B′i    1  Bi × N
M AF i = N--= N-⌊N----M--⌋.
                        i
(7.3)

For rare variant analysis of data containing missing genotypes, it is usually useful to reduce the number of possible values of MAFi. This requires two assumptions: first, that genotyping errors do not correlate with the true allele, and that the missing rates MRi and minor allele frequencies MAFi are small enough. In this case, one can adjust the corresponding MAFi as

     ′   Bi′  Bi-
M AF i = N  = N   for ’small’ Mi.
(7.4)

This is the correction used by MAF_ADJUST 1 for (pseudo-)autosomal chromosomes with analogous equations for sex chromosomes and mitochondrial DNA. It can be seen from (7.3) that the condition ’small’ MAFi and MRi can be derived from the condition that the adjusted number of counts Biis greater than Bi by n allele counts:

          1
M Ri <  ---Bi-
        1+  n

In other words, if the condition

M  < --N---,M  ,B  ∈ ℕ
  i  1 + Bi   i   i

is satisfied, the expected counts of minor alleles coincide with the number of observed alleles and (7.4) holds, otherwise it is probable that some of the missing genotypes contain minor alleles.

Note that we use the corrected MAF only for the purpose of reducing levels in variable threshold analysis (7.5) and calculation of weights (7.4).

7.10 Output

7.10.1 Association Testing Results: *Rare.txt

Test results and additional information is written to the *Rare.txt file, which is a fully compatible interval file.

Columns without relevance to the particular analysis are omitted.

7.10.2 Modified Interval File: *_modified.txt

The quality-controlled, modified, sorted intervalfile (INTERVAL_IN) is written to a new interval file in the working directory.

Bibliography

    Armitage P. (1955) Tests for linear trends in proportions and frequencies. Biometrics 11(3): 375–386.

    Becker T., Herold C., Meesters C., Mattheisen M., Baur M.P. (2011) Significance levels in genome-wide interaction analysis (GWIA). Ann. Hum. Genet. 75(1): 29–35.

    Bishop Y.M., Fienberg S.E., Holland P.W. (2007) Discrete Multivariate Analysis. Theory and Application. Springer, Berlin.

    Clayton D. (2008) Testing for association on the X chromosome. Biostatistics 9(4): 593–600.

    Cordell H. and Clayton D. (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am. J. Hum. Genet. 70(1): 124–141.

    Cordell H.J. (2009) Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 10(6): 392–404.

    Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh). ISBN 0-05-002170-2.

    Gatto N.M., Campbell U.B., Rundle A.G., Ahsan H. (2004) Further development of the case-only design for assessing gene-environment interaction: evaluation of and adjustment for bias. Int. J. Epidemiol. 33(5): 1014–1024.

    Ge Y., Dudoit S., Speed T.P. (2003) Resampling-based multiple testing for microarray data analysis. Test 12(1): 1–77.

    Herold C., Steffens M., Brockschmidt F.F., Baur M.P., Becker T. (2009) INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics 25(24): 3275–3281.

    Herold C., Mattheisen M., Lacour A., Vaitsiakhovich T., Angisch M., Drichel D., Becker T. (2012) Integrated genome-wide pathway association analysis with INTERSNP. Hum. Hered. 73(2): 63–72.

    Holmans P., Green E.K., Pahwa J.S., Ferreira M.A., Purcell S.M., Sklar P., Wellcome Trust Case-Control Consortium, Owen M.J., O’Donovan M.C., Craddock N. (2009) Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85(1): 13–24.

    Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K.F., Itoh M., Kawashima S., Katayama T., Araki M., Hirakawa M. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34: 354–357.

    Li B., Leal S.M. (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83(3): 311–321.

    O’Dushlaine C., Kenny E., Heron E.A., Segurado R., Gill M., Morris D.W., Corvin A. (2009) The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25(24): 2762–2763.

    Piegorsch W.W., Weinberg C.R., Taylor J.A. (1994) Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat. Med. 13(2): 153–162.

    Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81(3): 559–575.

    Price, AL., Kryukov, GV., de Bakker, PIW., Purcell, SM., Staples, J., Wei, LJ., Sunyaev, SR. (2010) Pooled Association Tests for Rare Variants in Exon-Resequencing Studies Am. J. Hum. Genet. 86(6) 832-838

    Wang K., Li M., Bucan M. (2007) Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81(6): 1278–1283.

    Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89:82–93.

    Yang Q., Khoury M.J., Sun F., Flanders W.D. (1999) Case-only design to measure gene-gene interaction. Epidemiology 10(2): 167–170.

    Zawistowski M., Gopalakrishnan S., Ding J., Li Y., Grimm S., Zöllner S. (2010) Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am. J. Hum. Genet. 87(5): 604–617.