Introduction

In the twenty-first century, the biomedical and health sciences have seen a movement from the study of rare monogenic disorders to common, multifactorial diseases,1 with high-throughput technologies enhancing the dissection of the underlying complex aetiological processes.

Elucidation of multifactorial disease aetiology is challenging because these diseases are, as a rule, causally heterogeneous, and they are driven by a large number of small, additive or synergistic effects, representing consequences of genetic predisposition, lifestyle and the environment. Success in revealing these complex interactions will depend critically on the availability of large-scale documented, up-to-date epidemiological, clinical, biological and molecular sources of information.2, 3, 4, 5, 6, 7, 8

Infrastructures for large-scale population-based research are a focus in the EU sixth (FP6) and seventh (FP7) framework programmes and the European Strategy Forum on Research Infrastructures (http://cordis.europa.eu/esfri/) preparatory phase. As European countries have strong national health-care systems and have accumulated epidemiological data over decades, often complemented with biological samples, it is important that the process of harmonising data and methods takes place within the region and is coordinated with similar efforts in other parts of the world.

Harmonisation in the use of human samples and subject-specific information from large population groups must cover guidelines for study design and statistical analysis. Besides theoretical advances in statistical inference, computational statistics and algorithm development per se, we see a flood of methodological development in statistical genetics and genetic epidemiology in leading journals. Pieces of software, often in the form of standalone programs, are found on individual web pages and/or collected in archives such as http://linkage.rockefeller.edu/soft/list.html. Although good search engines are generally available on the internet, practising clinicians and geneticists, as well as non-expert statisticians and bioinformaticians, find it difficult to absorb and make use of the rapid methodological development. This in turn results in a suboptimal use of resources.

In 2002, the national biobank programme in Sweden (www.wcn.se) initiated an internet-based statistical genetics information portal, GENESTAT, to provide tutorials, reviews and a discussion forum, related to genetic association studies, with links to key websites and computer programs for the analysis of genetic data.

During 2007, a European GENESTAT working group was assembled with the aim of overseeing content, providing an editorial function and making the portal known. This paper by the working group is one such step. The GENESTAT portal has been given its own domain (www.genestat.org), a number of new entries have been added and, most importantly, the portal is set up in an interactive mode, in the hope of attracting interest and attention not only from users but also from experts and methods developers.

This paper introduces the GENESTAT information portal version 2.0 (www.genestat.org), which is the first version put forward to the broad international scientific community. The ultimate aim of GENESTAT is to address design and statistical analysis issues in genetic epidemiology. As a first step, this paper and version 2 of GENESTAT focus on genetic association studies, both candidate gene studies and genome-wide association studies. The engine is a Wikipedia style of information exchange to allow the best mix of sustainability and up-to-date quality. Initially, GENESTAT supports free entries and editing, with only weak supervision. Stricter supervision may be considered later. Each GENESTAT entry is structured to provide a broad overview, with references to key papers and software. In the next section, we give examples of GENESTAT entries and end the paper with a discussion of possible extensions, and with an invitation to methods and software developers to add information to the portal and to edit existing entries.

Current GENESTAT entries

GENESTAT is targeted towards researchers conducting and analysing data from genetic association studies. The portal includes two sections: ‘Genetic Association Studies’, with subsections ‘Planning’, ‘Quality Control’, ‘Population Stratification’ and ‘Testing and Estimating Association’; and ‘Statistical Modelling’, which is divided into ‘Modelling Genotypic Information’, ‘Pathways’, ‘Replication’, ‘Meta-analysis’ and ‘Mendelian Randomisation’. Navigation in the portal is simple, and the Wikipedia-like structure allows the available information to be augmented using simple editing tools.

To convey the flavour of the portal, we give a short introduction to the current main sections of GENESTAT. The actual entries are found on www.genestat.org.

Before genotyping

This section in GENESTAT discusses questions in genetic study design. These include elaboration of the underlying biological mechanism and of the structure of the study population, the choice of markers and phenotype, and the choice between family-based designs and population-based designs with independent individuals.

http://www.genestat.org/index.php?n=GeneStat.PlanningStage

Genotype data quality control

This section gives guidance on procedures for gender checks and relatedness checks and on quality control based on call rates and Hardy–Weinberg equilibrium, and discusses combining data across different studies and platforms.

http://www.genestat.org/index.php?n=GeneStat.GenotypingQualityControl
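As an illustration of the kind of marker-level check described in this section, the following minimal sketch computes a call rate and a one-degree-of-freedom chi-square test of Hardy–Weinberg equilibrium from genotype counts. The function names and the use of `None` for a missing call are assumptions made for this example and are not taken from the portal. The chi-square approximation can be poor for rare alleles, where an exact test may be preferable.

```python
import math


def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(c is not None for c in calls) / len(calls)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi-square, p-value) for a 1-df Hardy-Weinberg goodness-of-fit test.

    Arguments are the observed counts of the three genotypes at a biallelic marker.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic marker: nothing to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

In practice, markers would be flagged when the call rate falls below, or the Hardy–Weinberg p-value falls below, thresholds chosen by the analyst; no particular threshold is implied here.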

Population stratification

A thorough section on population stratification discusses genetic confounding caused by the underlying population structure and potentially leading to both false-positive and false-negative results in genetic association studies. This section also presents the current methods for solving the problem in both candidate gene studies and in genome-wide association studies.

http://www.genestat.org/index.php?n=GeneStat.PopulationStratification
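One widely used correction in genome-wide studies is genomic control, in which one-degree-of-freedom test statistics are rescaled by an inflation factor estimated from their median. The sketch below is a simplified illustration under that assumption; the function names are hypothetical, and the choice to bound the inflation factor below by 1 is a common convention rather than something prescribed by the portal.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Estimate the inflation factor lambda as median(statistics) / expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide each 1-df chi-square statistic by lambda, with lambda bounded below by 1."""
    lam = max(1.0, genomic_inflation(chi2_stats))
    return [x / lam for x in chi2_stats]
```

A value of lambda close to 1 suggests little inflation, whereas larger values may indicate stratification or other systematic bias.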

Testing and estimating association

The largest section in GENESTAT describes association testing and estimation under different study designs and for different kinds of phenotypes. It discusses testing for single-marker associations, haplotypes and interactions, as well as model selection procedures. More advanced topics are also presented, such as controlling for multiple testing, modelling associations in copy number variation and power comparisons between different tests.

http://www.genestat.org/index.php?n=GeneStat.TestingAndEstimatingAssociation
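To make the single-marker case concrete, the sketch below shows one simple instance of association testing: an allelic chi-square test on a 2 x 2 table of allele counts in cases and controls, together with the odds ratio and a Bonferroni-adjusted significance threshold. It is a minimal example under the assumption of a biallelic marker and independent individuals; the function and argument names are hypothetical.

```python
import math


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Return (chi-square, p-value, odds ratio) for a 2 x 2 table of allele counts.

    case_a / case_b: counts of alleles A and B among cases;
    ctrl_a / ctrl_b: counts of alleles A and B among controls.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    margins = (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    if margins == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / margins
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    denom = case_b * ctrl_a
    odds_ratio = (case_a * ctrl_b) / denom if denom else math.inf
    return chi2, p_value, odds_ratio


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests
```

For example, `allelic_test(60, 40, 40, 60)` gives a chi-square statistic of 8 and an odds ratio of 2.25; whether this is declared significant depends on the number of markers tested.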

Modelling genotypic information

This section discusses more advanced topics on structuring genotypic information beyond single-marker analyses. In particular, methods for haplotype estimation, identification of haplotype blocks, measures of linkage disequilibrium and methods for capturing most of the genetic variation in a gene through tag SNPs are discussed.

http://www.genestat.org/index.php?n=GeneStat.MeasuringLinkageDisequilibriumAndHaplotypeEstimation
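The standard pairwise measures of linkage disequilibrium can be sketched as follows for two biallelic markers, assuming that the four haplotype frequencies are already known or have been estimated. The function name and argument order are assumptions made for this example.

```python
def ld_measures(f_ab_upper, f_a_upper_b_lower, f_a_lower_b_upper, f_ab_lower):
    """Return (D, D', r^2) from the haplotype frequencies (AB, Ab, aB, ab).

    The four frequencies are expected to sum to 1.
    """
    total = f_ab_upper + f_a_upper_b_lower + f_a_lower_b_upper + f_ab_lower
    if abs(total - 1.0) > 1e-8:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = f_ab_upper + f_a_upper_b_lower  # frequency of allele A
    p_b = f_ab_upper + f_a_lower_b_upper  # frequency of allele B
    var = p_a * (1 - p_a) * p_b * (1 - p_b)
    if var == 0.0:
        raise ValueError("at least one marker is monomorphic")
    d = f_ab_upper - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d != 0 else 0.0  # signed D'
    r2 = d * d / var
    return d, d_prime, r2
```

Markers with a high r^2 carry largely redundant information, which is the property exploited when selecting tag SNPs.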

Analysis of pathways

The pathway section discusses methods for incorporating a priori biological knowledge into association testing. This can be done, for example, by jointly testing the effects of markers selected from the same biochemical pathway, or by combining information on intermediate and end phenotypes for association testing.

http://www.genestat.org/index.php?n=GeneStat.Pathways

Replication and meta-analysis

The sections on replication and meta-analysis discuss strategies for scientifically meaningful replication of a de novo gene association finding and for combining data and statistical inference across association studies. They also discuss the origin and impact of between-study heterogeneity in association studies.

http://www.genestat.org/index.php?n=GeneStat.Meta-analysis
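As one example of combining inference across studies, the sketch below implements a fixed-effect, inverse-variance-weighted meta-analysis of per-study effect estimates (for instance, log odds ratios), with Cochran's Q and the I^2 statistic as summaries of between-study heterogeneity. This is an illustrative choice of method, not necessarily the one recommended in the portal, and the function name is hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Return (pooled estimate, pooled standard error, Cochran's Q, I^2).

    betas: per-study effect estimates; ses: their standard errors.
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se ** 2 for se in ses]  # inverse-variance weights
    w_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / w_sum
    pooled_se = math.sqrt(1.0 / w_sum)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributable to heterogeneity, truncated at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i2
```

When I^2 is large, a fixed-effect summary may be misleading, and a random-effects model or an investigation of the sources of heterogeneity would usually be considered.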

Mendelian randomisation: inferring causality in observational epidemiology

This section of GENESTAT discusses Mendelian randomisation, a special design that uses genetic markers to infer causality between modifiable risk factors and disease. Inferring causality from observational data is difficult because it is not always clear which of two associated variables is the cause and which the effect, or whether both are common effects of a third, unobserved variable (a confounder). Mendelian randomisation allows one to test for, or in certain cases to estimate, a causal effect of a modifiable risk factor on disease from observational data in the presence of confounding factors, by using common genetic polymorphisms with well-understood effects on exposure patterns.

http://www.genestat.org/index.php?n=GeneStat.MendelianRandomisation
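In the simplest setting, with a single genetic variant used as an instrument and linear effects, the causal effect can be estimated by the ratio (Wald) estimator: the variant-outcome association divided by the variant-exposure association. The sketch below illustrates this under those assumptions, using a first-order approximation for the standard error that ignores uncertainty in the variant-exposure estimate. The function and argument names are hypothetical.

```python
def wald_ratio(beta_gy, se_gy, beta_gx):
    """Return (causal effect estimate, approximate standard error).

    beta_gy: estimated effect of the variant on the outcome (with standard error se_gy);
    beta_gx: estimated effect of the variant on the exposure (must be non-zero).
    """
    if beta_gx == 0:
        raise ValueError("the variant must be associated with the exposure")
    estimate = beta_gy / beta_gx
    # First-order approximation: treats beta_gx as known without error.
    se = se_gy / abs(beta_gx)
    return estimate, se
```

The estimate is only interpretable as causal if the usual instrumental variable assumptions hold, for example that the variant affects the outcome only through the exposure.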

Discussion

The usefulness of GENESTAT will be proven over time. In its current state, groups applying association methods in their daily work benefit most from GENESTAT. A partial aim of GENESTAT is also to improve the quality of statistical analyses of complex disease, and this would be beneficial for the scientific community as a whole.

There are several directions towards which the current GENESTAT information portal could be extended. Differential measurement errors in SNPs and measured lifestyle factors are worth exploring. Harmonisation of SNP measurements from different platforms calls for imputation techniques using the available HapMap data. Novel designs are needed for studying genes and the environment jointly, and with proper meta-analytic methods, the heterogeneity in the phenotype definitions and measurements and strengths of association may be addressed. An increased interest in the design and analysis of population-based studies involving epigenome, transcriptome or proteome data is also expected. The current open content management system, with a Wikipedia type of ‘edit this page’ link on every page, is trivially open for these extensions, in principle, but relies heavily on the commitment of the scientific community with expertise in these areas.

We emphasise that GENESTAT does not cover all the possible statistical methods related to genetic association studies and has no ambition to be complete at any point in time, but rather to develop and evolve over time. The aim today is to provide an interesting embryo for further development that can adapt to a variety of needs from scientists who use human samples and subject-specific information from large population groups. We welcome the broad genetic research community to visit the portal, and we specifically invite the community of statistical genetics methods developers to contribute to its content. The ultimate aim is to create a growing and constantly updated information repository for statistical genetics. The success of GENESTAT will be manifest in the number of visits to the portal and in the new contributions.