2.3. The configuration file

Which population genetic analyses are run on your population data file, and how PyPop interprets that data file, are controlled by a configuration file, whose default name is config.ini. This is another plain text file consisting of comments (lines that start with a semi-colon), sections (lines with labels in square brackets), and options (lines in option=value format that specify settings relevant to the enclosing section).

Note

If any option runs over one line (such as validSampleFields) then the second and subsequent lines must be indented by exactly one space.
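As a rough illustration of the file layout described above, the following sketch reads a small configuration with Python's standard-library configparser module, which understands comments, sections, options, and indented continuation lines. It is only a sketch: the CONFIG_TEXT contents are a made-up fragment, and nothing here is meant to describe how PyPop itself reads the file.

```python
import configparser

# Hypothetical configuration text; section and option names follow this guide.
CONFIG_TEXT = """\
; a comment line
[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
validSampleFields=*a_1
 *a_2
 *b_1
 *b_2
"""

# interpolation=None keeps values literal; optionxform=str keeps the
# mixed-case option names exactly as written.
parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str
parser.read_string(CONFIG_TEXT)

# Indented continuation lines are joined into one multi-line value.
fields = parser.get("ParseGenotypeFile", "validSampleFields").splitlines()
print(fields)
```

Each continuation line starts with a single space, as required by the note above.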

2.3.1. A minimal configuration file

Here we present a minimal .ini file corresponding to Example 2.1, “Multi-locus allele-level genotype data”. A section-by-section review of this file follows. (Note that comment lines have been omitted from the example below for clarity.) More advanced options are described in Section 2.3.2, “Advanced options”.

Example 2.7. Minimal config.ini file

[General]                    1
debug=0

[ParseGenotypeFile]          2
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
 *a_2
 *c_1
 *c_2
 *b_1
 *b_2

[HardyWeinberg]              3
lumpBelow=5

[HardyWeinbergGuoThompson]   4
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact] 5
numReplicates=10000

[Emhaplofreq]                6
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5

Configuration file sections

1

[General]

This section contains variables that control the overall behavior of PyPop.

  • debug=0: This setting is for debugging. Setting it to 1 produces a large amount of output of no interest to the general user. Use it only if you are running into trouble and need to communicate with the PyPop developers about the problem.

2

Specifying data formats

There are two possible formats: [ParseGenotypeFile] and [ParseAlleleCountFile].

[ParseGenotypeFile]

If your data is genotype data, you will want a section labeled [ParseGenotypeFile].

  • alleleDesignator: This option tells PyPop which fields contain allele data and which do not. You must use this symbol in the validSampleFields option. In general, you won't need to change it. [Default: *]

  • untypedAllele: This option tells PyPop what symbol you have used in your data files to represent untyped or unknown data fields. These fields MAY NOT BE LEFT BLANK. You must use a consistent symbol that cannot be confused with real data. [Default: ****]

  • validSampleFields: This option should contain the names of the loci immediately preceding your genotype data (if the file has three header lines, this information will be on the third line; otherwise it will be on the first line of the file). [There is no default; this option must always be present]

    The format is as follows. For each sample field (which may either be an identifying field for the sample, such as populat, or contain allele data), create a new line where:

    • The first line (validSampleFields=) consists of the name of your sample field (if it contains allele data, the name of the field should be preceded by the character designated in the alleleDesignator option above).

    • All subsequent lines after the first must be preceded by one space (again, if a field contains allele data, its name should be preceded by the character designated in the alleleDesignator option above).

    Here is an example (note the initial space at the start of each line after the first):

    validSampleFields=*a_1
     *a_2
     *c_1
     *c_2
     *b_1
     *b_2

    Here is an example that includes identifying (non-allele data) information such as sample id (id) and population name (populat):

    validSampleFields=populat
     id
     *a_1
     *a_2
     *c_1
     *c_2
     *b_1
     *b_2
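To make the role of alleleDesignator concrete, here is a minimal sketch that separates allele-data fields from identifying fields in a validSampleFields value. The function name split_sample_fields is hypothetical and the logic is only an illustration of the convention described above, not PyPop's own parsing code.

```python
def split_sample_fields(value, allele_designator="*"):
    """Separate allele-data fields from identifying fields.

    `value` is the multi-line validSampleFields value with one field per line.
    """
    allele_fields, id_fields = [], []
    for name in value.split():
        if name.startswith(allele_designator):
            # Strip the designator to recover the column name.
            allele_fields.append(name[len(allele_designator):])
        else:
            id_fields.append(name)
    return allele_fields, id_fields


value = "populat\nid\n*a_1\n*a_2\n*c_1\n*c_2\n*b_1\n*b_2"
allele_fields, id_fields = split_sample_fields(value)
print(id_fields)      # ['populat', 'id']
print(allele_fields)  # ['a_1', 'a_2', 'c_1', 'c_2', 'b_1', 'b_2']
```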

[ParseAlleleCountFile]

If your data is not genotype data but rather data in the allele-name count format, then you will want to use the [ParseAlleleCountFile] section INSTEAD of the [ParseGenotypeFile] section. The alleleDesignator and untypedAllele options work as described for [ParseGenotypeFile].

  • validSampleFields: This option should contain either a single locus name or a colon-separated list of all loci that will be in the data files you intend to analyze using a specific .ini file. The colon-separated list allows you to avoid changing the .ini file when running over a collection of data files containing different loci, e.g.,

    validSampleFields=A:B:C:DQA1:DQB1:DRB1:DPB1:DPA1
     count

    Note that each .pop file must contain only one locus (see Note in Example 2.6, “Allele count data”). Listing multiple loci simply permits the same .ini file to be reused for each data file.
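The following sketch shows one way to read the colon-separated locus list, so that a single .ini file can be checked against several single-locus data files. The function name parse_allele_count_fields is hypothetical, and the treatment of the second line (count) as a plain field name is an assumption made for illustration.

```python
def parse_allele_count_fields(value):
    """Split a [ParseAlleleCountFile] validSampleFields value.

    Returns (loci, other_fields): the loci from the colon-separated first
    line, and the remaining field names (for example 'count').
    """
    lines = [line.strip() for line in value.splitlines() if line.strip()]
    loci = lines[0].split(":")
    return loci, lines[1:]


loci, other_fields = parse_allele_count_fields(
    "A:B:C:DQA1:DQB1:DRB1:DPB1:DPA1\n count"
)
print(loci)             # ['A', 'B', 'C', 'DQA1', 'DQB1', 'DRB1', 'DPB1', 'DPA1']
print("DRB1" in loci)   # True: a DRB1 data file is covered by this .ini file
```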

3

[HardyWeinberg]

Hardy-Weinberg analysis is enabled by the presence of this section.

  • lumpBelow: This option sets a cut-off value. Alleles with an expected value equal to or less than lumpBelow will be lumped together into a single category for the purpose of calculating the degrees of freedom and overall p-value for the chi-squared Hardy-Weinberg test.
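The lumping rule can be sketched as follows. This is only an illustration of the cut-off described above: the function name lump_rare, the category label "lumped", and the expected counts are all made up, and the code is not taken from PyPop.

```python
def lump_rare(expected, lump_below=5):
    """Pool every category whose expected count is <= lump_below."""
    kept = {name: count for name, count in expected.items() if count > lump_below}
    pooled = sum(count for count in expected.values() if count <= lump_below)
    if pooled > 0:
        kept["lumped"] = pooled  # hypothetical label for the pooled category
    return kept


# Hypothetical expected counts, for illustration only.
expected = {"0101": 40.0, "0201": 12.5, "0301": 5.0, "1101": 2.5}
print(lump_rare(expected))  # {'0101': 40.0, '0201': 12.5, 'lumped': 7.5}
```

Note that a category whose expected count equals the cut-off (5.0 here) is lumped, matching the "equal to or less than" wording.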

4

[HardyWeinbergGuoThompson]

When this section is present, an implementation of the Hardy-Weinberg exact test is run using the original (Guo & Thompson, 1992) code, which uses a Monte-Carlo Markov chain (MCMC). In addition, two measures (Chen and Diff) of the goodness of fit of individual genotypes are reported under this option (Chen et al., 1999). By default this section is not enabled. This is a different implementation from the Arlequin version listed in Section 2.3.2, “Advanced options”, below.

  • dememorizationSteps: Number of steps to “burn-in” the Markov chain before statistics are collected. [Default: 2000]

  • samplingNum: Number of Markov chain samples. [Default: 1000]

  • samplingSize: Markov chain sample size. [Default: 1000]

Note that the total number of steps in the Monte-Carlo Markov chain is the product of samplingNum and samplingSize, so the default values described above would contain 1,000,000 (= 1000 x 1000) steps in the MCMC chain.

The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.
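As a quick check on how samplingNum and samplingSize determine the chain length, the arithmetic can be written out as below. The function name total_mcmc_steps is hypothetical; it only restates the product rule given above.

```python
def total_mcmc_steps(sampling_num=1000, sampling_size=1000):
    """Total chain length: number of samples times the size of each sample."""
    return sampling_num * sampling_size


print(total_mcmc_steps())            # 1000000 with the default values
print(total_mcmc_steps(2000, 1000))  # 2000000 if samplingNum is doubled
```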

5

[HomozygosityEWSlatkinExact]

The presence of this section enables Slatkin's (1994) implementation of the Ewens-Watterson exact test of neutrality.

  • numReplicates: The default value has proved to be optimal for us. There is no reason to change it unless you are particularly curious. If you change the default value and have problems, please let us know.

6

[Emhaplofreq]

The presence of this section enables haplotype estimation and calculation of linkage disequilibrium (LD) measures.

  • lociToEstHaplo: In this option you can list the multi-locus haplotypes for which you want the program to estimate haplotypes and calculate LD. It should be a comma-separated list of colon-joined loci, e.g.,

    lociToEstHaplo=a:b:drb1,a:b:c,drb1:dqa1:dpb1,drb1:dqb1:dpb1

  • allPairwiseLD: Set this to 1 (one) if you want the program to calculate all pairwise LD for your data; otherwise set this to 0 (zero).

  • allPairwiseLDWithPermu: Set this to a positive integer greater than 1 if you need to determine the significance of the pairwise LD measures enabled by the previous option. The number you use is the number of permutations that will be run to ascertain the significance (this should be at least 1000). (Note that this is done via permutation testing performed after the pairwise LD test for all pairs of loci. Note also that this test can take DAYS if your data is highly polymorphic.)

  • numPermuInitCond: Set this to change the number of initial conditions used per permutation. (Note: this parameter is only used if allPairwiseLDWithPermu is set and nonzero.) [Default: 5]
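The lociToEstHaplo syntax (commas between haplotype groups, colons between loci) and the notion of "all pairs of loci" can be illustrated with the sketch below. The function names are hypothetical and the code is not taken from PyPop; it only shows how such a value could be broken apart.

```python
from itertools import combinations


def parse_loci_to_est_haplo(value):
    """Turn 'a:b:drb1,a:b:c' into [['a', 'b', 'drb1'], ['a', 'b', 'c']]."""
    return [group.split(":") for group in value.split(",") if group]


def all_locus_pairs(loci):
    """Every unordered pair of loci, as examined by an all-pairs analysis."""
    return list(combinations(loci, 2))


groups = parse_loci_to_est_haplo("a:b:drb1,a:b:c,drb1:dqa1:dpb1,drb1:dqb1:dpb1")
print(groups[0])                        # ['a', 'b', 'drb1']
print(all_locus_pairs(["a", "c", "b"])) # [('a', 'c'), ('a', 'b'), ('c', 'b')]
```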

2.3.2. Advanced options

The following section describes additional options for the sections described previously. Most of the time these options can be omitted and PyPop will choose defaults; however, these advanced options do offer greater control over the application. In particular, customization will be required for data that has sample identifiers, as in Example 2.2, “Multi-locus allele-level HLA genotype data with sample information”, or a header data block, as in Example 2.3, “Multi-locus allele-level HLA genotype data with sample and header information”; in these cases both validSampleFields (described above) and validPopFields (described below) will need to be modified.

It also describes two extra sections related to using PyPop in conjunction with Arlequin: [Arlequin] and [HardyWeinbergGuoThompsonArlequin].

[General] advanced options

  • txtOutFilename and xmlOutFilename: If you wish to specify a particular name for the output file, which will remain identical over several runs, you can set these two items to particular values. The default is to have the program select the output filename, which can be controlled by the next variable. [Default: not used]

  • outFilePrefixType: This option can either be omitted entirely (in which case the default, filename, is used) or be set explicitly. The value filename results in three output files named original-filename-minus-suffix-out.xml, original-filename-minus-suffix-out.txt, and original-filename-minus-suffix-filter.xml. [Default: filename]

    If you set the value to date instead of filename, you'll get the date incorporated in the filename as follows: original-filename-minus-suffix-YYYY-nn-dd-HH-MM-SS-out.{xml,txt}. e.g., USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml (where Y, n, d, H, M, S refer to year, month, day, hour, minute and second, respectively).

  • xslFilename: This option specifies where to find the XSLT file used to transform PyPop's XML output into human-readable form. Most users will not need to set this option; the default is the system-installed text.xsl file.
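The naming scheme selected by outFilePrefixType can be sketched as follows. The function name output_filenames is hypothetical and the code is an illustration of the pattern described above rather than PyPop's own logic; only the two -out files are modeled, and the -filter.xml file is left out.

```python
from datetime import datetime
from pathlib import Path


def output_filenames(pop_file, prefix_type="filename", now=None):
    """Build the -out.xml and -out.txt names for a population data file."""
    stem = Path(pop_file).stem  # original filename minus its suffix
    if prefix_type == "date":
        now = now or datetime.now()
        # Year, month, day, hour, minute, second, joined by hyphens.
        stem = f"{stem}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    return [f"{stem}-out.xml", f"{stem}-out.txt"]


stamp = datetime(2003, 9, 21, 1, 29, 35)
print(output_filenames("USAFEL-UchiTelle.pop"))
print(output_filenames("USAFEL-UchiTelle.pop", "date", stamp))
```

With the fixed timestamp above, the second call reproduces the USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml name used as an example for the date setting.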

[ParseGenotypeFile] advanced options

  • fieldPairDesignator: This option allows you to override the coding of the headers for each pair of alleles at each locus; it must match the entries under validSampleFields in the config file and the entries in your population data file. If you want to use something other than _1 and _2, change this option. For instance, to use letters and parentheses, set it as follows: fieldPairDesignator=(a):(b) [Default: _1:_2]

  • popNameDesignator: A special designator marks the population name field, which is usually the first field in the data block. [Default: +]

    If you are analyzing data that contains a population name for each sample, then the first entry in your validSampleFields section should have a prefixed +, as below:

    validSampleFields=+populat
     *a_1
     *a_2
     ...

  • validPopFields: If you are analyzing data with an initial two-line population header block, as in Example 2.3, “Multi-locus allele-level HLA genotype data with sample and header information”, then you will need to set this option. It should contain the field names in the first line of the header information of your file. [Default: required when a population data block is present in the data file] For example:

    validPopFields=labcode
     method
     ethnic
     country
     latit
     longit
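To show how popNameDesignator, alleleDesignator, and fieldPairDesignator fit together, the sketch below groups the entries of validSampleFields into a population-name field and pairs of allele columns per locus. The function name group_allele_columns is hypothetical, and the suffix-matching logic is an assumption made for illustration, not a description of PyPop internals.

```python
def group_allele_columns(fields, allele_designator="*", pop_designator="+",
                         pair_designator="_1:_2"):
    """Return (population_field, {locus: [first_column, second_column]})."""
    suffixes = pair_designator.split(":")
    pop_field = None
    loci = {}
    for name in fields:
        if name.startswith(pop_designator):
            pop_field = name[len(pop_designator):]
        elif name.startswith(allele_designator):
            column = name[len(allele_designator):]
            for suffix in suffixes:
                if column.endswith(suffix):
                    # The locus name is the column name minus the pair suffix.
                    loci.setdefault(column[:-len(suffix)], []).append(column)
                    break
    return pop_field, loci


print(group_allele_columns(["+populat", "*a_1", "*a_2", "*b_1", "*b_2"]))
print(group_allele_columns(["*A(a)", "*A(b)"], pair_designator="(a):(b)"))
```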

[Emhaplofreq] advanced options

  • permutationPrintFlag: Determines whether the likelihood ratio for each permutation is logged to the XML output file. This is disabled by default. [Default: 0 (OFF)]

    Warning

    If this is enabled, it can drastically increase the size of the output XML file, on the order of the product of the number of possible pairwise comparisons and the number of permutations. Machines with lower RAM and disk space may have difficulty coping with this.
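A back-of-the-envelope estimate of how many likelihood-ratio values would be logged can be computed as below. The function name logged_values and the figures (8 loci, 1000 permutations) are hypothetical; the sketch only evaluates the product mentioned in the warning.

```python
from math import comb


def logged_values(num_loci, num_permutations):
    """Pairwise comparisons (n choose 2) times permutations per comparison."""
    return comb(num_loci, 2) * num_permutations


# Hypothetical data set: 8 loci give 28 pairs; 1000 permutations each.
print(logged_values(8, 1000))  # 28000
```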

[Arlequin] extra section

This section sets characteristics of the Arlequin application if it has been installed (it must be installed separately from PyPop as we cannot distribute it). The options in this section are only used when a test requiring Arlequin, such as its implementation of Guo and Thompson's (1992) Hardy-Weinberg exact test, is invoked (see below).

  • arlequinExec: This option specifies where to find the Arlequin executable on your system. The default assumes it is on your system path. [Default: arlecore.exe]

[HardyWeinbergGuoThompsonArlequin] extra section

When this section is present, Arlequin's implementation of the Hardy-Weinberg exact test is run, using a Monte-Carlo Markov Chain implementation. By default this section is not enabled.

  • markovChainStepsHW: Length of the Markov chain, in steps. [Default: 2500000]

  • markovChainDememorisationStepsHW: Number of steps to “burn-in” the Markov chain before statistics are collected. [Default: 5000]

The default values for options described above have proved to be optimal for us and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

[Filters] extra section

When this section is present, it allows you to specify successive filters to apply to the data.

  • filtersToApply: Here you specify which filters you want applied to the data and the order in which you want them applied. Separate each filter name with a colon (:). Currently there are four predefined filters: AnthonyNolan, Sequence, DigitBinning, and CustomBinning. If you specify one or more of these filters, you will get the default behavior of each filter. If you wish to modify the default behavior, add a section with the same name as the specified filter(s); see the following sections for details. Please note that, while you are allowed to specify any ordering for the filters, some orderings may not make sense. For example, the ordering Sequence:AnthonyNolan would not make sense (because after the Sequence filter, as far as PyPop is concerned, your alleles are now amino acid residues). However, the reverse ordering, AnthonyNolan:Sequence, would be logical and perhaps even advisable.
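The filtersToApply value can be read and sanity-checked along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the list of known filters is copied from the text above, and the only ordering check is the one example the text calls out (Sequence before AnthonyNolan).

```python
KNOWN_FILTERS = ("AnthonyNolan", "Sequence", "DigitBinning", "CustomBinning")


def parse_filters(value):
    """Split a colon-separated filtersToApply value, rejecting unknown names."""
    names = [name.strip() for name in value.split(":") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_FILTERS]
    if unknown:
        raise ValueError(f"unknown filter(s): {', '.join(unknown)}")
    return names


def has_questionable_order(names):
    """True if Sequence is applied before AnthonyNolan."""
    if "Sequence" in names and "AnthonyNolan" in names:
        return names.index("Sequence") < names.index("AnthonyNolan")
    return False


print(parse_filters("AnthonyNolan:Sequence"))                    # ['AnthonyNolan', 'Sequence']
print(has_questionable_order(["Sequence", "AnthonyNolan"]))      # True
```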

[AnthonyNolan] filter section

This section is only useful for HLA data. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your data will be filtered through the Anthony Nolan database of known HLA allele names before processing. The data files this filter relies on are not currently distributed with PyPop but can be obtained via the IMGT ftp site. Invocation of this filter will produce a popfile-filter.xml file output showing what was resolved and what could not be resolved.

  • alleleFileFormat: This option specifies which format of the Anthony Nolan allele data will be used. It can be set to either txt (for the plain free text format) or msf (for the Multiple Sequence Format). [Default: msf]

  • directory: Specifies the path to the root of the sequence files. [Default for txt files: prefix/share/PyPop/anthonynolan/HIG-seq-pep-text/] [Default for msf files: prefix/share/PyPop/anthonynolan/msf/]

  • preserve-ambiguous: The default behavior of the AnthonyNolan filter is to ignore allele ambiguity ("slash") notation. This notation, common in the literature, looks like: 010101/0102/010301. The default behavior will simply truncate this to 0101. If you want to preserve the notation, set the option to 1. This will result in a filtered allele "name" of 0101/0102/0103 in the above hypothetical example. [Default: 0]

  • preserve-unknown: The default behavior of the AnthonyNolan filter is to replace unknown alleles with the untypedAllele designator. If you want the filter to keep allele names it does not recognize, set the option to 1. [Default: 0]

  • preserve-lowres: This option is similar to preserve-unknown, but only applies to low-resolution (lowres) alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. If the preserve-unknown flag is set, this option has no effect, because all unknown alleles are preserved. [Default: 0]
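The effect of preserve-ambiguous on the hypothetical name 010101/0102/010301 can be mimicked with the sketch below. It only reproduces the two outcomes quoted above by truncating each name to four digits; the function name resolve_ambiguous is hypothetical, and the real filter consults the Anthony Nolan data files rather than truncating blindly.

```python
def resolve_ambiguous(name, preserve_ambiguous=False, digits=4):
    """Truncate each slash-separated allele name to `digits` characters."""
    parts = [part[:digits] for part in name.split("/")]
    if preserve_ambiguous:
        return "/".join(parts)  # keep the slash notation
    return parts[0]             # ignore the ambiguity, keep the first name


print(resolve_ambiguous("010101/0102/010301"))                           # 0101
print(resolve_ambiguous("010101/0102/010301", preserve_ambiguous=True))  # 0101/0102/0103
```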

[Sequence] filter section

This section allows configuration of the Sequence filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be translated into sequences, and all ensuing analyses will consider each position in the sequence to be a distinct locus. This filter makes use of the same msf format alignment files as used above in the AnthonyNolan filter. It does not work with the txt format alignment files.

  • sequenceFileSuffix: Determines which files will be examined in order to read in a sequence for each allele (i.e., if the file for locus A is A_prot.msf, the value would be _prot, whereas if you wanted to use the nucleotide sequence files, you might use _nuc). [Default: _prot]

  • directory: Specifies the path to the root of the sequence files, in the same manner as in the AnthonyNolan section, above.

[DigitBinning] filter section

This section allows configuration of the DigitBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be truncated after the nth digit.

  • binningDigits: An integer that specifies how many digits to keep after the truncation. [Default: 4]

[CustomBinning] filter section

This section allows configuration of the CustomBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above.

You can provide a set of custom rules for replacing allele names. Allele names should be separated by / marks. This filter matches any allele names that are exactly the same as the ones you list here, and will also find "close matches" (but only if there are no exact matches). Here is an example:

A=01/02/03
 04/05/0306
 !06/1201/1301
 !07/0805

In the example above, A*03 alleles will match to 01/02/03, except for A*0306, which will match to 04/05/0306. If you place a ! mark in front of the first allele name, that first name will be used as the "new name" for the binned group (for example, A*0805 will be called 07 in the custom-binned data). Note that the space at the beginning of the lines (following the first line of each locus) is important. The above rules are just dummy examples, provided to illustrate how the filter works. PyPop is distributed with a biologically relevant set of CustomBinning rules that have been compiled from several sources.[2]
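The matching behavior described for the dummy rules can be sketched as follows. Two points are assumptions made for illustration and are not confirmed by this guide: a "close match" is taken to mean that a listed name is a prefix of the allele name (the longest such prefix wins), and an allele that matches nothing is returned unchanged. The function names parse_rules and custom_bin are hypothetical.

```python
def parse_rules(value):
    """Return a list of (new_name, members) tuples, one per rule line."""
    rules = []
    for line in value.split():
        use_first = line.startswith("!")
        members = line.lstrip("!").split("/")
        # With a leading "!", the first name labels the group;
        # otherwise the whole slash-separated list is the label.
        new_name = members[0] if use_first else "/".join(members)
        rules.append((new_name, members))
    return rules


def custom_bin(allele, rules):
    # Exact matches take priority over close matches.
    for new_name, members in rules:
        if allele in members:
            return new_name
    # Assumed close match: the longest listed name that prefixes the allele.
    best = None
    for new_name, members in rules:
        for member in members:
            if allele.startswith(member) and (best is None or len(member) > len(best[0])):
                best = (member, new_name)
    return best[1] if best else allele  # assumption: unmatched names pass through


rules = parse_rules("01/02/03\n 04/05/0306\n !06/1201/1301\n !07/0805")
print(custom_bin("0301", rules))  # 01/02/03
print(custom_bin("0306", rules))  # 04/05/0306
print(custom_bin("0805", rules))  # 07
```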



[2] Mack et al. (2007); Cano (2007); The Anthony Nolan list of deleted allele names (http://www.anthonynolan.com/HIG/lists/delnames.html); and the Ambiguous Allele Combinations, release 2.18.0 (http://www.ebi.ac.uk/imgt/hla/ambig.html).