The set of population genetic analyses that are run on your population data file, and the manner in which the data file is interpreted by PyPop, are controlled by a configuration file, the default name for which is config.ini. This is another plain text file consisting of comments (lines that start with a semi-colon), sections (lines with labels in square brackets), and options (lines specifying settings relevant to that section, in the format option=value).
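As a rough illustration of the comment, section, and option syntax described above, the following sketch reads a small configuration with Python's standard `configparser` module. The inline text is a made-up miniature file, and the use of `configparser` is an assumption for demonstration; PyPop's own parsing may differ in detail.

```python
import configparser

# Hypothetical miniature configuration, written inline for illustration.
CONFIG_TEXT = """\
; a comment line starts with a semi-colon
[General]
debug=0

[HardyWeinberg]
lumpBelow=5
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep option names case-sensitive (e.g. lumpBelow)
parser.read_string(CONFIG_TEXT)

# Walk every section and print its option=value pairs.
for section in parser.sections():
    for option, value in parser.items(section):
        print(f"[{section}] {option} = {value}")
```

Values are always read as strings, so a setting such as `lumpBelow` would need an explicit conversion (for example `parser.getint`) before numeric use.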
![]() | Note |
---|---|
If any option runs over one line (such as
|
Here we present a minimal .ini file corresponding to Example 2.1, “Multi-locus allele-level genotype data”. A section-by-section review of this file follows. (Note that comment lines have been omitted from the example for clarity.) A description of more advanced options is contained in Section 2.3.2, “Advanced options”.
Example 2.7. Minimal config.ini file

[General]
debug=0

[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
  *a_2
  *c_1
  *c_2
  *b_1
  *b_2

[HardyWeinberg]
lumpBelow=5

[HardyWeinbergGuoThompson]
dememorizationSteps=2000
samplingNum=1000
samplingSize=1000

[HomozygosityEWSlatkinExact]
numReplicates=10000

[Emhaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=0
;;numPermuInitCond=5
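An option value such as validSampleFields can run over several lines. In Python's standard `configparser`, indented lines are treated as a continuation of the previous option's value. The sketch below assumes PyPop reads such values in a comparable way; it is an illustration, not PyPop's own parsing code.

```python
import configparser

# The [ParseGenotypeFile] section of the minimal example, with the
# validSampleFields value spread over indented continuation lines.
CONFIG_TEXT = """\
[ParseGenotypeFile]
untypedAllele=****
alleleDesignator=*
validSampleFields=*a_1
  *a_2
  *c_1
  *c_2
  *b_1
  *b_2
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # preserve the mixed-case option names
parser.read_string(CONFIG_TEXT)

# The continuation lines are joined with newlines; split() separates the fields.
fields = parser.get("ParseGenotypeFile", "validSampleFields").split()
print(fields)
```

A continuation line that is not indented would be read as a new (and here malformed) option, which is why the leading whitespace matters.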
Configuration file sections
[General]
This section contains variables that control the overall behavior of PyPop.

[ParseGenotypeFile]
Specifying data formats. There are two possible formats:

[HardyWeinberg]
Hardy-Weinberg analysis is enabled by the presence of this section.

[HardyWeinbergGuoThompson]
When this section is present, an implementation of the Hardy-Weinberg exact test is run using the original (Guo & Thompson, 1992) code, which uses a Monte-Carlo Markov chain (MCMC). In addition, two measures (Chen and Diff) of the goodness of fit of individual genotypes are reported under this option (Chen et al., 1999). By default this section is not enabled. This is a different implementation from the Arlequin version listed in Section 2.3.2, “Advanced options”, below.

Note that the total number of steps in the Monte-Carlo Markov chain is the product of

The default values for the options described above have proved to be optimal for us, and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.

[HomozygosityEWSlatkinExact]
The presence of this section enables Slatkin's (1994) implementation of the Ewens-Watterson exact test of neutrality.

[Emhaplofreq]
The presence of this section enables haplotype estimation and calculation of linkage disequilibrium (LD) measures.
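Because several analyses are switched on simply by the presence of their section, the set of analyses a given configuration requests can be listed by checking which section names exist. The sketch below uses the analysis section names from the minimal example and Python's `configparser`; the inline configuration is hypothetical, and this is not PyPop's own dispatch logic.

```python
import configparser

# Hypothetical configuration that enables two of the analyses.
CONFIG_TEXT = """\
[General]
debug=0

[HardyWeinberg]
lumpBelow=5

[Emhaplofreq]
allPairwiseLD=1
"""

# Analysis sections that appear in the minimal example file.
ANALYSIS_SECTIONS = [
    "HardyWeinberg",
    "HardyWeinbergGuoThompson",
    "HomozygosityEWSlatkinExact",
    "Emhaplofreq",
]

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str
parser.read_string(CONFIG_TEXT)

# An analysis counts as enabled when its section is present at all.
enabled = [name for name in ANALYSIS_SECTIONS if parser.has_section(name)]
print(enabled)
```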
The following section describes additional options for previously described sections. Most of the time these options can be omitted and PyPop will choose defaults; however, these advanced options do offer greater control over the application. In particular, customization will be required for data that has sample identifiers, as in Example 2.2, “Multi-locus allele-level HLA genotype data with sample information”, or a header data block, as in Example 2.3, “Multi-locus allele-level HLA genotype data with sample and header information”. In these cases both validSampleFields (described above) and validPopFields (described below) will need to be modified.

It also describes two extra sections related to using PyPop in conjunction with Arlequin: [Arlequin] and [HardyWeinbergGuoThompsonArlequin].
[General] advanced options

txtOutFilename and xmlOutFilename. If you wish to specify a particular name for the output file, which you want to remain identical over several runs, you can set these two items to particular values. The default is to have the program select the output filename, which can be controlled by the next variable. [Default: not used]
outFilePrefixType. This option can either be omitted entirely (in which case the default will be filename) or be set in several ways. The default is filename, which will result in three output files named original-filename-minus-suffix-out.xml, original-filename-minus-suffix-out.txt, and original-filename-minus-suffix-filter.xml. [Default: filename]

If you set the value to date instead of filename, you'll get the date incorporated in the filename as follows: original-filename-minus-suffix-YYYY-nn-dd-HH-MM-SS-out.{xml,txt}, e.g., USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml (where Y, n, d, H, M, S refer to year, month, day, hour, minute and second, respectively).
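The naming pattern can be sketched in a few lines of Python. The function name `output_names` and the `.pop` suffix on the input file are hypothetical, and the sketch covers only the `-out.xml` and `-out.txt` names for the `filename` and `date` settings; it is not PyPop's own code.

```python
from datetime import datetime
from pathlib import Path


def output_names(pop_file, prefix_type="filename", now=None):
    """Return the -out.xml and -out.txt names for a population file."""
    stem = Path(pop_file).stem  # original filename minus its suffix
    if prefix_type == "date":
        # YYYY-nn-dd-HH-MM-SS: year, month, day, hour, minute, second
        stamp = (now or datetime.now()).strftime("%Y-%m-%d-%H-%M-%S")
        stem = f"{stem}-{stamp}"
    elif prefix_type != "filename":
        # Other settings are not covered by this sketch.
        raise ValueError(f"outFilePrefixType not handled here: {prefix_type}")
    return [f"{stem}-out.xml", f"{stem}-out.txt"]


example = output_names(
    "USAFEL-UchiTelle.pop", "date", datetime(2003, 9, 21, 1, 29, 35)
)
print(example[0])  # USAFEL-UchiTelle-2003-09-21-01-29-35-out.xml
```

With `date`, each run gets distinct output names; with `filename`, repeated runs on the same input reuse the same names.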
xslFilename. This option specifies where to find the XSLT file to use for transforming PyPop's XML output into human-readable form. Most users will not normally need to set this option; the default is the system-installed text.xsl file.
[ParseGenotypeFile] advanced options

fieldPairDesignator. This option allows you to override the coding for the headers for each pair of alleles at each locus; it must match the entry in the config file under validSampleFields and the entries in your population data file. If you want to use something other than _1 and _2, change this option. For instance, to use letters and parentheses, change it as follows: fieldPairDesignator=(a):(b) [Default: _1:_2]
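To make the relationship between fieldPairDesignator and the column headers concrete, the sketch below splits the designator on the colon and appends each half to a locus name. This reading is inferred from the default `_1:_2` and the headers `a_1` and `a_2` in the minimal example; the helper `allele_columns` is hypothetical.

```python
def allele_columns(locus, field_pair_designator="_1:_2"):
    """Return the two column headers expected for one locus."""
    first, second = field_pair_designator.split(":")
    return [f"{locus}{first}", f"{locus}{second}"]


print(allele_columns("a"))             # ['a_1', 'a_2']
print(allele_columns("a", "(a):(b)"))  # ['a(a)', 'a(b)']
```

Whatever designator is chosen, the resulting headers have to appear both in validSampleFields and in the population data file.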
popNameDesignator. There is a special designator to mark the population name field, which is usually the first field in the data block. [Default: +]

If you are analyzing data that contains a population name for each sample, then the first entry in your validSampleFields option should be prefixed with +, as below:
validSampleFields=+populat *a_1 *a_2 ...
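The role of the two designators can be illustrated by classifying each validSampleFields entry by its prefix. The function `classify_fields` and the label strings are hypothetical; the sketch only assumes that `+` marks the population name and `*` marks allele columns, as in the defaults above.

```python
def classify_fields(valid_sample_fields, allele_designator="*",
                    pop_name_designator="+"):
    """Map each field name (prefix removed) to the kind of data it holds."""
    kinds = {}
    for field in valid_sample_fields.split():
        if field.startswith(pop_name_designator):
            kinds[field[len(pop_name_designator):]] = "population name"
        elif field.startswith(allele_designator):
            kinds[field[len(allele_designator):]] = "allele"
        else:
            kinds[field] = "other sample field"
    return kinds


print(classify_fields("+populat *a_1 *a_2"))
```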
validPopFields. If you are analyzing data with an initial two-line population header block, as in Example 2.3, “Multi-locus allele-level HLA genotype data with sample and header information”, then you will need to set this option. In this case, it should contain the field names in the first line of the header information of your file. [Default: required when a population data block is present in the data file], e.g.:
validPopFields=labcode method ethnic country latit longit
[Emhaplofreq] advanced options

permutationPrintFlag. Determines whether the likelihood ratio for each permutation will be logged to the XML output file. [Default: 0 (OFF)]
![]() | Warning |
---|---|
If this is enabled, it can drastically increase the size of the output XML file, on the order of the product of the number of possible pairwise comparisons and the number of permutations. Machines with limited RAM and disk space may have difficulty coping with this. |
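A quick back-of-the-envelope calculation shows how fast the logged output can grow. The sketch assumes that the pairwise comparisons are between loci, so that n loci give n(n-1)/2 pairs; the function name and the example figures are hypothetical.

```python
import math


def permutation_log_entries(num_loci, num_permutations):
    """Estimate how many likelihood ratios would be logged."""
    pairwise_comparisons = math.comb(num_loci, 2)  # n * (n - 1) / 2 pairs
    return pairwise_comparisons * num_permutations


print(permutation_log_entries(3, 1000))   # 3 pairs * 1000 permutations = 3000
print(permutation_log_entries(10, 1000))  # 45 pairs * 1000 permutations = 45000
```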
[Arlequin] extra section

This section sets characteristics of the Arlequin application if it has been installed (it must be installed separately from PyPop, as we cannot distribute it). The options in this section are only used when a test requiring Arlequin, such as its implementation of Guo and Thompson's (1992) Hardy-Weinberg exact test, is invoked (see below).

arlequinExec. This option specifies where to find the Arlequin executable on your system. The default assumes it is on your system path. [Default: arlecore.exe]
[HardyWeinbergGuoThompsonArlequin] extra section

When this section is present, Arlequin's implementation of the Hardy-Weinberg exact test is run, using a Monte-Carlo Markov chain implementation. By default this section is not enabled.

markovChainStepsHW. Number of steps in the Markov chain. [Default: 2500000]

markovChainDememorisationStepsHW. Number of steps to “burn in” the Markov chain before statistics are collected. [Default: 5000]

The default values for the options described above have proved to be optimal for us, and if the options are not provided these defaults will be used. If you change the values and have problems, please let us know.
[Filters] extra section

When this section is present, it allows you to apply successive filters to the data.

filtersToApply. Here you specify which filters you want applied to the data and the order in which you want them applied. Separate each filter name with a colon (:). Currently there are four predefined filters: AnthonyNolan, Sequence, DigitBinning, and CustomBinning. If you specify one or more of these filters, you will get the default behavior of the filter. If you wish to modify the default behavior, you should add a section with the same name as the specified filter(s); see the filter sections that follow for more on this. Please note that, while you are allowed to specify any ordering for the filters, some orderings may not make sense. For example, the ordering Sequence:AnthonyNolan would not make sense, because once the Sequence filter has run, your alleles are, as far as PyPop is concerned, amino acid residues. However, the reverse ordering, AnthonyNolan:Sequence, would be logical and perhaps even advisable.
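The colon-separated filter list and the ordering caveat can be sketched as a small check. The function `parse_filters` and its warning text are hypothetical; PyPop itself may not issue any such warning.

```python
KNOWN_FILTERS = {"AnthonyNolan", "Sequence", "DigitBinning", "CustomBinning"}


def parse_filters(filters_to_apply):
    """Split a filtersToApply value and flag an ordering that makes no sense."""
    filters = [name for name in filters_to_apply.split(":") if name]
    unknown = [name for name in filters if name not in KNOWN_FILTERS]
    if unknown:
        raise ValueError(f"not a predefined filter: {unknown}")

    warnings = []
    if "Sequence" in filters and "AnthonyNolan" in filters:
        if filters.index("Sequence") < filters.index("AnthonyNolan"):
            # After Sequence, allele names have been replaced by residues.
            warnings.append("AnthonyNolan is applied after Sequence")
    return filters, warnings


print(parse_filters("AnthonyNolan:Sequence"))
print(parse_filters("Sequence:AnthonyNolan"))
```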
[AnthonyNolan] filter section

This section is only useful for HLA data. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your data will be filtered through the Anthony Nolan database of known HLA allele names before processing. The data files this filter relies on are not currently distributed with PyPop but can be obtained via the IMGT ftp site. Invocation of this filter will produce a popfile-filter.xml output file showing what was resolved and what could not be resolved.
alleleFileFormat. This option specifies which format of the Anthony Nolan allele data will be used. The option can be set to either txt (for the plain free text format) or msf (for the Multiple Sequence Format). [Default: msf]
directory. Specifies the path to the root of the sequence files. For txt files: [Default: prefix/share/PyPop/anthonynolan/HIG-seq-pep-text/]. For msf files: [Default: prefix/share/PyPop/anthonynolan/msf/].
preserve-ambiguous. The default behavior of the AnthonyNolan filter is to ignore allele ambiguity ("slash") notation. This notation, common in the literature, looks like: 010101/0102/010301. The default behavior will simply truncate this to 0101. If you want to preserve the notation, set the option to 1. This will result in a filtered allele "name" of 0101/0102/0103 in the above hypothetical example. [Default: 0]
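The effect of preserve-ambiguous on the example name can be mimicked with plain string handling. This is only a sketch of the naming outcome described above, assuming each allele is shortened to four digits; the actual filter resolves names against the Anthony Nolan data files, and `resolve_ambiguous` is a hypothetical helper.

```python
def resolve_ambiguous(name, preserve_ambiguous=False, digits=4):
    """Reduce a slash-notation allele name as described for the filter."""
    alleles = name.split("/")
    if not preserve_ambiguous:
        # Default: ignore the ambiguity and keep only the first allele.
        return alleles[0][:digits]
    # Preserved: keep every alternative, each shortened to `digits` digits.
    return "/".join(allele[:digits] for allele in alleles)


print(resolve_ambiguous("010101/0102/010301"))        # 0101
print(resolve_ambiguous("010101/0102/010301", True))  # 0101/0102/0103
```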
preserve-unknown. The default behavior of the AnthonyNolan filter is to replace unknown alleles with the untypedAllele designator. If you want the filter to keep allele names it does not recognize, set the option to 1. [Default: 0]

preserve-lowres. This option is similar to preserve-unknown, but only applies to low-resolution alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, which is usually 4 digits. If the preserve-unknown flag is set, this option has no effect, because all unknown alleles are preserved anyway. [Default: 0]
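How the two flags interact can be written out as a small decision function. The function `filter_allele`, the set of known names, and the way recognition is tested are all assumptions made for illustration; only the precedence of preserve-unknown over preserve-lowres is taken from the descriptions above.

```python
def filter_allele(name, known_alleles, preserve_unknown=False,
                  preserve_lowres=False, untyped_allele="****",
                  full_length=4):
    """Decide what an allele name becomes after filtering."""
    if name in known_alleles:
        return name                # recognized names pass through
    if preserve_unknown:
        return name                # keep every unrecognized name
    if preserve_lowres and len(name) < full_length:
        return name                # keep only short (low-resolution) names
    return untyped_allele          # otherwise mark the allele as untyped


known = {"0101", "0201"}           # hypothetical set of known allele names
print(filter_allele("9999", known))
print(filter_allele("9999", known, preserve_unknown=True))
print(filter_allele("01", known, preserve_lowres=True))
```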
[Sequence] filter section

This section allows configuration of the Sequence filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be translated into sequences, and all ensuing analyses will consider each position in the sequence to be a distinct locus. This filter makes use of the same msf format alignment files as used above in the AnthonyNolan filter. It does not work with the txt format alignment files.

sequenceFileSuffix. Determines the files that will be examined in order to read in a sequence for each allele (i.e., if the file for locus A is A_prot.msf, the value would be _prot, whereas if you wanted to use the nucleotide sequence files, you might use _nuc). [Default: _prot]

directory. Specifies the path to the root of the sequence files, in the same manner as in the AnthonyNolan section, above.
[DigitBinning] filter section

This section allows configuration of the DigitBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above. If so enabled, your allele names will be truncated after the nth digit.

binningDigits. An integer that specifies how many digits to keep after the truncation. [Default: 4]
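Truncation after the nth digit can be sketched as follows. The helper `digit_bin` is hypothetical, and how the real filter treats non-digit characters in a name is not specified here; this version simply keeps every character up to and including the nth digit.

```python
def digit_bin(name, binning_digits=4):
    """Truncate an allele name after its nth digit."""
    if binning_digits < 1:
        raise ValueError("binning_digits must be at least 1")
    kept = []
    digits_seen = 0
    for char in name:
        kept.append(char)
        if char.isdigit():
            digits_seen += 1
            if digits_seen == binning_digits:
                break  # everything after the nth digit is dropped
    return "".join(kept)


print(digit_bin("010101"))     # 0101
print(digit_bin("010101", 2))  # 01
```

Names that already have fewer digits than binningDigits are returned unchanged.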
[CustomBinning] filter section

This section allows configuration of the CustomBinning filter. Like all filter sections, it will only be used if present in the filtersToApply line specified above.

You can provide a set of custom rules for replacing allele names. Allele names should be separated by / marks. This filter matches any allele names that are exactly the same as the ones you list here, and will also find "close matches" (but only if there are no exact matches). Here is an example:

A=01/02/03
 04/05/0306
 !06/1201/1301
 !07/0805

In the example above, A*03 alleles will match to 01/02/03, except for A*0306, which will match to 04/05/0306. If you place a ! mark in front of the first allele name, that first name will be used as the "new name" for the binned group (for example, A*0805 will be called 07 in the custom-binned data). Note that the space at the beginning of the lines (following the first line of each locus) is important. The above rules are just dummy examples, provided to illustrate how the filter works.
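The matching behavior described above can be sketched in Python using the dummy rules for locus A. Two points are assumptions rather than documented behavior: a "close match" is taken to mean that a listed name is a prefix of the allele, and when several groups match, the first one wins. The functions `parse_rules` and `custom_bin` are hypothetical.

```python
RULES_TEXT = """\
A=01/02/03
 04/05/0306
 !06/1201/1301
 !07/0805
"""


def parse_rules(text):
    """Map each locus to a list of (new_name, allele_names) groups."""
    rules = {}
    locus = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():      # first line of a locus: LOCUS=group
            locus, _, group = line.partition("=")
            rules[locus] = []
        elif locus is None:
            raise ValueError("continuation line before any locus")
        else:                          # later groups start with a space
            group = line
        group = group.strip()
        names = group.lstrip("!").split("/")
        # With a leading "!", the first name becomes the new name.
        new_name = names[0] if group.startswith("!") else group
        rules[locus].append((new_name, names))
    return rules


def custom_bin(locus, allele, rules):
    """Return the binned name for an allele, or the allele if nothing matches."""
    groups = rules.get(locus, [])
    for new_name, names in groups:     # exact matches take priority
        if allele in names:
            return new_name
    for new_name, names in groups:     # then close (prefix) matches
        if any(allele.startswith(name) for name in names):
            return new_name
    return allele


rules = parse_rules(RULES_TEXT)
print(custom_bin("A", "0301", rules))  # 01/02/03
print(custom_bin("A", "0306", rules))  # 04/05/0306
print(custom_bin("A", "0805", rules))  # 07
```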
PyPop is distributed with a biologically relevant set of CustomBinning rules that have been compiled from several sources.[2]

[2] Mack et al. (2007); Cano (2007); The Anthony Nolan list of deleted allele names (http://www.anthonynolan.com/HIG/lists/delnames.html); and the Ambiguous Allele Combinations, release 2.18.0 (http://www.ebi.ac.uk/imgt/hla/ambig.html).