Data can be input either as genotypes, or in an allele count format, depending on the format of your data.
As you will see in the following examples, population files begin with header information. In the simplest case, the first line contains the column headers for the genotype, allele count, or sequence information from the population. If the file contains a population data-block, then the first line consists of headers identifying the data on the second line, and the third line contains the column headers for the genotype or allele count information.
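The layout described above can be sketched in Python. This is only an illustration, not PyPop's own parser: the function name `split_population_file` and the `has_population_block` flag are hypothetical, and the caller is assumed to know in advance whether the file starts with a population data-block.

```python
import csv
import io


def split_population_file(text, has_population_block):
    """Split tab-delimited text into (population_info, column_headers, rows).

    Hypothetical helper for illustration; not part of PyPop.
    """
    lines = list(csv.reader(io.StringIO(text), delimiter="\t"))
    population_info = {}
    if has_population_block:
        # Line 1 names the population-level fields, line 2 holds their values.
        population_info = dict(zip(lines[0], lines[1]))
        lines = lines[2:]
    # The next line holds the column headers for the sample data.
    return population_info, lines[0], lines[1:]


example = (
    "labcode\tethnic\n"
    "USAFEL\tTelle\n"
    "populat\tid\ta_1\ta_2\n"
    "UchiTelle\tUT900-24\t0101\t0201\n"
)
info, headers, rows = split_population_file(example, has_population_block=True)
```

With `has_population_block=False`, the first line is taken directly as the column headers, which corresponds to the simplest case.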
Note that for genotype data, each locus corresponds to two columns in the population file. The locus name must be repeated, with a suffix such as _1, _2 (the default) or _a, _b, and must match the format defined in the config.ini file (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.
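As a rough illustration of the two-columns-per-locus convention, the following sketch pairs up header names by suffix. The helper `pair_locus_columns` is hypothetical and is not how PyPop itself matches columns (PyPop uses the fields listed in config.ini); it also assumes that no non-locus column name happens to end in one of the suffixes.

```python
def pair_locus_columns(headers, suffixes=("_1", "_2")):
    """Map each locus name to the indices of its two columns.

    Hypothetical helper for illustration; not part of PyPop.
    """
    first, second = suffixes
    pairs = {}
    for index, name in enumerate(headers):
        if name.endswith(first):
            pairs.setdefault(name[: -len(first)], [None, None])[0] = index
        elif name.endswith(second):
            pairs.setdefault(name[: -len(second)], [None, None])[1] = index
    # Every locus needs both of its columns.
    incomplete = [locus for locus, cols in pairs.items() if None in cols]
    if incomplete:
        raise ValueError(f"loci missing a column: {incomplete}")
    return {locus: tuple(cols) for locus, cols in pairs.items()}


headers = ["populat", "id", "a_1", "a_2", "c_1", "c_2", "b_1", "b_2"]
columns = pair_locus_columns(headers)
```

Passing `suffixes=("_a", "_b")` would handle the alternative naming mentioned above.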
Example 2.7, “Minimal config.ini file” shows the relevant lines for the configuration to read in the data shown in Example 2.1, “Multi-locus allele-level genotype data” through to Example 2.6, “Allele count data”.
Example 2.1. Multi-locus allele-level genotype data
    a_1    a_2    c_1   c_2    b_1    b_2
    ****   ****   0102  02025  1301   18012
    0101   0201   0307  0605   1401   39021
    0210   03012  0712  0102   1520   1301
    0101   0218   0804  1202   35091  4005
    2501   0201   1507  0307   51013  1401
    0210   3204   1801  0102   78021  1301
    03012  3204   1507  0605   51013  39021
This is an example of the simplest kind of data file.
Example 2.2. Multi-locus allele-level HLA genotype data with sample information
    populat    id        a_1    a_2    c_1   c_2    b_1    b_2
    UchiTelle  UT900-23  ****   ****   0102  02025  1301   18012
    UchiTelle  UT900-24  0101   0201   0307  0605   1401   39021
    UchiTelle  UT900-25  0210   03012  0712  0102   1520   1301
    UchiTelle  UT900-26  0101   0218   0804  1202   35091  4005
    UchiTelle  UT910-01  2501   0201   1507  0307   51013  1401
    UchiTelle  UT910-02  0210   3204   1801  0102   78021  1301
    UchiTelle  UT910-03  03012  3204   1507  0605   51013  39021
This example shows a data file which has non-allele data in some columns: here we have population (populat) and sample identifiers (id).
Example 2.3. Multi-locus allele-level HLA genotype data with sample and header information
    labcode  method              ethnic  contin   collect         latit            longit
    USAFEL   12th Workshop SSOP  Telle   NW Asia  Targen Village  41 deg 12 min N  94 deg 7 min E
    populat    id        a_1    a_2    c_1   c_2    b_1    b_2
    UchiTelle  UT900-23  ****   ****   0102  02025  1301   18012
    UchiTelle  UT900-24  0101   0201   0307  0605   1401   39021
    UchiTelle  UT900-25  0210   03012  0712  0102   1520   1301
    UchiTelle  UT900-26  0101   0218   0804  1202   35091  4005
    UchiTelle  UT910-01  2501   0201   1507  0307   51013  1401
    UchiTelle  UT910-02  0210   3204   1801  0102   78021  1301
    UchiTelle  UT910-03  03012  3204   1507  0605   51013  39021
This is an example of a data file which is identical to Example 2.2, “Multi-locus allele-level HLA genotype data with sample information”, but which includes population level information.
Example 2.4. Multi-locus allele-level HLA genotype and microsatellite genotype data with header information
    labcode  ethnic  complex
    USAFEL   ****    0
    populat    id     drb1_1  drb1_2  dqb1_1  dqb1_2  d6s2222_1  d6s2222_2
    UchiTelle  HJK_2  01      0301    0201    0501    249        249
    UchiTelle  HJK_1  0301    0301    0201    0201    249        249
    UchiTelle  HJK_3  01      0301    0201    0501    249        249
    UchiTelle  HJK_4  01      0301    0201    0501    249        249
    UchiTelle  MYU_2  02      0401    0302    0602    247        249
    UchiTelle  MYU_1  0301    0301    0201    0201    247        249
    UchiTelle  MYU_3  0301    0401    0201    0302    249        249
    UchiTelle  MYU_4  0301    0401    0201    0302    247        249
This example mixes different kinds of data: HLA allele data (from DRB1 and DQB1 loci) with microsatellite data (locus D6S2222).
Example 2.5. Sequence genotype data with header information
    labcode  file
    BLOGGS   C_New
    popName  ID     TGFB1cdn10(1)  TGFB1cdn10(2)  TGFBhapl(1)  TGFBhapl(2)
    Urboro   XQ-1   C              T              CG           TG
    Urboro   XQ-2   C              C              CG           CG
    Urboro   XQ-5   C              T              CG           TG
    Urboro   XQ-21  C              T              CG           TG
    Urboro   XQ-7   C              T              CG           TG
    Urboro   XQ-20  C              T              CG           TG
    Urboro   XQ-6   T              T              TG           TG
    Urboro   XQ-8   C              T              CG           TG
    Urboro   XQ-9   T              T              TG           TG
    Urboro   XQ-10  C              T              CG           TG
This example includes nucleotide sequence data: the TGFB1cdn10 locus consists of one nucleotide. The TGFBhapl locus is actually haplotype data, but PyPop simply treats each combination as a separate "allele" for subsequent analysis.
Example 2.6. Allele count data
    populat    method   ethnic   country  latit    longit
    UchiTelle  PCR-SSO  Klingon  QZ       052.81N  100.25E
    dqa1  count
    0101  31
    0102  37
    0103  17
    0201  21
    0301  32
    0401  9
    0501  35
PyPop can also process allele count data. However, you cannot mix allele count data and genotype data in a single file.
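To show how the two formats relate, the sketch below tallies the alleles of one locus from genotype pairs and lays them out in the two-column allele count style. It is an illustration only, not something PyPop requires or performs in this way; the variable names are hypothetical, and the genotype pairs are the a_1 and a_2 columns copied from Example 2.1.

```python
from collections import Counter

UNTYPED = "****"  # four asterisks mark untyped data in the examples

# (a_1, a_2) pairs for locus "a", copied from Example 2.1.
genotypes = [
    ("****", "****"),
    ("0101", "0201"),
    ("0210", "03012"),
    ("0101", "0218"),
    ("2501", "0201"),
    ("0210", "3204"),
    ("03012", "3204"),
]

# Count every typed allele; both columns contribute equally.
counts = Counter(
    allele for pair in genotypes for allele in pair if allele != UNTYPED
)

# Lay the tallies out as tab-delimited "allele<TAB>count" lines under a header.
lines = ["a\tcount"] + [f"{allele}\t{n}" for allele, n in sorted(counts.items())]
allele_count_text = "\n".join(lines)
```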
![]() | Note |
---|---|
Currently each |
These population files are plain text files, such as you might save out of the Notepad application on Windows (or Emacs). The columns are all tab-delimited, so you can include spaces in your labels. If you have your data in a spreadsheet application, such as Excel or OpenOffice.org, export the file as tab-delimited text in order to use it as a PyPop data file.
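If you generate your data with a script rather than a spreadsheet, the tab-delimited layout can be produced with Python's standard csv module. The following is a minimal sketch: the rows are made-up sample values and an in-memory buffer stands in for a real file on disk.

```python
import csv
import io

rows = [
    ["labcode", "method", "contin"],
    ["USAFEL", "12th Workshop SSOP", "NW Asia"],
]

# Write tab-delimited text; io.StringIO stands in for a file on disk.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Reading it back splits on tabs only, so spaces inside labels are preserved.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
```

Because only the tab character separates columns, a label such as "12th Workshop SSOP" stays in a single cell.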
Untyped or missing data may be represented in a variety of ways. The default value for untyped or missing data is a series of four asterisks (****), as specified in config.ini. You may not represent untyped data by leaving a column blank, nor may you represent a homozygote by leaving the second column blank. All cells for which you have data must include data, and all cells for which you do not have data must also be filled in, using a missing data value.
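Before running an analysis, you may want to check a file for blank cells yourself. The sketch below is one way to do that; `find_blank_cells` is a hypothetical helper, not a PyPop function, and the sample rows are invented for illustration.

```python
def find_blank_cells(headers, rows):
    """Return (row_number, column_name) for every empty cell.

    Hypothetical helper for illustration; not part of PyPop.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        # Pad short rows so that trailing missing cells are reported too.
        padded = list(row) + [""] * (len(headers) - len(row))
        for name, cell in zip(headers, padded):
            if cell.strip() == "":
                problems.append((row_number, name))
    return problems


headers = ["id", "a_1", "a_2"]
rows = [
    ["UT900-23", "****", "****"],  # untyped: both cells carry the missing-data value
    ["UT900-24", "0101", "0101"],  # homozygote: allele written in both columns
    ["UT900-25", "0210", ""],      # invalid: second column left blank
]
blank = find_blank_cells(headers, rows)
```

Here only the third row is reported, because its second allele column was left empty instead of being filled with an allele or the missing data value.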
For individuals who were not typed at every locus, the data at the loci for which they are typed will be used in all single-locus analyses for that individual and locus, so the number of individuals (n) may vary from locus to locus in the output. These individuals' data will also be used for multi-locus analyses. Only the loci that contain no missing data will be included in any multi-locus analysis.
If an individual is only partially typed at a locus, it will be treated as if it were completely untyped, and data for that individual for that locus will be dropped from ALL analyses.
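The single-locus rules above can be illustrated with a short sketch that counts how many individuals would contribute at each locus. This is not PyPop's implementation: `typed_counts` is a hypothetical helper, the genotype values are invented, and the multi-locus rules are deliberately left out.

```python
UNTYPED = "****"  # default missing-data value


def typed_counts(genotypes_by_locus):
    """Count the individuals usable at each locus.

    Hypothetical helper for illustration; not part of PyPop.
    """
    n = {}
    for locus, pairs in genotypes_by_locus.items():
        # A pair counts only when neither allele is missing, so a partially
        # typed individual is dropped just like a completely untyped one.
        n[locus] = sum(1 for a, b in pairs if UNTYPED not in (a, b))
    return n


genotypes_by_locus = {
    "a": [("****", "****"), ("0101", "0201"), ("0210", "****")],
    "c": [("0102", "02025"), ("0307", "0605"), ("0712", "0102")],
}
n_per_locus = typed_counts(genotypes_by_locus)
```

In this made-up data, locus a keeps one individual and locus c keeps all three, which is why n can differ between loci in the output.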
![]() | Current limitations of PyPop |
---|---|
|
[1] These hardcoded numbers can be changed if you obtain the source code yourself, change the appropriate #define in emhaplofreq.h, and recompile the program.