2.2. The data file


2.2.1. Sample files

Data can be input either as genotypes or as allele counts, depending on how your data are recorded.

As you will see in the following examples, population files begin with header information. In the simplest case, the first line contains the column headers for the genotype, allele count, or sequence information from the population. If the file contains a population data-block, then the first line consists of headers identifying the data on the second line, and the third line contains the column headers for the genotype or allele count information.

Note that for genotype data, each locus corresponds to two columns in the population file. The locus name must be repeated, with a suffix such as _1, _2 (the default) or _a, _b, and must match the format defined in the config.ini (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.
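The two-columns-per-locus convention can be illustrated with a short Python sketch. This is not PyPop's own parsing code (PyPop matches columns against validSampleFields in config.ini); the function name `pair_locus_columns` and its suffix-matching approach are assumptions made purely for illustration.

```python
def pair_locus_columns(headers, suffixes=("_1", "_2")):
    """Map each locus name to the indexes of its two allele columns.

    Hypothetical helper: columns are matched purely by suffix, which is a
    simplification of how PyPop itself identifies sample fields.
    """
    loci = {}
    for index, name in enumerate(headers):
        for slot, suffix in enumerate(suffixes):
            if name.endswith(suffix):
                locus = name[: -len(suffix)]
                loci.setdefault(locus, [None, None])[slot] = index
    # Every locus needs both of its columns.
    incomplete = [locus for locus, cols in loci.items() if None in cols]
    if incomplete:
        raise ValueError(f"loci missing a column: {incomplete}")
    return {locus: tuple(cols) for locus, cols in loci.items()}


headers = ["populat", "id", "a_1", "a_2", "c_1", "c_2", "b_1", "b_2"]
locus_columns = pair_locus_columns(headers)
print(locus_columns)  # {'a': (2, 3), 'c': (4, 5), 'b': (6, 7)}
```

Columns without a recognized suffix (here populat and id) are simply ignored by this sketch.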

Example 2.7, “Minimal config.ini file” shows the relevant lines for the configuration to read in the data shown in Example 2.1, “Multi-locus allele-level genotype data” through to Example 2.6, “Allele count data”.

Example 2.1. Multi-locus allele-level genotype data

a_1   a_2   c_1   c_2   b_1   b_2
****  ****  0102  02025 1301  18012 
0101  0201  0307  0605  1401  39021 
0210  03012 0712  0102  1520  1301  
0101  0218  0804  1202  35091 4005  
2501  0201  1507  0307  51013 1401  
0210  3204  1801  0102  78021 1301  
03012 3204  1507  0605  51013 39021

This is an example of the simplest kind of data file.

Example 2.2. Multi-locus allele-level HLA genotype data with sample information

populat    id        a_1   a_2   c_1   c_2   b_1   b_2
UchiTelle  UT900-23  ****  ****  0102  02025 1301  18012 
UchiTelle  UT900-24  0101  0201  0307  0605  1401  39021 
UchiTelle  UT900-25  0210  03012 0712  0102  1520  1301  
UchiTelle  UT900-26  0101  0218  0804  1202  35091 4005  
UchiTelle  UT910-01  2501  0201  1507  0307  51013 1401  
UchiTelle  UT910-02  0210  3204  1801  0102  78021 1301  
UchiTelle  UT910-03  03012 3204  1507  0605  51013 39021

This example shows a data file that has non-allele data in some columns: here, population (populat) and sample identifiers (id).

Example 2.3. Multi-locus allele-level HLA genotype data with sample and header information

labcode method              ethnic  contin  collect        latit           longit          
USAFEL  12th Workshop SSOP  Telle   NW Asia Targen Village 41 deg 12 min N 94 deg 7 min E  
populat     id         a_1     a_2     c_1     c_2     b_1     b_2     
UchiTelle   UT900-23   ****    ****    0102    02025   1301    18012   
UchiTelle   UT900-24   0101    0201    0307    0605    1401    39021   
UchiTelle   UT900-25   0210    03012   0712    0102    1520    1301    
UchiTelle   UT900-26   0101    0218    0804    1202    35091   4005    
UchiTelle   UT910-01   2501    0201    1507    0307    51013   1401    
UchiTelle   UT910-02   0210    3204    1801    0102    78021   1301    
UchiTelle   UT910-03   03012   3204    1507    0605    51013   39021

This example is identical to Example 2.2, “Multi-locus allele-level HLA genotype data with sample information”, except that it also includes population-level information.

Example 2.4. Multi-locus allele-level HLA genotype and microsatellite genotype data with header information

labcode ethnic  complex
USAFEL  ****    0
populat    id      drb1_1  drb1_2  dqb1_1  dqb1_2  d6s2222_1  d6s2222_2  
UchiTelle  HJK_2   01      0301    0201     0501    249        249        
UchiTelle  HJK_1   0301    0301    0201     0201    249        249        
UchiTelle  HJK_3   01      0301    0201     0501    249        249        
UchiTelle  HJK_4   01      0301    0201     0501    249        249        
UchiTelle  MYU_2   02      0401    0302     0602    247        249        
UchiTelle  MYU_1   0301    0301    0201     0201    247        249        
UchiTelle  MYU_3   0301    0401    0201     0302    249        249        
UchiTelle  MYU_4   0301    0401    0201     0302    247        249

This example mixes different kinds of data: HLA allele data (from DRB1 and DQB1 loci) with microsatellite data (locus D6S2222).

Example 2.5. Sequence genotype data with header information

labcode file                                                
BLOGGS  C_New
popName ID       TGFB1cdn10(1) TGFB1cdn10(2) TGFBhapl(1) TGFBhapl(2) 
Urboro  XQ-1     C             T             CG          TG     
Urboro  XQ-2     C             C             CG          CG     
Urboro  XQ-5     C             T             CG          TG     
Urboro  XQ-21    C             T             CG          TG     
Urboro  XQ-7     C             T             CG          TG     
Urboro  XQ-20    C             T             CG          TG     
Urboro  XQ-6     T             T             TG          TG     
Urboro  XQ-8     C             T             CG          TG     
Urboro  XQ-9     T             T             TG          TG     
Urboro  XQ-10    C             T             CG          TG

This example includes nucleotide sequence data: the TGFB1cdn10 locus consists of one nucleotide. The TGFBhapl locus is actually haplotype data, but PyPop simply treats each combination as a separate "allele" for subsequent analysis.

Example 2.6. Allele count data

populat    method  ethnic     country    latit   longit
UchiTelle  PCR-SSO Klingon    QZ         052.81N 100.25E
dqa1  count
0101  31
0102  37
0103  17
0201  21
0301  32
0401  9
0501  35

PyPop can also process allele count data. However, you cannot mix allele count data and genotype data in the same file.

	Note
	Currently each `.pop` file can only contain allele count data for one locus. To process multiple loci for one population, you must create a separate `.pop` file for each locus.
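As a rough illustration of what an allele count table represents, the following Python sketch converts the counts from Example 2.6 into allele frequencies. It is a plain arithmetic example, not a reproduction of PyPop's analysis or output; the variable names are arbitrary.

```python
# Allele counts for the dqa1 locus, copied from Example 2.6.
counts = {
    "0101": 31, "0102": 37, "0103": 17, "0201": 21,
    "0301": 32, "0401": 9, "0501": 35,
}

total = sum(counts.values())  # total number of alleles counted
frequencies = {allele: count / total for allele, count in counts.items()}

for allele, freq in frequencies.items():
    print(f"{allele}\t{counts[allele]}\t{freq:.4f}")
```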

These population files are plain text files, such as you might save from the Notepad application on Windows (or Emacs). The columns are all tab-delimited, so you can include spaces in your labels. If you have your data in a spreadsheet application, such as Excel or OpenOffice.org, export the file as tab-delimited text in order to use it as a PyPop data file.
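To make the tab-delimited layout concrete, here is a minimal Python sketch that splits such a file into its optional population data-block, its column headers, and its data rows. The function `read_pop_file` and its `has_pop_block` flag are hypothetical; PyPop itself decides how to interpret the header lines from config.ini, which this sketch does not attempt.

```python
import csv
import io


def read_pop_file(handle, has_pop_block=False):
    """Return (population_info, column_headers, data_rows).

    Hypothetical reader: the caller states whether the first two lines
    form a population data-block, rather than inferring it from config.ini.
    """
    rows = list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
    pop_info = {}
    if has_pop_block:
        pop_headers, pop_values, *rows = rows
        pop_info = dict(zip(pop_headers, pop_values))
    headers, *data = rows
    return pop_info, headers, data


# First lines of Example 2.4, with explicit tab characters.
sample = (
    "labcode\tethnic\tcomplex\n"
    "USAFEL\t****\t0\n"
    "populat\tid\tdrb1_1\tdrb1_2\n"
    "UchiTelle\tHJK_2\t01\t0301\n"
    "UchiTelle\tHJK_1\t0301\t0301\n"
)
pop_info, headers, data = read_pop_file(io.StringIO(sample), has_pop_block=True)
print(pop_info)
print(headers)
print(len(data), "individuals")
```

Because only the tab character separates fields, a value such as "12th Workshop SSOP" stays in a single column.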

2.2.2. Missing data

Untyped or missing data may be represented in a variety of ways. The default value for untyped or missing data is a series of four asterisks (****), as specified in the config.ini. You may not represent untyped data by leaving a column blank, nor may you represent a homozygote by leaving the second column blank. All cells for which you have data must include data, and all cells for which you do not have data must also be filled in, using a missing data value.

For individuals who were not typed at all loci, the data at the loci for which they are typed will be used in all single-locus analyses for that individual and locus, so you will see the number of individuals (n) vary from locus to locus in the output. These individuals' data will also be used for multi-locus analyses. Only the loci that contain no missing data will be included in any multi-locus analysis.

If an individual is only partially typed at a locus, it will be treated as if it were completely untyped, and data for that individual for that locus will be dropped from ALL analyses.
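The effect of these rules on the per-locus sample size can be sketched in Python as follows. This is an illustration of the rules as described above, not PyPop's implementation; the helper names are hypothetical, and the third row is an invented, partially typed individual added only to show that case.

```python
MISSING = "****"  # default untyped marker described above


def typed_genotype(allele_1, allele_2, missing=MISSING):
    """Return the genotype, or None if either allele is untyped.

    A partially typed locus is treated as completely untyped.
    """
    if allele_1 == missing or allele_2 == missing:
        return None
    return (allele_1, allele_2)


def individuals_per_locus(rows, loci):
    """Count the individuals that are fully typed at each locus."""
    return {
        locus: sum(
            typed_genotype(row[f"{locus}_1"], row[f"{locus}_2"]) is not None
            for row in rows
        )
        for locus in loci
    }


rows = [
    {"a_1": "****", "a_2": "****", "c_1": "0102", "c_2": "02025"},
    {"a_1": "0101", "a_2": "0201", "c_1": "0307", "c_2": "0605"},
    # Hypothetical partially typed individual (not from the examples above).
    {"a_1": "0210", "a_2": "****", "c_1": "0712", "c_2": "0102"},
]
n_by_locus = individuals_per_locus(rows, ["a", "c"])
print(n_by_locus)  # {'a': 1, 'c': 3}
```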

	Current limitations of PyPop
	Do not leave trailing blank lines at the end of your data file, as this currently causes `PyPop` to terminate with an error message that takes experience to diagnose. For haplotype estimation and linkage disequilibrium calculations (i.e., the emhaplofreq part of the program) you are currently restricted to a maximum of seven loci per haplotype request. For haplotype estimation there is also a limit of 5000 on the number of individuals (`n`).^[1]
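One way to guard against the trailing-blank-line problem is to clean the file before running PyPop. The Python sketch below is a hypothetical pre-processing helper, not part of PyPop; it assumes that a single terminating newline after the last data row is acceptable.

```python
def strip_trailing_blank_lines(text):
    """Remove blank (whitespace-only) lines from the end of the text.

    Assumption: the last data row should still end with one newline.
    """
    lines = text.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    return ("\n".join(lines) + "\n") if lines else ""


raw = "a_1\ta_2\n0101\t0201\n\n  \n"
cleaned = strip_trailing_blank_lines(raw)
print(repr(cleaned))  # 'a_1\ta_2\n0101\t0201\n'
```

To apply it to a file, read the file's contents, pass them through the function, and write the result back to a new file, keeping the original as a backup.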


^[1] These hardcoded numbers can be changed if you obtain the source code yourself, change the appropriate #define in emhaplofreq.h, and recompile the program.