Generating Correlated Data

The following is a reply to a query on how to generate correlated data. It was sent by David Nichols, at SPSS, and is the clearest answer I have seen to this frequently asked question. I have copied his reply and am making it available here simply to save you from having to search the archives of the edstat-l list. Don't give up when you come across "upper triangular Cholesky decomposition"; read on.

I have included a specific implementation of this idea (written for SPSS), which you can look at to see what's going on. It generates data from a population where all the pairwise correlations are .50.

Date: Fri, 20 Sep 1996 20:44:10 -0400
Reply-To: nichols@spss.com
Originator: edstat-l@jse.stat.ncsu.edu
Sender: edstat-l@jse.stat.ncsu.edu
Precedence: bulk
From: nichols@spss.com (David Nichols)
To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>
Subject: Re: generating correlated numbers
X-Comment: Statistics Education Discussion

In article <Santosh_Kumar-2009961451350001@medusa.cog.brown.edu>, Santosh Kumar <Santosh_Kumar@brown.edu> wrote:

hi: i need to generate numbers for two variables that have a particular correlation coefficient r. Is there an easy way to do this? (preferably using matlab, datadesk or spss).

To which David Nichols responded:

Do you mean that you want two variables created as if they were sampled from a population with a given correlation, or such that they have that precise value in the sample? Either case can be handled. A general way to do this is to begin with (pseudo) random numbers and use the following property: for a set of variables that are uncorrelated, either exactly in the sample or only in the population (as independent random numbers would be), a given correlation matrix can be imposed by postmultiplying the data matrix X by the upper triangular Cholesky decomposition of the correlation matrix R. For the case of two variables, this has a simple scalar solution that can easily be done in SPSS without having to deal with the MATRIX procedure.
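To see where the "simple scalar solution" comes from, it may help to write out the upper triangular Cholesky factor of a 2 x 2 correlation matrix. The Python sketch below is an illustration added here, not part of the original reply; the function names and the value r = 0.5 are assumptions chosen for the example. It builds the factor T in closed form and checks that T'T gives back the correlation matrix.

```python
import math

def upper_cholesky_2x2(r):
    """Upper triangular T with T'T = [[1, r], [r, 1]], valid for -1 < r < 1."""
    return [[1.0, r],
            [0.0, math.sqrt(1.0 - r * r)]]

def transpose_times_self(t):
    """Return T'T for a square matrix T stored as a list of rows."""
    n = len(t)
    return [[sum(t[k][i] * t[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

r = 0.5  # illustrative target correlation (an assumption for this sketch)
T = upper_cholesky_2x2(r)
R = transpose_times_self(T)

for row in R:
    print(row)  # rows of the rebuilt correlation matrix
```

Postmultiplying a data row (x, y) by this T leaves the first column as x and turns the second into x*r + y*sqrt(1 - r**2), which is why no matrix routine is needed when there are only two variables.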

Start with two variables created using the (pseudo) random normal option. If you want the "drawn from a population with correlation of r" version, skip the next step.

For a sample correlation of exactly r, take the two variables and run them through the FACTOR procedure, using a PC (principal components) extraction method, extracting both components, and saving the scores to the data file. These saved scores would be uncorrelated in the sample.
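For readers without SPSS, the effect of this step can be sketched in Python. Note that this is not the FACTOR procedure: as a stand-in, the sketch standardizes X and replaces Y with the standardized residual from regressing Y on X, which also yields two unit-variance variables whose sample correlation is zero. The function names, the seed, the sample size of 50, and r = 0.3 are all assumptions made for the illustration. The last lines apply the formula from the COMPUTE step described next in the reply.

```python
import math
import random

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    sd = math.sqrt(sum(c * c for c in centered) / (n - 1))
    return [c / sd for c in centered]

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def uncorrelated_scores(x, y):
    """Standardized x, plus the standardized residual of y regressed on x.

    A stand-in for saved principal component scores: the two outputs have
    unit variance and a sample correlation of zero (up to rounding).
    """
    zx, zy = standardize(x), standardize(y)
    slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
    residual = [b - slope * a for a, b in zip(zx, zy)]
    return zx, standardize(residual)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw_x = [rng.gauss(0, 1) for _ in range(50)]
raw_y = [rng.gauss(0, 1) for _ in range(50)]

x, y = uncorrelated_scores(raw_x, raw_y)
print(pearson(x, y))  # zero up to floating-point rounding

r = 0.3  # illustrative target correlation
new_y = [r * a + math.sqrt(1 - r * r) * b for a, b in zip(x, y)]
print(pearson(x, new_y))  # equals r up to floating-point rounding
```

The exact result depends on the inputs being uncorrelated in the sample and having equal variance, which is what the saved scores provide.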

Let's say the two variables have been named (in either case above) X and Y. To create the desired correlation, create a new Y as:

COMPUTE Y=X*r+Y*SQRT(1-r**2).

where r is the desired correlation value. If you ran the FACTOR step, X and Y will now have exactly the desired correlation in the sample. If you skipped it, the sample correlation will vary from run to run, but if you do this a large number of times, the distribution of correlations will be centered approximately on r.
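The "drawn from a population" case can be checked by simulation. The following Python sketch is an added illustration, not part of the original reply: it draws independent standard normal X and Y, applies the formula above, and repeats the process many times to look at the spread of sample correlations. The function names, the seed, the sample size of 100, and the 2000 repetitions are assumptions chosen for the example.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pair(n, r, rng):
    """Draw n cases of (X, Y) from a population with correlation r."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    e = [rng.gauss(0, 1) for _ in range(n)]  # independent of x
    y = [r * a + math.sqrt(1 - r * r) * b for a, b in zip(x, e)]
    return x, y

rng = random.Random(1996)  # fixed seed so the sketch is repeatable
r = 0.5                    # illustrative population correlation
corrs = [pearson(*correlated_pair(100, r, rng)) for _ in range(2000)]

mean_corr = sum(corrs) / len(corrs)
print("mean of sample correlations:", round(mean_corr, 3))
print("smallest and largest:", round(min(corrs), 3), round(max(corrs), 3))
```

Any single sample will miss r by some amount; only the average over many samples settles close to it.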

The more general version of this simply requires a matrix of variables X to be postmultiplied by the Cholesky decomposition of R, the desired correlation matrix. Assuming variables A to Z in an SPSS data file, use

MATRIX.

GET X /VAR=A TO Z.

COMPUTE R={ }.

COMPUTE NEWX=X*CHOL(R).

END MATRIX.

where inside the curly brackets you define the structure of R. NEWX can be saved to a file, if desired, with a SAVE statement placed before END MATRIX.
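For comparison outside SPSS, the general version can be sketched in plain Python. This is an added illustration under stated assumptions, not the author's implementation: upper_cholesky, postmultiply, and pearson are hypothetical helper names, and the 3 x 3 matrix with all pairwise correlations of .50 is chosen only as an example. The helper returns an upper triangular T with T'T = R, matching the decomposition described in the reply.

```python
import math
import random

def upper_cholesky(r):
    """Upper triangular T with T'T = r, for a symmetric positive definite r."""
    n = len(r)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = r[i][j] - sum(t[k][i] * t[k][j] for k in range(i))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                t[i][i] = math.sqrt(s)
            else:
                t[i][j] = s / t[i][i]
    return t

def postmultiply(data, t):
    """Return data * t, where data is a list of rows (cases)."""
    cols = len(t[0])
    return [[sum(row[k] * t[k][j] for k in range(len(t))) for j in range(cols)]
            for row in data]

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative target: three variables, all pairwise correlations .50.
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
T = upper_cholesky(R)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5000)]
NEWX = postmultiply(X, T)

columns = list(zip(*NEWX))  # one tuple per variable
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, round(pearson(columns[i], columns[j]), 2))
```

Because the columns of X here are uncorrelated only in the population, the printed correlations are close to .50 rather than exactly .50.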

-----------------------------------------------------------------------------

David Nichols Senior Support Statistician SPSS, Inc.

Phone: (312) 329-3684 Internet: nichols@spss.com Fax: (312) 329-3668

-----------------------------------------------------------------------------