In this chapter I want to discuss various tests involving t, and for that I need examples. But I don't want to "invade" the contents of Statistical Methods for Psychology, 8th ed. These pages are not meant to replace that book, which stands on its own, but I do want to cover the same general material. I will begin with a set of data from a study by Brian Everitt on the treatment of anorexia. The data can be found in Hand et al. (1994), A Handbook of Small Data Sets, Chapman & Hall. To begin with I am only looking at the group that underwent family therapy, though I first read the whole file. The code below will read the data from all three groups and then select out the subjects in the Family Therapy group.
But before I begin, I want to illustrate the idea of the sampling distribution of the mean, and show how that distribution narrows as the sample size increases. So let's set Everitt's data aside for just a moment.
First I will create a very large population of scores generated at random from a chi-square distribution. (I chose that distribution because it is far from normal, especially with small df.) It will have a mean of 4.94 and a standard deviation of 3.17. I do this to illustrate that a distribution of sample means will approach normal even if the population is skewed.
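For reference, the theoretical mean of a chi-square distribution equals its df and its standard deviation equals sqrt(2*df), so with df = 5 we expect values near 5 and 3.16. Here is a quick check in R (this little snippet is my addition, not part of the original code):

df <- 5
c(mean = df, sd = sqrt(2 * df))   # 5.00 and 3.162 -- close to the sampled
                                  # values of 4.94 and 3.17 reported above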
I present the complete code for plotting the results below, and will work through it sequentially.
rm(list = ls())
### Let y stand for the population of 10,000 numbers
nreps <- 10000
y <- rchisq(nreps, df = 5)
# Plot distribution of y
hist(y, col = "blue", density = 10,
     main = "Chi-square Distribution (5 df)", breaks = 50)
# Sample y repeatedly with sample sizes of 5 and 30
means5 <- numeric(nreps); means30 <- numeric(nreps)
for (i in 1:nreps) {
    means5[i]  <- mean(sample(y, 5))
    means30[i] <- mean(sample(y, 30))
}
x <- seq(0, 10, .1)   # Preparatory to fitting a normal distribution
xbar5  <- round(mean(means5), 2);  sd5  <- round(sd(means5), 2)
xbar30 <- round(mean(means30), 2); sd30 <- round(sd(means30), 2)
par(mfrow = c(2,2))   # Set up 2 x 2 plotting
hist(means5, col = "blue", density = 10,
     main = "Sampling Distribution\n of mean, n = 5", breaks = 20,
     freq = FALSE, ylim = c(0, .40), xlim = c(0, 15), xlab = "Mean")
legend(6, .35, paste("mean =", xbar5), cex = .85, bty = "n")
legend(6, .32, paste("sd =", sd5), cex = .85, bty = "n")
legend(6, .29, "n = 5", cex = .85, bty = "n")
par(new = TRUE)   # This will superimpose the following plot.
normdens <- dnorm(x, mean = xbar5, sd = sd5)   # Add a normal distribution
plot(normdens ~ x, type = "l", ylim = c(0, .4), xlim = c(0, 15),
     xlab = "", ylab = "")
qqnorm(means5, main = "Q-Q plot for n = 5")
qqline(means5)
##### Now repeat this with n = 30.
hist(means30, col = "blue", density = 10,
     main = "Sampling Distribution \n of mean, n = 30", breaks = 20,
     freq = FALSE, xlim = c(2, 8), ylim = c(0, .75), xlab = "Mean")
legend(5.5, .6, paste("mean =", xbar30), cex = .85, bty = "n")
legend(5.5, .5, paste("sd =", sd30), cex = .85, bty = "n")
legend(5.5, .4, "n = 30", cex = .85, bty = "n")
par(new = TRUE)
normdens <- dnorm(x, mean = xbar30, sd = sd30)
plot(normdens ~ x, type = "l", ylim = c(0, .75), xlim = c(2, 8),
     xlab = "", ylab = "")
qqnorm(means30, main = "Q-Q plot for n = 30")
qqline(means30)
With the code above I created a huge population of scores from which to sample. As I said, I used a chi-square distribution because I wanted to illustrate what happens when you work with a non-normal (non-symmetrical) distribution. I drew 10,000 samples (first with n = 5 and then again with n = 30) from this distribution. My goal was to illustrate how the distribution of means changes with increases in sample size. Notice that as the sample size increases, the spread of the distribution of means decreases and the distribution moves very much toward normal. Notice the Q-Q plots, which illustrate that with increased sample size the distribution of sample means becomes noticeably more normal. Notice also that I used par(mfrow = c(2,2)). I frequently use par(mfrow = c(2,1)), or some other combination of numbers, to cut the output screen to the desired shape; this particular command creates a screen with 2 rows and 2 columns. Finally, notice that I have occasionally condensed the code slightly by combining two commands on the same line with a semicolon between them.
I believe that this is the first time that I have used a command like the following:
for (i in 1:nreps) {
    means5[i]  <- mean(sample(y, 5))
    means30[i] <- mean(sample(y, 30))
}
In the old days this was called a "do loop." It begins with i = 1 and computes means5[1] (the first mean of 5 observations) and means30[1]. Then it sets i to 2 and computes means5[2] and means30[2]. Next it computes means5[3], and so on, all the way up to means30[nreps], where nreps = 10,000. So we have 10,000 means of samples of size 5 and 10,000 means of samples of size 30. The program then moves on and works with those means. It is a very efficient way of doing something 10,000 times and saving the 10,000 answers. The commands that it repeats 10,000 times are those between the { }. Before I can run this, however, I have to create the variables (means5 and means30) that will store the means that I calculate. I also have to create a variable (here called y) that R can sample from to get the data for each trip through the loop.
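As an aside, base R's replicate() will do the same thing without an explicit loop or pre-allocation. This version (my addition, not Howell's code) produces the same two vectors:

# replicate() runs the expression nreps times and collects the results
means5  <- replicate(nreps, mean(sample(y, 5)))
means30 <- replicate(nreps, mean(sample(y, 30)))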
After I have run through the loop and have the 10,000 means for each sample size, I plot them with a histogram. The resulting means run from roughly 0 to 11, so I create a variable (x) that contains the sequence of values from 0 to 10 in increments of 0.1. Then, given the mean and standard deviation of the obtained means, I compute the height of the normal distribution for each value of x. Once I have that I can plot those heights as a function of x, which superimposes the normal distribution on top of the histograms that I generated from the chi-square population. The code for this is included in what was shown above. The resulting plots look like this:
Now we can return to the Everitt data on anorexia. I want to show a bit more about plotting data, using data from only the Family Therapy group. I could either enter the data directly into the program by hand, or I could take the full set of data that we read in and select out only those observations that belong to that group. I chose the latter approach.
Notice that after I plot the histogram I set par(new = TRUE), which allows the next plot to be superimposed on the first one. I superimposed a normal distribution. Then I used the density() function to superimpose a best-fitting curve, which takes into account that there is a bump in the lower half of the distribution.

### Assume that the data have been loaded and the steps
### programmed above have been carried out.
setwd("../DataFiles")
Everitt.data <- read.table("EverittFullData.dat", header = TRUE)
FT <- subset(Everitt.data, Group == "Fam")   # Note "==", not "="
# The variables used below. (I am assuming here that the file
# contains columns named Before and After.)
FT.Before <- FT$Before
FT.After  <- FT$After
FT.Gain   <- FT.After - FT.Before
par(mfrow = c(2,1))
hist(FT.Gain, freq = FALSE, ylim = c(0, .08), xlim = c(-10, 25))
par(new = TRUE)   # Plot on top of the last plot.
curve(dnorm(x, mean = mean(FT.Gain), sd = sd(FT.Gain)),
      col = "darkblue", lwd = 2, add = TRUE, yaxt = "n")
par(new = TRUE)
plot(density(FT.Gain), main = "", ylim = c(0, .08),
     xlim = c(-10, 25), xlab = "")
I next show a plot of the relationship between the amount of weight a girl gained and the weight at which she started treatment. Does a girl who started out weighing somewhat more than others also gain more than others? (The answer, you will see, is no.) Most of this is clear, but I want to make one point about drawing the regression line (the straight line that best fits the data). We first have to calculate what the regression line would be. We do that with reg <- lm(Gain ~ Before), which you have not seen before and which is discussed in Chapter 9. When that code runs it will create an "object" named "reg." That object will contain the coefficients for the equation of the best-fitting regression line, the residuals, the fitted values (which you don't yet need to know about), and all sorts of other stuff. When I type abline(reg), R knows to look in reg, find the regression coefficients, compute the form of the straight line, and draw it on the figure. That's pretty clever. Notice how I add the correlation in the legend.
rm(list = ls())
par(mfrow = c(2,2))
data <- read.table("Tab6-1.dat", header = TRUE)
Before <- data$Before
After  <- data$After
Gain   <- After - Before
mean(Gain)
plot(Before, Gain, main = "Correlation between Amount Gained \n and Pretreatment Weight")
reg <- lm(Gain ~ Before)
abline(reg)
correl <- round(cor(Before, Gain), 3)   # Needed for the legend below
legend(85, 18, paste("r = ", correl), bty = "n")
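If you are curious about that "other stuff," a few standard base R calls (my addition, not part of the original code) let you peek inside reg:

names(reg)            # the components stored in the lm object
coef(reg)             # the intercept and slope that abline() uses
head(residuals(reg))  # the first few residuals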
Now I want to plot the distribution of the Family Therapy data, and then use standard confidence intervals and violin plots to explore the data a bit more. The error.bars() command includes all sorts of stuff between the parentheses, not all of which is needed. I suggest typing ?error.bars to find out what all of it is about. You can easily adapt this to handle the other two conditions. I also include the paired-sample t test, first using difference scores and then again using the pre- and post-treatment scores.
### Now let's look at Gain with Family Therapy, the t test, and the error bars
library(psych)   # Needed for "error.bars"
par(mfrow = c(2,3))
hist(FT.Gain)
error.bars(FT.Gain, eyes = FALSE, xlab = "", ylab = "Weight Gain",
           ylim = c(-1, 15), main = "95% Confidence Limits on \n Weight Gain",
           arrow.col = "red", alpha = .05)
legend(.5, 1.3, "H(0)", bty = "n")   # Coordinates are from trial and error
error.bars(FT.Gain, eyes = TRUE, xlab = "", ylab = "Weight Gain",
           ylim = c(-1, 15), main = "95% Confidence Limits on \nWeight Gain",
           col = "lightblue", alpha = .05)
legend(0, .9, "______________________________________", bty = "n")
legend(.5, 1.3, "H(0)", bty = "n")

### Now to the t test
t.test(FT.Gain, alternative = "two.sided", mu = 0, conf.level = .95)

### OR, EQUIVALENTLY, USING BEFORE AND AFTER INSTEAD OF GAIN,
t.test(x = FT.After, y = FT.Before, paired = TRUE,
       alternative = "two.sided", mu = 0, conf.level = .95)

__________________________________________________________________

        One Sample t-test

data:  FT.Gain
t = 4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  3.58470 10.94471
sample estimates:
mean of x
 7.264706

__________________________________________________________________

        Paired t-test

data:  FT.After and FT.Before
t = 4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.58470 10.94471
sample estimates:
mean of the differences
               7.264706
__________________________________________________________________
This code shows several things: 1) the plot of error bars, including violin plots; 2) the calculation of t tests for two related, or dependent, samples, either by working with the "before" and "after" data or by working with "gain" scores; and 3) the fact that the t.test() function produces 95% confidence intervals by default, although you can change that to any interval you want. (Just enter "?t.test" from your console.) I loaded the "psych" library because it handles error bars nicely.
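As an illustration of that last point, here is a minimal sketch (my addition, assuming FT.Gain is still in your workspace) of changing the confidence level and pulling individual pieces out of the result:

res99 <- t.test(FT.Gain, conf.level = .99)   # a 99% interval instead of 95%
res99$conf.int                               # just the confidence limits
res99$p.value                                # just the p value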
Those data were for the Family Therapy group. Now I will enter data for the Cognitive-Behavior therapy group and the Control group. We can perform the same graphical analyses on these.
If you are using the code from Chapter6.R, you will already have read in the full dataset as anorexia.Data.
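If so, you could also pull out the other two groups by subsetting, just as we did for the Family Therapy group, rather than typing the numbers in by hand. A sketch follows, with the caveat that the group codes other than "Fam" are my guesses; check unique(anorexia.Data$Group) for the codes actually in the file.

# The codes "Cont" and "CBT" are assumptions, not taken from the file.
Control.data  <- subset(anorexia.Data, Group == "Cont")
CogBehav.data <- subset(anorexia.Data, Group == "CBT")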
Control.Before <- c(80.7, 89.4, 91.8, 74.0, 78.1, 88.3, 87.3, 75.1, 80.6, 78.4,
                    77.6, 88.7, 81.3, 78.1, 70.5, 77.3, 85.2, 86.0, 84.1, 79.7,
                    85.5, 84.4, 79.6, 77.5, 82.3, 89.0)
Control.After <- c(80.2, 80.1, 86.4, 86.3, 76.1, 78.1, 75.1, 86.7, 73.5, 84.6,
                   77.4, 79.5, 89.6, 81.4, 81.8, 77.3, 84.2, 75.4, 79.5, 73.0,
                   88.3, 84.7, 81.4, 81.2, 88.2, 78.8)
Control.Gain <- Control.After - Control.Before
CogBehav.Before <- c(80.5, 84.9, 81.5, 82.6, 79.9, 88.7, 94.9, 76.3, 81.0, 80.5,
                     85.0, 89.2, 81.3, 76.5, 70.0, 80.4, 83.3, 83.0, 87.7, 84.2,
                     86.4, 76.5, 80.2, 87.8, 83.3, 79.7, 84.5, 80.8, 87.4)
CogBehav.After <- c(82.2, 85.6, 81.4, 81.9, 76.4, 103.6, 98.4, 93.4, 73.4, 82.1,
                    96.7, 95.3, 82.4, 72.5, 90.9, 71.3, 85.4, 81.6, 89.1, 83.9,
                    82.7, 75.7, 82.6, 100.4, 85.2, 83.6, 84.6, 96.2, 86.7)
CogBehav.Gain <- CogBehav.After - CogBehav.Before
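Although the output is not shown here, the same paired t test can be run on each of these groups in one line apiece; a minimal sketch using the variables just created:

t.test(Control.After,  Control.Before,  paired = TRUE)
t.test(CogBehav.After, CogBehav.Before, paired = TRUE)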
I used code very similar to the above to do the same thing with the distribution of variances. I wanted to show you that the distribution of the variance is very skewed, which can be seen in both the histogram and the Q-Q plot. To create that code you basically need to change "mean" to "var" and make a few other minor adjustments. I do not show the code here, but you can see it if you download the R file at www.uvm.edu/dhowell/methods9/R-Code/Chapter5.R.
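If you would rather not download the file, the change is small enough to sketch here. This is my own version, not the one in the R file, and it assumes that y and nreps from the earlier code are still in your workspace:

### Sample variances instead of means; the skew shows up clearly.
vars5 <- replicate(nreps, var(sample(y, 5)))
par(mfrow = c(1, 2))
hist(vars5, breaks = 30, main = "Sampling Distribution\n of variance, n = 5",
     xlab = "Variance")
qqnorm(vars5, main = "Q-Q plot for variances, n = 5")
qqline(vars5)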
Next I want to use the data from Zhong to plot the error bars for the Control and Experimental group means, and also for the mean difference. Notice that I did something new here. I didn't want to use that awful attach(), so I used with(). The structure is with(dataframe, {a bunch of code}). This runs the code on the data frame, similar to what you would get with attach(), but safer. The code follows. Notice that I prevented overlapping labels by setting "main" and "ylab" to blanks in the second section. (This is an example of code that I would copy and file away somewhere so that I can easily pull it out in the future. It is equivalent to replacing with(Zhong.data, { with attach(Zhong.data), followed by the instructions about error bars, etc., and then replacing the }) with detach(Zhong.data).)
### Plot confidence intervals for Control and Experimental
library(psych)
par(mfrow = c(1,1))
Zhong.data <- read.table("Tab6-5.dat", header = TRUE)
describe(Zhong.data)
with(Zhong.data, {
    error.bars.by(Response, Condition, by.var = TRUE, eyes = FALSE,
                  lines = FALSE, ylim = c(0,8), xlab = "Condition",
                  ylab = "Political Attitude Score or Difference",
                  main = "Confidence Intervals for Ctl and Trt\n and for Mean Difference")
})
### Plot confidence interval on the mean difference of Control and Experimental
diff.stats <- data.frame(mean = 0.68, se = 0.218, n = 61)
rownames(diff.stats) <- "Difference"
par(new = TRUE)
error.bars(stats = diff.stats, xlab = "", ylab = "", eyes = FALSE,
           ylim = c(0,8), main = "")
Now we come to some interesting graphics. I simply wanted to show you where the t distributions in Figure 6.3 came from. I used the function dt(x, df = 5) to ask for the density (height) of the t distribution with varying degrees of freedom. You can't tell the function to use an infinite df, so I set it at 1000. You couldn't possibly see the difference between that one and one that really did have an infinite number of degrees of freedom; at that point the t and normal distributions are identical.
# Plotting t distributions for various df
par(mfrow = c(1,1))
curve(dt(x, df = 1000), xlim = c(-5,5), ylim = c(0, .5))
par(new = TRUE)
curve(dt(x, df = 30), xlim = c(-5,5), ylim = c(0, .5))
par(new = TRUE)
curve(dt(x, df = 5), xlim = c(-5,5), ylim = c(0, .5))
legend(.60, .47, "df", bty = "n", cex = 1.7)
legend(.88, .45, "___", bty = "n")
legend(.83, .43, expression(infinity), cex = 1.4, bty = "n")
legend(.8, .41, "30", bty = "n", cex = 1.3)
legend(1, .39, "5", bty = "n", cex = 1.3)
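To back up the claim that df = 1000 is effectively infinite, here is a quick numerical check (my addition): the largest discrepancy between the t density with 1000 df and the standard normal density is on the order of .0001, far too small to see in a plot.

x <- seq(-5, 5, .01)
max(abs(dt(x, df = 1000) - dnorm(x)))   # roughly .0001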
Now we finally get serious about using t, confidence intervals, and effect sizes. Before I do that, the accompanying code does a couple of other things. One is to repeat the test on Everitt's data, and another is to show how samples bounce around with repeated sampling. The latter is similar to Figure 6.5, where we repeatedly draw from a "population" that reflects the original data. Then we continue by looking at a t test on two independent samples, plotting confidence intervals on mean differences, calculating d, and plotting confidence intervals on d. I don't show those graphics here, but you can see them just by running the code. Finally, I took some data from a study by Helzer and Pizarro (2011), which was another look at the concept of moral purity. (These data are discussed in the text.) I present a t test on those data. Here I have drawn the "forest plot" of the amalgamated results, but I have not shown the "funnel plot" because I have not described it and won't until Chapter 17.
# Figure 6.7  Plotting confidence intervals over repeated sampling
j <- 24
xx <- seq(-5, 25, .10)    # Setting up the axes
yy <- seq(2, 9.5, .025)
n <- 17                   # Sample size
plot(yy, xx, type = "n", xlab = "Confidence Interval", ylab = "Sample",
     ylim = c(0, 25), bty = "n", xlim = c(-5, 25))
lines(c(7.26, 7.26), c(25, 0))
legend(6.5, 25, legend = expression(mu), bty = "n")
for (i in 1:25) {
    x <- rnorm(n, 7.26, 7.16)
    se <- sd(x)/sqrt(n)
    b <- qt(c(.025, .975), n - 1)   # b[1] = lower t cutoff, b[2] = upper
    lower <- mean(x) + b[1]*se
    upper <- mean(x) + b[2]*se
    lines(c(lower, upper), c(j, j), col = "red")
    j <- j - 1
}
lines(c(0, 0), c(0, 25), col = "green")

#####################################################################
### Plot confidence intervals for the Control and Experimental
### conditions.
library(psych)
par(mfrow = c(1,2))
with(Zhong.data, {   # with() again, so that Response and Condition are found
    error.bars.by(Response, Condition, by.var = TRUE, eyes = FALSE,
                  lines = FALSE, ylim = c(0,8), main = "95% CI by Group",
                  xlab = "Condition", ylab = "Political Attitude Score")
})

#####################################################################
### Plot confidence interval on the mean difference
diff.stats <- data.frame(mean = 0.68, se = 0.218, n = 61)
rownames(diff.stats) <- "Difference"
error.bars(stats = diff.stats, xlab = "", ylab = "Mean Difference",
           eyes = FALSE, ylim = c(0,8), main = "95% CI \non Mean Difference")

#####################################################################
### CI on d
library(MBESS)
conf.intervald <- ci.smd(smd = .80, n.1 = 31, n.2 = 30, conf.level = .95)
cat("The confidence interval on our estimated effect size is \n")
print(conf.intervald)

### Plotting CI on d using plotrix
library(plotrix)
plotCI(x = 0.80, y = 0, ui = 1.319, li = 0.275, err = "x",
       ylim = c(-5, 5), xlim = c(0, 1.5), xlab = "Effect Size", ylab = "",
       main = "95% CI on Effect Size d", slty = "dashed")
legend(0.2, 4, legend = "d = 0.80", bty = "n")
legend(0.2, 3, legend = "Upper Limit = 1.319\nLower Limit = 0.275", bty = "n")

#####################################################################
# Zhong, Bohns, & Gino -- Good lamps are the best police
lightmean <- 7.78; dimmean <- 11.47
lightsd   <- 3.09; dimsd   <- 4.32
n <- 42
# lightsd is the desired sd; sd(light) is the sd actually obtained, so
# rescale and recenter the random data to hit the targets exactly.
light <- rnorm(n, lightmean, lightsd)
light <- light * lightsd / sd(light)         # force sd to lightsd
light <- light - (mean(light) - lightmean)   # force mean to lightmean
mean(light); sd(light)
dim <- rnorm(n, dimmean, dimsd)
dim <- dim * dimsd / sd(dim)
dim <- dim - (mean(dim) - dimmean)
mean(dim); sd(dim)
t.test(light, dim)
t.test(light, dim, var.equal = TRUE)

__________________________________________________________________

        Welch Two Sample t-test

data:  light and dim
t = -4.5024, df = 74.25, p-value = 2.447e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.322918 -2.057082
sample estimates:
mean of x mean of y
     7.78     11.47
__________________________________________________________________
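Because the chapter makes a point of effect sizes, it is worth seeing how a d like the 0.80 fed to ci.smd() above would be computed by hand for a comparison such as light versus dim. A sketch using the usual pooled standard deviation (my own arithmetic, not from the R file):

sp <- sqrt((lightsd^2 + dimsd^2) / 2)   # pooled sd; equal n's make this a simple average
d  <- (dimmean - lightmean) / sp        # about 0.98 for these data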
That is enough for now. The file Chapter6.R contains a few additional things that I have not discussed here.