More Examples Using R

Introduction

The following discussion assumes that you have downloaded and installed the editor RStudio. If you have not, you can still use what follows by using the "new script" command from the R screen. RStudio just makes things a great deal easier, and is what I used to create the images below.

Now that you have downloaded and installed R and RStudio, and looked at some simple examples and a page on how to read or enter data, start up RStudio. When you start RStudio, R opens automatically. Your screen should look something like the following. I suggest that you drag the borders to make them a bit narrower. It just saves aggravation.

We will start with something simple, and it won't be "Hello World." In Chapter Two of my book, Statistical Methods for Psychology, 9th ed., I refer to a study by Langlois and Roggman on attractiveness ratings assigned to photographs. The data for 20 participants follow.

   1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 2.55,
   2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02

We need to read these data into R so that we can work with them. There are only 20 pieces of data, so we can enter them directly rather than creating and reading a data file. On the RStudio screen on the upper left enter

 data4 <- c(1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 
 2.55, 2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02)
   

You don't have to lay the data out as neatly as I have. Type until you come near the end of a line, insert a comma as needed, hit return (enter), and keep typing. Don't forget the closing parenthesis followed by a carriage return. The "c" command means "concatenate", so the variable data4 will be a set of 20 numbers. If you see a plus sign on the left margin, that is just R's way of indicating a continuation line.
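As a small sketch of what c() does, the same 20 values can be built from two shorter vectors and then checked with length(). The names first_half and second_half are mine, made up for illustration; they are not part of the original example.

```r
# The 20 ratings split into two shorter vectors (hypothetical names).
first_half  <- c(1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 2.55)
second_half <- c(2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02)

# c() strings its arguments together into one vector.
data4 <- c(first_half, second_half)

length(data4)   # how many values are stored
data4[1:3]      # the first three values
```

If length(data4) does not come back as 20, you probably dropped a comma or a value while typing.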

If you entered the data in RStudio, put your cursor on the first line and click on the "Run" button, or press Command-Enter (Ctrl-Enter on Windows), once per line. The results will appear in the lower window.

That probably doesn't leave you all excited about your programming skills, so let's go a step further. If you type

    xbar <- mean(x)
    print(xbar)

you will see

    > xbar <- mean(x)
    > print(xbar)
    [1] 8.571429

Notice that R prints out your commands (preceded by the ">" prompt) as well as the result. We can alter this design later.

What has happened here is that long ago someone wrote a function to calculate the mean of whatever it was fed. Not surprisingly, they named the function "mean()." So R grabs x, trots off to that function, and comes back with a variable named "xbar". (We could have named it diddly-doop if we wanted to.) The print command then tells R to print out xbar.
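To make the idea of a function concrete, here is a rough sketch of a home-made version. The name my_mean is made up, and R's real mean() does more than this (it can handle missing values, for example). I use the attractiveness data from the start of this page rather than x.

```r
# Attractiveness ratings for the 20 participants.
data4 <- c(1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 2.55,
           2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02)

# A hypothetical home-made mean: add the values up, divide by how many.
my_mean <- function(values) {
  sum(values) / length(values)
}

xbar <- my_mean(data4)   # R hands data4 to the function and stores the answer
print(xbar)
print(mean(data4))       # the built-in function should agree
```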

But wait! Earlier I just typed "x" and R printed out x, but here I typed "print(xbar)." What is the difference? Well, here there is no difference. When you are working on the command line, or when you are working in R and submitting your code as you go along, you don't need the print command. BUT, suppose that we write this stuff in RStudio and save it to a file named FirstRprog.R. Then we use the drop-down menu in R to select File/source R code and go to the file that we created. R will run that program, but perhaps all that you will see is "[1] 8.571429". You won't see the code and you won't see x. You will only see what the print command told it to print. (I said "perhaps" because this depends on the editor and on how it is set up.) When we are writing code we often just name a variable and R prints it out. But if we think that we are going to save the code and run it as a program, then we should wrap "print( )" around what we want printed out.

By the way, "source" in the above is meant as a verb. When someone on a help page says "source your file," they mean that you should submit the file. (Unix types often use weird grammar. Yeah, I know that I often pick on the Unix guys, but they often sneer at the rest of us.)
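Here is a small sketch of that difference, which you can try without saving anything permanent. The program is written to a temporary file that stands in for FirstRprog.R, and the four values of x are just illustrative, so the mean printed here is 7 rather than the value shown above.

```r
# Write a tiny program to a temporary file (a stand-in for FirstRprog.R).
prog <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(4, 7, 8, 9)",
  "xbar <- mean(x)",
  "xbar",           # just naming the variable: silent when sourced
  "print(xbar)"     # explicit print: shows up when sourced
), prog)

source(prog)                # prints [1] 7 once, from print(xbar)
source(prog, echo = TRUE)   # also echoes each command and its result
```

With the default settings, source() shows only what print() was asked to print. Adding echo = TRUE makes it behave more like typing the lines yourself.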

Let's back up a bit--Entering data

There are several different ways of entering data, but we are only going to touch on two of them. One you have just seen, which is to use the "x <- c(4,7,8,9)" command. You can do that for all of your variables if you want to, but that becomes a nuisance. The other way is to take any old text editor (Notepad will even do) and create a file with the data in different columns. You can put a tab or a couple of spaces between columns, but try to make them look neat. I strongly suggest that the first row of data be the variable names. For example, your file might look like


which contains the data for Table 7.7 in the text. Once you have entered your data and saved it, you can enter the command


data1 <- read.table(file.choose(), header = TRUE) 

The file.choose() function opens a dialog box so that you can hunt around for the file you want. When you find it, just click on it and it will be read in and known within R as data1. The header = TRUE argument tells R that the first line contains variable names. Now rather than print out the whole file to see what we have, we can type

   head(data1)

and get just the first six rows.

[Figure: data file #1]
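If you want to try read.table() without hunting for a file, here is a minimal sketch. The data and the column names Group and Score are made up for illustration (they are not Table 7.7), and I give the file name directly instead of using file.choose(), which needs someone sitting at the keyboard.

```r
# A made-up data file with the variable names in the first row.
datafile <- tempfile(fileext = ".dat")
writeLines(c(
  "Group  Score",
  "1      4",
  "1      9",
  "2      12",
  "2      8"
), datafile)

# header = TRUE: take the column names from the first line of the file.
data1 <- read.table(datafile, header = TRUE)

head(data1)    # the first rows (at most six)
names(data1)   # the variable names R picked up
```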

Let me digress a second to point out that if you are in RStudio you can go to "Tools/Global Options/General" and set the default working directory. (See what I say about this below.) Then when you name a file, or need to "choose" a file, RStudio begins by looking in that directory. That can save you a lot of hunting.

Now you have your data read in, but perhaps not quite in the way you expect. One of your variables is named Score, but if you ask R to type out its values you will get

  
  > Score
   Error: object 'Score' not found

The problem (if there is one) is that data1 is what is called a data frame. A data frame is basically a file with a bunch of columns, and Score is part of that file. You could use that awful attach() command, which would make those variables available (and all set to cause trouble), but I strongly recommend against it. For more about "attach()," see attaching.html. You are better off using "data1$Score," if you had named the data frame "data1," and everything will be fine. So add the name of the data frame to the variable name by typing


   > data1$Score
    [1]  4  9 12  8  9 13 12 13 13  7  6  7  8  7  2  6  9  7 10  
	   5  0 10  8
   > 

then Score will be a legitimate variable by itself and you can now print it out as I did here. This is true whenever you have a data frame. You need to either use the "data1$subject" convention or attach the data frame. If you haven't read my rant on attach(), go to my discussion at "attach()".

If you do attach a data frame, R may tell you that an object has been "masked." Most of the time that is not a problem because you are just masking the one variable with an exact copy. But the message will often lead you to think that you have made an error.
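The "data1$Score" convention can be sketched with a small made-up data frame. The values and the Group column are hypothetical. The with() function shown at the end is another standard way to reach the columns without attaching anything.

```r
# A small, made-up data frame standing in for data1.
data1 <- data.frame(
  Group = c(1, 1, 2, 2),
  Score = c(4, 9, 12, 8)
)

data1$Score                # the Score column, pulled out by name
mean(data1$Score)          # use it like any other variable

# with() looks up Score inside data1 for this one expression only.
with(data1, mean(Score))
```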

Other methods of data entry

There are many other ways to enter data. One is by way of an Excel spreadsheet. Another is by way of the edit command. Try the following commands, one at a time, on the command line to see what happens.

   
   edit(data1)
   newfile <- edit(data.frame())
   write.table(data1, file = "Newfile.dat", row.names = FALSE)

The last command will create a file named Newfile.dat, but be sure to give it a more complete address or you won't know where it will end up. BUT having to type in the full search path is a pain. Let me repeat what I said earlier. Assume that you have created a directory (folder) called "Learning-R." I don't care where you create it, but probably in your documents folder. Now go to the RStudio console (where you have been seeing these results). In the menu click on "Session," select "Set Working Directory," and click on "Choose Directory." Navigate to the Learning-R folder and click OK. Now that is your default directory, and if you just have 'file = "Newfile.dat"', it will be sought in that directory. Much easier! That is also the first folder that will open if you use "file.choose()."
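The same thing can be done from the command line with setwd() and getwd(). This is only a sketch: the Learning-R folder is created in a temporary location so that nothing is left lying around, and the small data frame is made up.

```r
# A stand-in for the Learning-R folder, created in a temporary location.
learning_dir <- file.path(tempdir(), "Learning-R")
dir.create(learning_dir, showWarnings = FALSE)

old_dir <- setwd(learning_dir)   # setwd() hands back the previous directory
getwd()                          # confirm where we are now

# A made-up data frame to write out.
data1 <- data.frame(Group = c(1, 1, 2, 2), Score = c(4, 9, 12, 8))

# A bare file name now lands in the working directory.
write.table(data1, file = "Newfile.dat", row.names = FALSE)
file.exists(file.path(learning_dir, "Newfile.dat"))

# Reading it back needs only the bare file name as well.
data2 <- read.table("Newfile.dat", header = TRUE)

setwd(old_dir)                   # go back to where we started
```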

Other Simple Commands

We are not limited to just printing out means. There are lots of other descriptive statistics that have their own functions. Some of these are shown below along with the results that R prints out.

   > mean(data4)
   [1] 2.6445

   > length(data4)  # reports the number of observations in data4
   [1] 20

   > var(data4)
   [1] 0.4292892

   > sd(data4)
   [1] 0.6552017

   > hist(data4)
   
   


I cheated here a little bit. The graphic may not come out on your R console. It may come out in its own window. (I don't know why this sometimes happens. It may be related to the operating system; I know that there was a problem with version 98 on a Mac.) You may have to hunt around on your screen. Or go to the R icon on the bottom of your screen, click on it and select the image of the graphic.
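If the graphic goes missing, one workaround is to look at the numbers behind the histogram, or to send the plot to a file whose location you know. This is a sketch under my own assumptions: the file name data4_hist.pdf is made up, and the file is written to R's temporary directory.

```r
# Attractiveness ratings for the 20 participants.
data4 <- c(1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 2.55,
           2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02)

# plot = FALSE returns the histogram's numbers without drawing anything.
h <- hist(data4, plot = FALSE)
h$breaks   # the edges of the bars
h$counts   # how many observations fall in each bar

# Draw the histogram into a PDF file instead of a window.
plotfile <- file.path(tempdir(), "data4_hist.pdf")   # hypothetical file name
pdf(plotfile)
hist(data4)
dev.off()      # close the file so that it is actually written
plotfile       # where to find it
```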

You Actually Know More than You Think

Those few commands that you just saw will take you a long way. You could, for example, do many of the exercises in Chapter Two without learning any more. You could probably guess at a few other functions such as median(data4). Try typing sqrt(data4). I bet that isn't quite what you thought that you would get.
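Here is a short sketch of how far those few functions go. It uses the same data, and the names roots and var_by_hand are mine. It also shows that sqrt() works on every value at once, and that the variance can be built by hand from mean(), sum(), and length().

```r
# Attractiveness ratings for the 20 participants.
data4 <- c(1.20, 1.82, 1.93, 2.04, 2.30, 2.33, 2.34, 2.47, 2.51, 2.55,
           2.64, 2.76, 2.77, 2.90, 2.91, 3.20, 3.22, 3.39, 3.59, 4.02)

median(data4)          # the middle of the ordered values

roots <- sqrt(data4)   # one square root per observation, not a single number
length(roots)          # still 20 values

# The variance by hand: squared deviations from the mean over n - 1.
n <- length(data4)
var_by_hand <- sum((data4 - mean(data4))^2) / (n - 1)
var_by_hand            # should match var(data4)
sqrt(var_by_hand)      # should match sd(data4)
```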

Now let's look at more graphics.


dch
