More Examples of Calculations in R

David C. Howell

GreenBlueBar.gif GreenBlueBar.gif


The purpose of this section is just to show you more about how R operates. At one level R is just a great electronic calculator. At another level it creates really nice graphics, and at a third level it has thousands of built in function that you can use to do all sorts of things.

Just a calculator

The first thing that we will do is to use R as a dumb calculator. If you enter an equation (e.g. b <- 5/3) the calculator will solve the equation and store the result. If you leave off the left side of the equation, it will evaluate the right side and print it out automatically. For example

> 5/3
[1] 1.666667
and
> b <- 5/3
> b
[1] 1.666667
> 

The "greater than" sign in the left margin is simply the prompt showing that this is a command line. The "[1]" tells you that 1.6667 is the first of the outputs. In this case there is only one output.

Now that we have b, we can use it in a calculation. For example, we could create a variable called x and multiply b and x together

> x <- 7
> product <- b*x
> product
[1] 11.66667
> 

Those results are not particularly interesting, so let's move on to something better. R works with vectors. That means that b can be a single number or it can be a string of numbers. Suppose that we want to find the mean of a set of numbers. We can create the set using "c" as the concatenation operator--that just means that it strings things together. For example

> x <- c(12, 14, 16, 14, 19, 20, 23)
> x    #It is probably smarter to type "print(x) because it works under more conditions.
[1] 12 14 16 14 19 20 23
> sum(x)
[1] 118
> length(x)
[1] 7
> mean(x)
[1] 16.85714
> 

The first line creates the set of x values, called a "vector." The second prints them out. (The [1] on the left says that 12 is the first value (of 7).) The next line says that we want to calculate the sum of the values of x. In the old days you would have to say "start with total = 0, add x(1) to it, add x(2) to that, add x(3) to that, and so on." Here someone has written that kind of code for you and named it the "sum" function. (Actually, the underlying code for sum() is optimized and very much faster than the step-by-step loop I described.) When you type sum(...), it goes and finds that function and does the work. You don't have to think about it. The next two commands compute the length (the number of observations in x) and the mean. We could also get the mean by typing

> xbar <- sum(x)/length(x)
print(xbar)
[1]  16.85713
> 

You probably aren't quite satisfied by my explanation of the [1] in the margin. Well let's create a large vector of random numbers (perhaps 100 of them) and print them out. We have a random number generater that someone wrote, made into a function, and stuck into R, so creating those numbers is a breeze. I want 100 random numbers drawn from a normal distribution with mean = 35 and standard deviation = 7.

y <- rnorm(100, 35, 7)
> y
  [1] 47.84491 34.26865 39.36875 25.22486 40.84229 37.31904 38.91220 36.36298 39.66853 27.86605 22.59084
 [12] 37.93739 29.37225 29.49127 32.81972 31.38824 26.29619 29.90322 37.35145 34.41677 22.94971 30.35482
 [23] 27.04887 37.99185 32.04899 38.93400 34.26171 29.55976 33.97462 28.77199 30.68097 43.87556 36.25555
 [34] 38.21872 36.91282 28.74664 36.05580 32.58514 19.76187 27.23863 40.32444 28.34713 33.25422 37.10347
 [45] 32.88337 29.75646 39.65179 30.31897 35.72330 29.27479 30.00774 32.19672 33.66307 40.42661 38.28807
 [56] 34.21049 30.60006 42.31394 40.19359 46.54127 32.24113 34.06356 21.60802 36.04582 35.69558 47.75888
 [67] 40.22709 34.05382 36.03204 40.29976 54.91402 43.98330 44.55602 41.72782 42.20727 38.08900 34.85992
 [78] 36.74180 42.81791 30.68213 26.74414 22.53559 37.61425 35.48769 32.73329 28.38565 20.69425 35.00795
 [89] 40.42212 31.61166 46.42465 32.86052 46.58976 34.12389 40.37112 32.76762 38.30332 33.29761 43.97388
[100] 40.05273

Here 47.84491 is the first number, the first entry in the second line (37.93739) is the 12th, the first entry in the third line (27.04887) is the 23rd, and so on. The trick is that the notation x[1] is a subscript to be read as "x sub 1." x[12] is "x sub 12" and so on.

Vectors are Powerful

Above I said that R operates with vectors that can be of any length. "27" is a vector of length 1, "34, 56, 67, 46" is a vector of length 4, etc. A very powerful thing about R is that it doesnt' care how long a vector is. It treats a multi-element vector the same way it treats a simple number. For example we all know that 5/3 is 1.6667. But what if I have a long vector?

> x <- c(3,6,8,12,15)
> x/3
[1] 1.000000 2.000000 2.666667 4.000000 5.000000
> 

Notice that when x is a vector with 5 elements, x/3 is also a vector with 5 elements, each of which is the corresponding value of x divided by 3. Now that may not sound like such a great thing, but it can save an enormous amount of work. Suppose we go back to my 100 random numbers. I want to get the mean of y, which is simple, and I want to get the sum of squares-i.e. the sum of each squared element of y. Below I show two ways of getting the latter.

> ybar <- mean(y)
> ybar
[1] 34.9916
> y2 <- y*y
> y2
  [1] 2289.1351 1174.3405 1549.8981  636.2936 1668.0928 1392.7109 1514.1595 1322.2663 1573.5924  776.5168
 [11]  etc
> sum(y2)
[1] 126599.4
># OR, to do all of that in one step,
> sumy2 <- sum(y*y)
> sumy2
[1] 126599.4
> # or, to use the "^2" operator, which raises things to the second power,
>sumy2 <- sum(y^2)
> sumy2
[1] 126599.4

The last thing in that example is worth a comment. Notice that I don't have to get a vector of y^2 and then sum it. I can do it all in one statement by lumping things together. As you will see in the text, the standard deviation is the square root of 1) the sum of the squared values of y minus 2) the sum of y squared divided by N, and 3) all that divided by N-1. Here N stands for the number of elements of y. When I compute a variable, it is always around for me to use it again, unless I write over it. So, if I wanted to, I could write

N <- length(y)
sumy <- sum(y)
sumy2 <- sum(y^2)
numerator <- sumy2 - (sumy)^2/N
denominator <- N-1
stdeviation <- numerator/denominator
print(stdeviation)
# But I can also cram all of this onto one line
> stdev <- sqrt((sum(y*y)-(sum(y)^2)/length(y))/(length(y)-1))
> stdev
[1] 6.480909 
>sd(y)
[1] 6.480909

Some people like to write code like the long version. I think that it is too hard to read, so I would prefer to do it in individual steps just so I'm sure what I am really doing. Otherwise it is easy to get parentheses in the wrong place, and once they are wrong, they are very hard to find. Either way works.

In the next section we will look more closely at how to enter data into R. That can save a huge amount of time. You have seen one way [x <- c(3,5,6,9)], but that is only useful if you have a small number of observations.


dch: