+1 (208) 254-6996 [email protected]
Select Page
1. Of the 5 glm() models you ran, which appears to be the best model? Support your answer by providing the relevant information from your R output. You should also provide ΔAIC values for all 5 models in your answer.
2. Of your 5 models, what was the model weight (Akaike weight) of the second best model? Show your calculation of this value.
3. Provide the image of the graph showing the confidence and prediction limits for the role of latitude in explaining elevation of ponderosa pine.
4. Provide the 95% confidence and prediction limits, as well as the predicted point value of elevation for ponderosa pine at 39.006 degrees N latitude?

## Assignment #2: Regression Models

For this assignment you will need to download the trees.csv file. This file contains data on the elevational distribution of five ‘tree’ species. For your analysis you will be looking at variables that may influence the elevation at which the species is found. During this process, you will hopefully learn a bit about linear regression models and aspects of evaluating models.

The data for this analysis was created by randomly selecting points (n=100 for most species) using GIS maps of the distribution of the species. These were then intersected with digital elevation models and the mean annual temperature (Bio_1) layer from WorldClim ( http://www.worldclim.org/ ). We are looking at five species: PIPO (Ponderosa pine, Pinus ponderosa), ABLA (Subalpine fir, Abies lasiocarpa), JUSC (Rocky Mountain juniperJuniperus scopulorum), PIED (Piñon pine, Pinus edulis), and CAGI (Saguaro, Carnegiea gigantea).

Don't use plagiarized sources. Get Your Custom Essay on
Regression Models Using R Or R-Studio
Just from \$13/Page

Types of Variables and Other Terminology

In regression, we consider two broad types of information that are used in the modeling. We are interested in what features influence something about another feature. The feature being influenced is a variable that we call the ‘response variable’. Other common terms for this are the ‘dependent variable’ as its value depends on the value of other variables. These other variables that we believe influence the response variable are called ‘explanatory variables’ because they explain the response variable. Other common terms for these variables are ‘independent variable’ or ‘covariate’.

In a simple example of a regression model we have the equation of a line: Y = a + bX (which you may have seen before as mX+b). We will make a simple change of notation to make this into standard regression ‘language’. The a will be called β0 and the b becomes β1 so our equation is now Y=β0+β1X. In this case, X represents a single explanatory variable, and Y represents the response variable. What do β0 and β1 represent? These are called ‘coefficients’. They are parameters we estimate during the regression modeling process. In this case, you probably already know that m is the slope and b is the y-intercept. In the following you will see references to the ‘intercept’ and the ‘slope coefficients’ which indicate we are dealing with linear models. In many linear models the equation is more complex and may not produce a true line, but these are really expansions of this basic equation. You may see the coefficients referred to as ‘betas’ at times as well, referring to the notation using the Greek letter β.

For this assignment, we are interested in what variables may explain the elevational distribution of these species, so elevation (in m) is our response variable. Our possible explanatory variables are latitude and longitude (in decimal degrees) and mean annual temperature (in 10ths of deg C). Because we may be asking about differences among the species, the species can also be considered an explanatory variable.

Now let’s open R and get to work. Remember to set your working directory following the procedure used in the Week 1 Assignment. Once you have done this, you can import the data in trees.csv into an R object called ‘trees’ and then we will perform some basic manipulations that will be necessary for our analysis. Remember that all items below that are in bold, italic are R commands.

Let’s import these data using the read.csv function, and then let R know that the Type column should be treated as a factor – or nominal (name) variable.

trees\$Type<-factor(trees\$Type, labels=c(‘PIPO’,’ABLA’,’JUSC’,’PIED’,’CEGI’))

Because the temperatures in these data are in tenths of a degree Celsius, we want to convert them to degrees Celsius by dividing them by 10 and placing that result back in the original column.

trees\$MeanTemp<-trees\$MeanTemp/10

Let’s summarize and visualize these data so you get a feel for it using boxplots to see the elevational distribution by species for these data. We use the summary() and boxplot() functions you learned last week to obtain summary values for each variable and produce potentially useful boxplots. The par() function is changing the graphics window so that all four boxplots are produced in the same window.

summary(trees)

par(mfrow=c(2,2),mar=c(5,4,4,4))

boxplot(trees\$Elev~trees\$Type, col=c(2,3,5,6,7),main=’Elevational Distribution by Species’, ylab=’Elevation (in m)’, xlab=’Species’)

boxplot(trees\$Long~trees\$Type, col=c(2,3,5,6,7),main=’Longitudinal Distribution by Species’, ylab=’Longitude’, xlab=’Species’)

boxplot(trees\$lat~trees\$Type, col=c(2,3,5,6,7),main=’Latitudinal Distribution by Species’, ylab=’Latitude’, xlab=’Species’)

boxplot(trees\$MeanTemp~trees\$Type, col=c(2,3,5,6,7),main=’Mean Temperature by Species’, ylab=’Temperature (deg C)’, xlab=’Species’)

We may also want to look at a scatterplot of these data to see how latitude, longitude, and species, our explanatory variables, are possibly related. We will essentially be creating a map of the distribution of these species. Keep in mind it will be somewhat distorted because degrees latitude and longitude will be treated as the same size when they really are not the same and the ‘length’ of a degree longitude shortens as you move away from the equator. We need to reset our graphic parameter to a single box, and then use the plot function to create an empty plot (type=’n’argument) specifying that longitude is on the x-axis and latitude on the y-axis. We then use the points() function to add in the locations of each species sample with a different color and marker, and add a legend with the legend() function.

par(mfrow=c(1,1),mar=c(0,0,0,0))

plot(trees\$Long, trees\$lat, type=’n’, bty=’n’, main=’Geographic Distribution by Species’,ylab=’Latitude’, xlab=’Longitude’)

points(trees\$Long[trees\$Type==’PIPO’], trees\$lat[trees\$Type==’PIPO’],col=2,pch=2)

points(trees\$Long[trees\$Type==’ABLA’], trees\$lat[trees\$Type==’ABLA’],col=3,pch=3)

points(trees\$Long[trees\$Type==’JUSC’], trees\$lat[trees\$Type==’JUSC’],col=4,pch=4)

points(trees\$Long[trees\$Type==’PIED’], trees\$lat[trees\$Type==’PIED’],col=5,pch=5)

points(trees\$Long[trees\$Type==’CEGI’], trees\$lat[trees\$Type==’CEGI’],col=6,pch=6)

legend(‘bottomleft’,title=’Species’,c(‘Ponderosa’,’Subalpine Fir’,’Rocky Mtn Juniper’, ‘Pinon’, ‘Saguaro’),col=c(2,3,4,5,6),pch=c(2,3,4,5,6))

Okay, now let’s build some regression models. In our first case we will be using simple linear regression so we are basically applying the Y=β0+β1X formula. You may have noticed in one of your boxplots that there is a lot of overlap among most of these species in their elevation. However, remember that boxplots are showing the variation in the sample, maybe we want to see if the mean elevations are different, which is often the question of interest. We can do that by finding the confidence intervals for each species, and we’ll use a linear model to find those intervals. In R there are multiple ways to fit regression models, the most basic uses the lm() function (which stands for ‘linear model’) although the glm() function (for ‘generalized linear model’) can fit a wider type of models. We will use the lm() function here. Note that in the parentheses there is a formula similar to that used in the boxplot() function – these is specifying our regression model to fit. The response variable goes on the left and the explanatory variable(s) goes on the right with a ~ representing the equal sign. For Y=β0+β1X we would write Y~X. Note that we do not specify the intercept term because it is fitted by default. The data argument just tells R that the data is in object ‘trees’ so we don’t have to be writing trees\$Elev~trees\$Type. The results of the model fit are being placed in the object mod.type.

mod.type<-lm(Elev~Type, data=trees)

We can know look at the model that was fitted by using the summary() function.

summary(mod.type)

Be sure to look at this output carefully and detect what features you recognize from previous statistics classes. Why do we have 5 coefficients (an intercept and one for each of 4 of the 5 tree species) reported? Shouldn’t there be just two? Remember that the explanatory variable is a nominal variable (a factor). It is not the typical X variable of the equation of a line. In reality, we just fitted the following model:

The ,,,, and are the coefficients reported in the summary of this model for the ABLA, JUSC, PIED, and CEGI types respectively. Remember that we were wanting the mean elevation for each of the tree species and we wanted to know if they were different from each other. How do we get that? In the equation above the variables for the tree types (ABLA, JUSC, etc) are really just 0’s or 1’s. For example, if we want the mean elevation for ABLA (subalpine fir) in our dataset, the equation becomes:

Because we’re multiplying 3 of those coefficients by zero they drop out so the equation really is:

The mean elevation of subalpine fir in our sample is 1368 meters. We can verify this using the mean() function

mean(trees\$Elev[trees\$Type==’ABLA’]

So why did we do all this if we could have just used the mean()? Hopefully this has helped you understand what regression models are doing – in this case finding the coefficients that best fit the data to produce an estimate of a response variable based on combinations of explanatory variables. Take some time to make sure you understand this procedure as this is the basis of all regression models, including the more complex forms of generalized linear regression such as logistic regression, Poisson regression, etc.

Before we move on, you may notice that the tree type ‘PIPO’ is not present in the output. Why? Ponderosa pine are the intercept value in this case, so the mean elevation of ponderosa pine is given by the intercept = 1429.18 m. The elevation of all other species is in relation to this value, so we know that the subalpine fir has a mean elevation 61.26 m lower than ponderosa pine – the slope coefficient gives the value higher or lower than the reference type of the intercept type. This allows us to get information we didn’t get using just the mean. If a slope coefficient isn’t different than zero, then that variable doesn’t add any explanatory value to the intercept. The coefficients reported above are the point estimates for the coefficient value, we really want to see what the 95% confidence interval estimates of these coefficients is. We do this using the confint() function.

confint(mod.type)

Note that the coefficient interval estimate for subalpine fir goes from -211 to 89 meters – it includes the value zero. Remember these are slope coefficients. What type of line has a slope of 0? A horizontal line. This means that we really cannot conclude that the mean elevation of subalpine fir is different than the mean elevation of ponderosa pine.

Question note: Make sure you record the values for the point and interval estimates of all coefficients from this model for use in answering the questions.

From your previous statistics classes you may have recognized that the above question could be asked using Analysis of Variance (ANOVA) because we have a continuous response variable and a categorical (factor) explanatory variable. We can show the ANOVA table using the anova() function specifying our model object in the parentheses.

anova(mod.type)

Compare the results of this table with that provided in the summary(mod.type) output. Note that you see a F value (93.066) and a p-value (2.2×10-16) for the test of the null hypothesis “Tree species do not differ in elevation” in both the ANOVA table and the summary output. This is a key point to recognize – ANOVA is based on linear regression, it is not a separate type of analysis. The difference is the information you can utilize from the analyses – note the different outputs – which ones produce more information?

The t-test you probably remember is just a simplified version as well, in the case where there are only two categories involved. If these data were in the right format (they weren’t collected to meet the needed assumptions) we would find that the t.test() function would provide an estimate of the difference of mean elevation of ponderosa and subalpine fir exactly the same as the coefficient for subalpine fir, and it would report the same t-statistic and p-value reported in the summary of the linear model for the subalpine fir coefficient (i.e, .t=-0.799, p=0.42455). We see that the t-test and ANOVA are just special cases of linear regression that emphasize significance testing whereas the regression approach allows us to work in both the significance testing and information-theoretic (discussed shortly) frameworks, and provides substantially more information about the relationship than the other two approaches.

What if we wanted confidence intervals on the mean elevation for each tree species, rather than on the difference in elevation of the four tree species from ponderosa pine? To obtain that we specify what tree type we want to be the intercept in our regression model using the relevel() function. Because ponderosa pine is the intercept now, we can get it’s 95% confidence interval directly using

(pipo.cl<-confint(mod.type)[1,])

The [1,] is telling R to extract the first listed confidence interval (row 1) from the output. We now use relevel() to set subalpine fir as the intercept, and then rerun the regression and extract the confidence interval on the intercept.

trees\$Type<-relevel(trees\$Type,’ABLA’)

mod.type<-lm(Elev~Type, data=trees)

(abla.cl<-confint(mod.type)[1,])

Question note: You should now be able to find the 95% confidence intervals on mean elevation for the remaining 3 tree species (JUSC, PIED, CEGI). Keep these confidence intervals for all 5 species for use in answers.

### Maximum Likelihood Estimation (MLE) and the Information-Theoretic Approach with AIC

Maximum likelihood estimation underlies much of modern statistics. The concept was first described by Sir R.A. Fisher in 1912 when he was an undergraduate at Cambridge. It works on a really basic premise – the best estimate for a given parameter (e.g., the mean of the population being sampled) is the one that makes the presumed distribution fit the data better than other such estimates. For example, the normal distribution has a probability density function (pdf) given by which produces the readily recognizable bell-shaped curve. For a mean of 50 and standard deviation of 15, it should be intuitive that the highest density value, 0.0266 in the figure below, should be associated with the true mean of the distribution, i.e., the normal pdf is maximized at the value of the mean.

Figure 1. PDF of Normal Distribution w/mean=50 & sd=15

Similarly, if we have some data and a hypothesized model to fit them, we should pick the parameters of that model (mean and standard deviation in the above case) that make the model’s density function fit the observed data the ‘best’. Note that this is basically a probability statement, but with a twist. If we go with the standard notation of probability, we can make a statement like ‘the probability of x given the model parameters equals’ as Pr{x | parameters}=. For the figure above, this would be Pr{50 |50,15}=0.0266, which happens to be the maximum that can be obtained. Likelihoods just turn the statement around to reflect what researchers actually do – we have data and we want to know the underlying parameter values – which means we refer to the ‘likelihood of model parameters given the data’ or L(parameters|data) – we want the value of the parameters that maximize the likelihood function built with the observed data, and we use those as our best estimates of the underlying population values we are trying to estimate.

Let’s illustrate this with a simple probability example, which is directly comparable to flipping a coin. Let’s assume we have 100 animals radiocollared and we follow them for some set period, say a year. During this period, 18 of the animals die. We want to know the best estimate of the yearly survival rate. Intuitively, we had 82 animals out of 100 survive so we’d think 82/100=0.82 is our best guess. Now let’s look at it from a MLE standpoint, what is the likelihood of 0.82 (=p) given N=100 trials and y=82 successes? This follows a classic binomial (or Bernoulli) distribution which has a likelihood function of: . We see in the figure below that, in fact, this function is maximized when p is set to 0.82, resulting in a likelihood value of 0.103.

Figure 2. Binomial likelihood function w/N=100, y=82.

So why don’t we just use the estimator p=y/N? Well, we do. This is the ‘closed’ form solution to finding the maximum of the above likelihood. For those of you with calculus backgrounds, you will remember that finding maxima (or minima) of functions is done by solving the first derivative (partial with respect to the parameter of interest) of the function when it is set equal to 0. The following figure shows the value of the first derivative of this function as p varies, and we see that it is 0 when y/N=0.82.

Figure 3. First derivative of binomial likelihood with N=100, y=82.

y/N is the solution when the first derivative is set equal to 0 and is known as the maximum likelihood estimate for a Bernoulli trial situation. Many of our common estimators are MLE’s for the same reason – they are the closed form solutions to the first derivative of the likelihood when it is set to zero. However, for many of our applications, there is no closed form estimator available, so we must use numerical optimization routines to develop our MLE of the parameter of interest.

For several reasons we often use log-likelihoods instead of likelihoods. If you notice in Fig 2 above, the value of the likelihood is extremely small (essentially 0) for much of the range of p, which can create issues due to rounding to 0, especially in a numerical optimization. In addition, logarithms often simplify calculation by making multiplicative functions additive which can also help in numerical optimization, e.g., becomes . This results in the following likelihood shape (compare to Fig 2).

Figure 4. Log-likelihood function of binomial w/N=100, y=82.

#### Why MLE’s?

Okay, so why do we care about using Maximum Likelihood Estimators (as opposed to other forms of estimators that aren’t MLE’s)? Well, for several reasons, all of which imply that there are no better estimators available given a large sample size. The likelihood principle states that the likelihood function contains all relevant information from a sample. Some reasons for using MLE’s are:

· They are asymptotically (i.e., at large sample sizes) normally distributed

· They have minimum variance (therefore provide more precise estimates)

· They are asymptotically unbiased.

· They are efficient (i.e., extract information most effectively, hence the minimum variance)

· They relate to Fisher information, which enables use in model selection, among other things.

#### Comparison of likelihoods and least squares

Most people are introduced to regression (and ANOVA) through the method of least squares, you have probably heard of ‘Sum of Squares’ and seen it calculated as where x is an individual data point. This is the sum of the squared residual values. For data with normally-distributed errors, it happens that the MLE of the regression is the value that minimizes the sum of squares, which is why it is called ‘least squares’. In simple linear regression, these can be solved with closed-form equations, which are the series of least squares equations you may have had to calculate at some point in previous statistics classes (SST, SSE, MSE). In log-likelihoods involving normally-distributed error, the deviance is equivalent to sum of squares term in least squares.

#### Profile likelihood confidence intervals

Remembering that deviance is a general form of sum of squares – it gets at the amount of residual variation that is unexplained by the model. It also is approximately chi-square distributed with 1 degree of freedom. This leads to the ability to generate confidence intervals based on the likelihood function. The χ2 value with 1 df for 95% confidence= 1.92. If we subtract 1.92 from the maximum value of the log-likelihood, and find the parameter values that produce these values in the likelihood we get the 95% confidence limits for our parameter estimates. In the example we’ve been using, we have a point estimate of 0.82 which was produced when the log-likelihood was maximized at -2.269711. Subtracting 1.92 from this gives us -4.189711. We then find that the log-likelihood equals -4.189711 where p=0.737 and 0.887. Therefore our 95% profile likelihood confidence interval is [0.737, 0.887]. We can see this in a zoomed in version of figure 3:

Figure 5. Profile likelihood confidence intervals.

You may notice that the confidence interval is not symmetrical about the point estimate (0.82), extending further on the lower end than on the upper end. If you look at figure 3, you will notice that the log-likelihood function is not symmetrical, it is steeper on the right side than on the left side. It should not surprise us that we then that the confidence interval is asymmetric, and in fact we should not expect them to be so unless we assume error is normally distributed. You should note that when you receive confidence interval estimates on coefficients from your linear models based on lm() and glm() you are getting profile likelihood intervals.

#### Information Theory and Akaike’s Information Criterion (AIC)

AIC stands for ‘An information criterion’ but is more commonly known as “Akaike’s Information Criterion” because it was developed by Hirotugu Akaike. AIC is based on the Fisher Information from the log-likelihood (the negative second derivative). Recall that this is related to the deviance. Specifically, AIC=-2logL+ 2K where K is the number of parameters in the model and logis the value of the log-likelihood function at its maximum. In our example above, if you remove the constant involving the factorials, the value of the maximum at its maximum is -47.13935. There is only one parameter, p, to be estimated in the model, therefore the AIC=-2(-47.13935)+2(1)= 96.2787. This is the same AIC that R would report if you ran a GLM on a factored vector, success, consisting of 82 1’s and 18 0’s. Run the following code:

success<-factor(c(rep(1,82),rep(0,18)))

We have used the glm() function to build our model, which presents a somewhat different output from the lm() function. Note the AIC is reported at the end, and that this matches the value shown above. Also note that our formula was success~1. We have not specified any explanatory variables. The 1 indicates we want to fit an intercept-only model to these data.

If models are based on the same underlying data, then they can be compared using AIC. Models with lower AIC values fit the data better than those with higher AIC values. Note that there are two reasons for this:

1. they have higher maximum values for their log-likelihoods (as log-likelihoods are generally less than 0 their negatives will be lower), and

2. more complex models will have a higher ‘parameter penalty’ (the 2K part). AIC is trying to combine fit with model simplicity, with the goal of trying to find the simplest model that fits the data well – this follows the ‘Principle of Parsimony’ or ‘Occam’s Razor’.

We usually compare models using deltaAIC (ΔAIC) which is the difference =AICbestmodel-AICcurrentmodel. Note that there isn’t a clearcut decision as when a model is not competitive, however, a common recommendation (Burnham and Anderson 2002) is that if ΔAIC<2 than the model is competitive, and support for the model drops off quickly as ΔAIC gets larger, with ΔAIC>7 indicating highly improbable models. Another common way of looking at models is based on Akaike weights, which can be used to look at the weight of evidence in support of the given model. Akaike weights are given by where the denominator is summed over all models in your model set. We can then compare two models, 1 and 2, with weights w1 and w2, using the evidence ratio=w1/w2. For example, let’s look at the following 5 models with ΔAIC’s of 0, 1.2, 2, 3.2, and 7. Based on these values we get the following where the evidence ratio is best model compared with the model listed (e.g., model 1 is w1):

We see that model 1 has 46.5% of the weight of evidence, but model 2 still has 25.5% of the weight. The odds of model 1 being the best model over model 4 are nearly 5 to 1 (explicitly, it is 4.95 times more likely than model 4). The nice thing about these approaches is it allows us to quantify the uncertainty involved in our model selection process, which enables us to incorporate it – which is important in that it allows for model averaging in which we don’t have to select a single ‘best model’.

One last thing concerning AIC. It is recommended that we use the small sample size correction form of AIC, AICc, which slightly increases the weight of the ‘parameter penalty’. Specifically this is given by: where n is the sample size.

### Model Selection

Now let’s return to our tree dataset to compare several models to see which explanatory variables produce the model that best fits these data. We are going to use a generalized linear model with a normally-distributed error and the ‘identity’ link function. We will have three different models, we have already run the model above (mod.type) that has tree species as the only explanatory model. Now we’ll use both tree species and latitude as explanatory variables in an additive model. First let’s rerun our first model using glm() and place the results in object ‘gmod.type’ and then see what the results look like compared to the previous output.

gmod.type<-glm(Elev~Type, data=trees)

mod.type

gmod.type

Now let’s run a multiple linear regression with tree type and latitude as the explanatory variables.

gmod.typelat<-glm(Elev~Type+lat, data=trees)

summary(gmod.typelat)

We’ll also show the 95% confidence intervals on the coefficients. Note that there is a line at the top printed while the function was running that states “Waiting for profiling to be done …”. R is calculating the intervals using the profile likelihood approach as discussed above.

confint(gmod.typelat)

Then we’ll use only latitude as an explanatory variable. We’ll log transform elevation in this case because the latitude variable is somewhat skewed. By taking the natural logarithm of the latitude these data become more normally distributed. That does mean we have to remember the values reported in the output are the logarithms of the latitude, so if we want to report the real latitude values we need to back transform the values with evalue.

gmod3.lat<-glm(Elev~lat, data=trees)

summary(gmod3.lat)

confint(gmod3.lat)

So now we have three linear models, two with a single explanatory variable each, and one with these two variables both included in an additive model. Which one of these models best describes (or ‘fits’) these data? This is where AIC is useful. Remember that as AIC gets smaller the model appears to fit the data better. In addition, adhering to the ‘principle of parsimony’, large complex models with many terms (so they have many parameters) are penalized over simpler models. AIC then provides a means of ranking models with a preference toward finding the most simple model that fits the data well. This is the basis of the information-theoretic approach to statistical inference that predominates ecological statistics today (see Burnham and Anderson 2002). Anderson et al (2000) provides an additional short accessible overview of the information-theoretic approach.

So let’s look at the AIC’s for our models, which are extracted using the AIC() function:

AIC(mod.type)

AIC(gmod2.typelat)

AIC(gmod3.lat)

We can clearly see that the addition of latitude as an explanatory variable made this model fit the data much better. In practice, we use delta AIC (ΔAIC) to compare all models with the model with the model that has the lowest AIC. There is no hard and fast ‘rule’ but models with ΔAIC’s > 2 are generally considered not particularly competitive. However, there is some inferential uncertainty here and we can capture that with ‘model averaging’, but that is beyond our scope today.

Question note: You should now be able to run additional models. Run at least two additional models, one with a single explanatory variable, and one with two explanatory variables. Record the output for all 5 GLM models, including confidence intervals on the coefficients. In addition, note the AIC for all 5 models you have run. You will use this information to answer questions.

Let’s conclude the assignment by looking at the confidence intervals and prediction intervals of our current best model (gmod.typelat). Remember that this is an additive model with the explanatory variables of latitude and tree species was the best. We will focus on how well this model demonstrates the elevational distribution of ponderosa pine.

First, we need to create an object that contains latitude values representing the range observed in our dataset for ponderosa pine. The following code does this by creating an object ‘lat’. It then adds a column for type that only has PIPO in it. We need this two column object, newpred, to have both of the explanatory variables in it. R will apply the coefficients from the model to the data in this prediction object to create new objects representing the confidence and prediction intervals.

lat<-seq(min(trees\$lat[trees\$Type==’PIPO’]),max(trees\$lat[trees\$Type==’PIPO’]),.1)

Type<-rep(‘PIPO’,length(lat))

newpred<-data.frame(lat,Type)

Now we need to create objects containing the upper and lower confidence and prediction limits based on our best model using the predict() function. The predict() function specifically works with objects created by lm() so we need to create the best model in that form using

mod.typelat<-lm(Elev~Type+lat, data=trees)

Which we now can use to create the 95% limits using

pipo.cl<-predict(gmod.typelat,newpred,interval=c(‘confidence’))

pipo.pl<-predict(gmod.typelat,newpred,interval=c(‘prediction’))

We will create a graph showing the data for ponderosa pine with latitude on the x-axis and elevation on the y-axis.

plot(trees\$lat[trees\$Type==’PIPO’],trees\$Elev[trees\$Type==’PIPO’],type=’p’,pch=20,col=2,main=’Relationship of Latitude and Elevation for Ponderosa’,xlab=’Latitude’,ylab=’Elevation’)

We then add lines showing the point estimate and 95% confidence limits for elevation at each latitude.

lines(newpred[,1],pipo.cl[,1],lwd=2,col=4)

lines(newpred[,1],pipo.cl[,2],lty=2,col=4)

lines(newpred[,1],pipo.cl[,3],lty=2,col=4)

And similarly we add the 95% prediction limits for elevation of ponderosa pine at each latitude. We will also add a legend to the graph.

lines(newpred[,1],pipo.pl[,2],lty=4,col=6)

lines(newpred[,1],pipo.pl[,3],lty=4,col=6)

legend(‘bottomleft’,c(‘Predicted Value’,’95% Confidence Limits’,’95% Prediction Limits’),col=c(4,4,6),lty=c(1,2,4),lwd=c(2,1,1))

Question note: Save an image of this graph for the questions.

Note that we are using this model to explore the explanatory potential of variables (confidence limits), and in predicting particular values (prediction limits). Observe the graph and see how the confidence interval and prediction intervals differ. For example, assume we are latitude 33 deg N and want to know what the mean elevation that ponderosa pine might be found at this elevation. We can find the point estimate manually by using the equation Elev = intercept + -48.57*latitude + 1278.44*PIPO or Elev= 2130.09+-48.57(33)+1278.44(1)= 1806 m.

The following R code extracts each coefficient from the model output and then runs this equation. The [[1]] indicates we are extracting a coefficient from this output, the number in single brackets following indicates which coefficient. This follows the listing shown on the object display so the intercept is [1], the PIPO coefficient is [5], and the latitude coefficient is last at [6] (there are 6 coefficients total). Note that we have a PIPO coefficient because we have not releveled back to PIPO as the intercept.

(int<-mod.typelat[[1]][1])

(lat.coef<-mod.typelat[[1]][6])

(pipo.coef<-mod.typelat[[1]][5]

int+lat.coef*33+pipo.coef

Question note: Make sure you have the coefficients for the mod.typelat model recorded. You will be asked to calculate the predicted elevation for a different tree species and different latitude as a question.

Note that this result is the predicted mean value on the graph for an x-axis value of 33 degrees latitude. We can see the values for the confidence and prediction limits by displaying these objects.

pipo.cl

pipo.pl

Our example was very close to the latitude of the 99th object in this list, so we can show them with

pipo.cl[99,]

pipo.pl[99,]

We believe the true mean elevation of ponderosa pine at 33 deg N latitude is between 1692 and 1919 meters. However, if someone asked us to guess the elevation of a single ponderosa pine that they found at 33 deg N latitude, we would say that it is probably between 854 and 2756 meters.

Question note: You will be asked to provide the 95% confidence and prediction limits for ponderosa pine at a different latitude based on the pipo.cl and pipo.pl objects. Know how to access these objects to find those values.

### Literature Cited

Anderson, D. R., K. P. Burnham, and W. L. Thompson. 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64(4):912-923.

Burnham, K. P.; and D. R. Anderson. 2002. Model selection and multimodel inference: a practical information-theoretic approach, 2nd edition. Springer-Verlag, New York, NY.

Questions: Based on the analysis you ran, answer the following:

1. What are the point estimate and 95% confidence intervals for the coefficients from the lm() regression of tree type explaining elevation?

2. What are the 95% confidence intervals on the mean elevation for the 5 tree species?

3. Based on your answers in #2, which species differ from subalpine fir in mean elevation? From piñon?

4. Of the 5 glm() models you ran, which appears to be the best model. Support your answer by providing the relevant information from your R output. You should also provide ΔAIC values for all 5 models in your answer.

5. Of your 5 models, what was the model weight (Akaike weight) of the second best model? Show your calculation of this value.

6. Provide the image of the graph showing the confidence and prediction limits for the role of latitude in explaining elevation of ponderosa pine.

7. What is the predicted elevation, in meters, for juniper (JUSC) at 38.5 deg N latitude?

8. Provide the 95% confidence and prediction limits, as well as the predicted point value of elevation for ponderosa pine at 39.006 degrees N latitude?

9. What is the difference between confidence and prediction intervals? Provide an example of when you would use each.

Order your essay today and save 10% with the discount code ESSAYHELP