http://edild.github.com/

Migration to http://edild.github.com/ is now complete. Octopress gives me more flexibility, and all the data for the blog is now in my own github repository (so I have direct access to it).

So visit my new site at http://edild.github.com/ !

Migrating to http://edild.github.com!

I am moving from tumblr.com to my new page at github.
All the content here is also available at http://edild.github.com.

Quantitative Ecotoxicology, page 89, example 3.3, Backstripping

Get the data from here and read it into R:
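
(The read-in step is not shown; assuming the file is named ‘p89.csv’, matching the repo file name, it would be:)

MERCURY <- read.table("p89.csv", header = TRUE, sep = ";")
head(MERCURY)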

##   DAY   PCI
## 1   3 27400
## 2   6 17601
## 3  10 12950
## 4  14 11157
## 5  22  9410
## 6  31  9150
plot(PCI ~ DAY, data = MERCURY)
abline(v = 22)

We can identify two compartments in this plot: a slow one after day 22 and a fast one before day 22.
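
The idea of backstripping is to decompose the activity into a sum of two exponentials, which is the model we will eventually fit to the full data set:

$$PCI = C_A \cdot e^{-k_A \cdot DAY} + C_B \cdot e^{-k_B \cdot DAY}$$

After day 22 the fast term has essentially decayed away, so the slow compartment can be estimated from the late observations alone.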

First we estimate $C_B$ and $k_B$ for the slow compartment, using linear regression of the ln-transformed activity against day, and then predict the activity over the whole period from this slow compartment:

# ln-transformation
MERCURY$LPCI <- log(MERCURY$PCI)
# fit linear model for day 31 to 94
mod_slow <- lm(LPCI ~ DAY, data = MERCURY[MERCURY$DAY > 22, ])
sum_mod_slow <- summary(mod_slow)
sum_mod_slow
##
## Call:
## lm(formula = LPCI ~ DAY, data = MERCURY[MERCURY$DAY > 22, ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.2502 -0.0657  0.0605  0.0822  0.1386
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  9.43096    0.13987   67.42  2.6e-12 ***
## DAY         -0.01240    0.00213   -5.82   0.0004 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.135 on 8 degrees of freedom
## Multiple R-squared: 0.809,   Adjusted R-squared: 0.785
## F-statistic: 33.9 on 1 and 8 DF,  p-value: 0.000395
exp(coef(mod_slow)[1])
## (Intercept)
##       12468

So the model for the slow component is $C = 12468 \cdot e^{-0.0124 \cdot DAY}$.

We apply a bias correction and predict the activity over the whole data range:

# bias-correction
corr_mod_slow <- exp(sum_mod_slow$sigma^2/2)
# add bias-corrected predictions to the data.frame. predict() takes the
# whole data.frame as newdata, so we get predictions for every day.
MERCURY$PRED <- exp(predict(mod_slow, newdata = MERCURY)) * corr_mod_slow
# save C_B and k_B as objects (used later...)
CB <- exp(coef(mod_slow)[1]) * corr_mod_slow
KCB <- abs(coef(mod_slow)[2])
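
A short note on why the correction factor is $e^{\hat\sigma^2/2}$: back-transforming predictions from a ln-scale model with exp() alone estimates the median, not the mean. For lognormal errors the mean exceeds the median by the factor

$$e^{\hat\sigma^2/2}$$

where $\hat\sigma^2$ is the mean square error of the ln-scale regression, which is exactly the factor computed above.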

The residuals from these predictions for days 3 to 22 are associated with the fast compartment. We fit a linear regression to the ln-transformed residuals to estimate the fast compartment:

plot(LPCI ~ DAY, data = MERCURY)
points(MERCURY$DAY[1:4], MERCURY$LPCI[1:4], pch = 16)
abline(a = log(CB), b = -KCB)
for (i in 1:4) {
  lines(c(MERCURY$DAY[i], MERCURY$DAY[i]),
        c(log(MERCURY$PRED[i]), MERCURY$LPCI[i]), lwd = 2)
}

# extract residuals
MERCURY$ERROR <- MERCURY$PCI - MERCURY$PRED
MERCURY$ERROR[1:4]
## [1] 15276.20  5920.03  1834.41   579.42
# fit linear model to ln(residuals) for Day 3 to 22
mod_fast <- lm(log(ERROR) ~ DAY, data = MERCURY[MERCURY$DAY < 22, ])
sum_mod_fast <- summary(mod_fast)
sum_mod_fast
##
## Call:
## lm(formula = log(ERROR) ~ DAY, data = MERCURY[MERCURY$DAY < 22,
##     ])
##
## Residuals:
##       1       2       3       4
##  0.0278 -0.0304 -0.0157  0.0183
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.49602    0.03755   279.5 0.000013 ***
## DAY         -0.29659    0.00407   -72.9  0.00019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 2 degrees of freedom
## Multiple R-squared:    1,    Adjusted R-squared: 0.999
## F-statistic: 5.32e+03 on 1 and 2 DF,  p-value: 0.000188
exp(coef(mod_fast)[1])
## (Intercept)
##       36171

So the model for the fast component is $C = 36171 \cdot e^{-0.297 \cdot DAY}$. Again we apply a bias correction and store $C_A$ and $k_A$ as objects:

# bias correction
corr_mod_fast <- exp(sum_mod_fast$sigma^2/2)
# save C_A and k_A as objects
CA <- exp(coef(mod_fast)[1]) * corr_mod_fast
KCA <- abs(coef(mod_fast)[2])

Now we have two models, one for the fast and one for the slow component, and we can make a plot similar to Figure 8.1 in Newman and Clements (2008, pp. 119–120):

plot(LPCI ~ DAY, data = MERCURY)
abline(mod_slow)
abline(mod_fast, lty = "dotted")
legend("topright", c("slow", "backstripped-fast"), lty = c("solid", "dotted"), cex = 0.8)

# Estimates
c(CA, KCA, CB, KCB)
## (Intercept)         DAY (Intercept)         DAY
##  3.6192e+04  2.9659e-01  1.2583e+04  1.2403e-02

We can use these estimates as start values to fit a nonlinear model to the data (that is why we stored them as objects). We want to fit the following model:

$$PCI = C_A \cdot e^{-k_A \cdot DAY} + C_B \cdot e^{-k_B \cdot DAY}$$

nls_mod1 <- nls(PCI ~ CA * exp(-KCA * DAY) + CB * exp(-KCB * DAY),
                data = MERCURY,
                algorithm = "port",  # to use the bounds
                start = list(KCB = KCB, KCA = KCA, CB = CB, CA = CA),
                lower = c(0, 0, 5000, 20000),
                upper = c(1, 1, 20000, 45000))
sum_nls_mod1 <- summary(nls_mod1)
sum_nls_mod1
##
## Formula: PCI ~ CA * exp(-KCA * DAY) + CB * exp(-KCB * DAY)
##
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)
## KCB 1.19e-02   1.41e-03    8.45  3.9e-06 ***
## KCA 2.97e-01   4.47e-02    6.66  3.6e-05 ***
## CB  1.22e+04   8.58e+02   14.27  1.9e-08 ***
## CA  3.79e+04   4.82e+03    7.86  7.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 701 on 11 degrees of freedom
##
## Algorithm "port", convergence message: relative convergence (4)

• $C_A$ is estimated as 37909 ± 4824
• $C_B$ is estimated as 12245 ± 858
• $k_A$ is estimated as 0.297 ± 0.045
• $k_B$ is estimated as 0.012 ± 0.001

And finally we plot data and model:

plot(PCI ~ DAY, data = MERCURY, type = "n")
points(MERCURY$DAY, MERCURY$PCI, pch = ifelse(MERCURY$DAY <= 22, 16, 17))
# smooth line
pred_nls_mod1 <- predict(nls_mod1, newdata = data.frame(DAY = seq(0, 100, 1)))
lines(seq(0, 100, 1), pred_nls_mod1)
legend("topright", c("Fast", "slow"), pch = c(16, 17))

Again we get nearly the same results with R, except for some differences in the linear models.

This is probably due to the bias correction in the slow-component model. We have an MSE of

sum_mod_slow$sigma^2
## [1] 0.018349

which is identical to the book. As in the previous example, the bias can be estimated as $e^{MSE/2}$:

exp(sum_mod_slow$sigma^2/2)
## [1] 1.0092

which differs from the book (1.002).
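
As a cross-check (my own, not from the book): a bias factor of 1.002 would imply an MSE of

2 * log(1.002)  # ~ 0.004

which is far from the 0.0183 that both the book and R compute, so the 1.002 may simply be a slip in the book.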

I have no SAS at hand, so I cannot check this against SAS. However, let me know if you find an error in my calculations.

Refs

Newman, Michael C., and William Henry Clements. Ecotoxicology: A Comprehensive Treatment. Boca Raton: Taylor & Francis, 2008.

Code and data are available at my github-repo under file name ‘p89’.

Quantitative Ecotoxicology, page 85, example 3.2, Nonlinear Regression

Get the data from here and read it into R:
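
(The read-in step is not shown; assuming the file is named ‘p85.csv’, matching the repo file name, it would be:)

OYSTERZN <- read.table("p85.csv", header = TRUE, sep = ";")
head(OYSTERZN)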

##   DAY ZINC
## 1   1  700
## 2   1  695
## 3   1  675
## 4   1  630
## 5   1  606
## 6   1  540
plot(ZINC ~ DAY, data = OYSTERZN)

First we fit a nonlinear regression without weighting.

mod_noweight <- nls(ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY), data = OYSTERZN,
start = list(KE = 0.004, INITACT = 500))

So we fit the model

$$C_t = C_0 \cdot e^{-(k_e + \lambda) \cdot t}$$

to our data.

In the R formula, ZINC corresponds to $C_t$,
INITACT to $C_0$,
KE to $k_e$,
0.00283 is $\lambda$, the decay rate constant of 65-Zn,
and DAY to $t$.

We are going to estimate KE and INITACT, and we also supplied some start values for the algorithm.
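
If we had no idea about plausible start values, one rough way to get them (a sketch, not from the book) would be a log-linear fit, since taking logs of the model gives a straight line in DAY:

# rough start values from a log-linear fit:
# exp(intercept) approximates INITACT, -(slope) - 0.00283 approximates KE
start_lm <- lm(log(ZINC) ~ DAY, data = OYSTERZN)
exp(coef(start_lm)[1])
-coef(start_lm)[2] - 0.00283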

We can look at the summary to get the estimates and standard errors:

summary(mod_noweight)
##
## Formula: ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY)
##
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)
## KE      2.68e-03   6.68e-04    4.01  0.00015 ***
## INITACT 4.65e+02   2.04e+01   22.86  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.8 on 71 degrees of freedom
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 5.57e-06

The resulting estimates of $k_e$ and $C_0$ are $0.00268 \pm 0.00067$ and $465 \pm 20$, respectively.

We can investigate the residuals, which show a clear pattern:

res_mod_noweight <- resid(mod_noweight)
plot(OYSTERZN$DAY, res_mod_noweight) abline(h = 0) Secondly, we run a nonlinear regression with day-squared weighting: We use day2 as weights and add there a column to our data: OYSTERZN$WNLIN <- OYSTERZN$DAY^2 We run again nls, but now we supply this new column as weights: mod_weight <- nls(ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY), data = OYSTERZN, weights = OYSTERZN$WNLIN, start = list(KE = 0.004, INITACT = 500))
summary(mod_weight)
##
## Formula: ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY)
##
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)
## KE      2.44e-03   3.47e-04    7.03  1.1e-09 ***
## INITACT 4.55e+02   3.82e+01   11.89  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7550 on 71 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 5.42e-07

The estimates ($0.00244 \pm 0.00035$ and $455 \pm 38$) are quite similar to those from the unweighted regression.
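
A quick side-by-side comparison of the coefficients (my own addition) makes this easy to see:

# compare the unweighted and weighted fits
round(cbind(unweighted = coef(mod_noweight), weighted = coef(mod_weight)), 5)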

We could plot the two models and the data:

# extract the fitted values
fit_mod_noweight <- fitted(mod_noweight)
fit_mod_weight <- fitted(mod_weight)
# plot data
plot(ZINC ~ DAY, data = OYSTERZN)
lines(OYSTERZN$DAY, fit_mod_noweight)
lines(OYSTERZN$DAY, fit_mod_weight, lty = "dashed")
legend("topright", legend = c("nonweighted", "weighted"), lty = c("solid", "dashed"))

Finally, we can also fit a linear model to the transformed zinc concentrations.

First we ln-transform the concentrations:

OYSTERZN$LZINC <- log(OYSTERZN$ZINC)

We see that the data now shows a linear trend:

plot(LZINC ~ DAY, OYSTERZN)

And fit a linear regression:

mod_lm <- lm(LZINC ~ DAY, data = OYSTERZN)
sum_lm <- summary(mod_lm)
sum_lm
##
## Call:
## lm(formula = LZINC ~ DAY, data = OYSTERZN)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.8974 -0.2448 -0.0709  0.2958  0.5715
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.073833   0.056834   106.9   <2e-16 ***
## DAY         -0.005314   0.000243   -21.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.354 on 71 degrees of freedom
## Multiple R-squared: 0.871,   Adjusted R-squared: 0.869
## F-statistic:  478 on 1 and 71 DF,  p-value: <2e-16

which fits the model

$$\ln(C_t) = intercept + a \cdot t$$

with $a = -0.0053$ and $intercept = 6.07$.

Now plot data and model, as well as the residuals:

# fitted values
fit_mod_lm <- fitted(mod_lm)

# data + fitted
plot(LZINC ~ DAY, OYSTERZN)
lines(OYSTERZN$DAY, fit_mod_lm)

# residuals
plot(mod_lm, which = 1)

The mean square error can be calculated from the summary:

# MSE
sum_lm$sigma^2
## [1] 0.12516

From this we can get an unbiased estimate of $C_0$:

# unbiased estimate for C_0: exp(MSE/2) * exp(Intercept)
exp(sum_lm$sigma^2/2) * exp(sum_lm$coefficients[1, 1])
## [1] 462.39

where

sum_lm$coefficients[1, 1] extracts the intercept from the summary. The estimated slope is $-0.00531 \pm 0.00024$; subtracting the decay rate constant of 65-Zn gives $k_e = 0.00531 - 0.00283 = 0.00248$. This result is similar to the weighted and unweighted nonlinear regressions.

Again we have the same results as with SAS :) [Small deviations may be due to rounding errors.]

Code and data are available at my github-repo under file name ‘p85’.

Quantitative Ecotoxicology, page 45, example 2.5, Gehan-Test

Get the data from here and read it into R:

CADMIUM <- read.table("p45.csv", header = TRUE, sep = ";")

‘Flip’ the data:

CADMIUM$FLIP <- abs(CADMIUM$CD - 100)
CADMIUM
##      CD SITE FLAG FLIP
## 1  81.3    A    1 18.7
## 2   4.9    A    1 95.1
## 3   4.6    A    1 95.4
## 4   3.5    A    1 96.5
## 5   3.4    A    1 96.6
## 6   3.0    A    1 97.0
## 7   2.9    A    1 97.1
## 8   1.4    B    1 98.6
## 9   0.8    B    1 99.2
## 10  0.7    B    1 99.3
## 11  0.6    A    1 99.4
## 12  0.6    A    1 99.4
## 13  0.6    B    0 99.4
## 14  0.4    B    1 99.6
## 15  0.4    B    1 99.6
## 16  0.4    B    1 99.6
## 17  0.4    B    0 99.6
## 18  0.3    B    0 99.7
## 19  0.2    A    0 99.8

And test for differences using survdiff from the survival package:

require(survival)
# log-rank
fit <- survdiff(Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 0)
fit
## Call:
## survdiff(formula = Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 0)
##
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## SITE=A 10        9     4.99      3.23      5.53
## SITE=B  9        6    10.01      1.61      5.53
##
##  Chisq= 5.5  on 1 degrees of freedom, p= 0.0187

# Peto & Peto modification of the Gehan-Wilcoxon test
fit2 <- survdiff(Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 1)
fit2
## Call:
## survdiff(formula = Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 1)
##
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## SITE=A 10     6.84     3.55      3.05      7.02
## SITE=B  9     2.84     6.13      1.76      7.02
##
##  Chisq= 7  on 1 degrees of freedom, p= 0.00808

Code and data are available at my github-repo under file name ‘p45’.

Quantitative Ecotoxicology, page 42, example 2.4, Wilcoxon rank sum test

Get the data from here and read it into R:

SULFATE <- read.table("p42.csv", header = TRUE, sep = ";")

Let’s first have a look at the data via a violin plot:

require(ggplot2)
ggplot(SULFATE, aes(x = SITE, y = SO4)) + geom_violin()

It is quite easy to perform a Wilcoxon test with the function wilcox.test:

wilcox.test(SO4 ~ SITE, data = SULFATE, correct = TRUE)
## Warning: cannot compute exact p-value with ties
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  SO4 by SITE
## W = 330.5, p-value = 0.00563
## alternative hypothesis: true location shift is not equal to 0

It works with the usual formula notation; additionally, I specified the continuity correction.

For a one-sided test we can specify the argument ‘alternative’:

wilcox.test(SO4 ~ SITE, data = SULFATE, correct = TRUE, alternative = "greater")
## Warning: cannot compute exact p-value with ties
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  SO4 by SITE
## W = 330.5, p-value = 0.002815
## alternative hypothesis: true location shift is greater than 0

The p-values are the same as with SAS. However, we get a warning since we have ties in our data (I don’t know how SAS handles this):

cannot compute exact p-value with ties

If we want to compute exact p-values in the presence of ties, we could use wilcox_test() from the coin package:

require(coin)
wilcox_test(SO4 ~ SITE, SULFATE, distribution = "exact", conf.int = TRUE)
##
##  Exact Wilcoxon Mann-Whitney Rank Sum Test
##
## data:  SO4 by SITE (A, B)
## Z = 2.781, p-value = 0.004727
## alternative hypothesis: true mu is not equal to 0
## 95 percent confidence interval:
##  0.4 2.8
## sample estimates:
## difference in location
##                    1.5

Here I also requested the 95% confidence interval via the ‘conf.int’ argument.

Code and data are available at my github-repo under file name ‘p42’.

Quantitative Ecotoxicology, page 39, example 2.3, Kaplan–Meier estimates

Get the data from here and read it into R:

SULFATE <- read.table("p39.csv", header = TRUE, sep = ";")

Convert the left-censored to right-censored data:

SULFATE$FLIP <- abs(SULFATE$SO4 - 8)
SULFATE
##    SO4 FLAG FLIP
## 1  7.9    1  0.1
## 2  7.7    1  0.3
## 3  7.1    1  0.9
## 4  6.9    1  1.1
## 5  6.5    1  1.5
## 6  6.2    1  1.8
## 7  6.1    1  1.9
## 8  5.7    1  2.3
## 9  5.6    1  2.4
## 10 5.2    1  2.8
## 11 4.5    1  3.5
## 12 4.1    1  3.9
## 13 4.0    1  4.0
## 14 3.6    1  4.4
## 15 3.5    1  4.5
## 16 3.5    1  4.5
## 17 3.3    1  4.7
## 18 2.6    1  5.4
## 19 2.5    0  5.5
## 20 2.5    0  5.5

The Kaplan-Meier estimates can be calculated using survfit() from the survival package:

require(survival)
fit <- survfit(Surv(FLIP, FLAG) ~ 1, data = SULFATE, conf.type = "plain")
fit
## Call: survfit(formula = Surv(FLIP, FLAG) ~ 1, data = SULFATE, conf.type = "plain")
##
## records   n.max n.start  events  median 0.95LCL 0.95UCL
##   20.00   20.00   20.00   18.00    3.15    1.80    4.50

I set conf.type = "plain" to be concordant with ‘CONFTYPE=LINEAR’ from SAS. The median of 3.15, 95% CI [1.8, 4.5], is the same as with SAS.

Finally a quick plot:

plot(fit)

Code and data are available at my github-repo under filename ‘p39’.

Quantitative Ecotoxicology, page 35, Robust Regression on Order Statistics

Get the data from here and read it into R:

SO4 <- read.table("p35.csv", header = TRUE, sep = ";")

First we need to convert the column indicating whether an observation is censored to TRUE/FALSE. I store it here as a new column called ‘rem2’ (you could also overwrite SO4$rem):

SO4$rem2 <- ifelse(SO4$rem == "<", TRUE, FALSE)
SO4
##    value rem  rem2
## 1    2.5   <  TRUE
## 2    2.5   <  TRUE
## 3    2.6   X FALSE
## 4    3.3   X FALSE
## 5    3.5   X FALSE
## 6    3.5   X FALSE
## 7    3.6   X FALSE
## 8    4.0   X FALSE
## 9    4.1   X FALSE
## 10   4.5   X FALSE
## 11   5.2   X FALSE
## 12   5.6   X FALSE
## 13   5.7   X FALSE
## 14   6.1   X FALSE
## 15   6.2   X FALSE
## 16   6.5   X FALSE
## 17   6.9   X FALSE
## 18   7.1   X FALSE
## 19   7.7   X FALSE
## 20   7.9   X FALSE
## 21   9.9   X FALSE

Then we can run the Robust Regression on Order Statistics with the ros() function from the NADA package:

rs <- ros(SO4$value, SO4$rem2)
print(rs)
##      n  n.cen median   mean     sd
## 21.000  2.000  5.200  5.158  2.071

This gives the same mean and standard deviation as the SAS macro (5.16 and 2.07).
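
NADA also provides a plot method for ros objects, which shows the fit on a probability plot (a quick extra, not part of the book example):

plot(rs)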

Code and data are available at my github-repo under filename ‘p35’.

Quantitative Ecotoxicology, page 33, example 2.1, Winsorization

Get the data (sulfate concentrations from the Savannah River, South Carolina, in mg/L) from here and read it into R:
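
(The read-in step is not shown; assuming the file is named ‘p33.csv’, like the other examples:)

ALL <- read.table("p33.csv", header = TRUE, sep = ";")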

So we have a data.frame with one variable and 21 observations:

str(ALL)
## 'data.frame':    21 obs. of  1 variable:
##  $SO4: num 1.3 2.3 2.6 3.3 3.5 3.5 3.6 4 4.1 4.5 ...
ALL$SO4
##  [1] 1.3 2.3 2.6 3.3 3.5 3.5 3.6 4.0 4.1 4.5 5.2 5.6 5.7 6.1 6.2 6.5 6.9
## [18] 7.1 7.7 7.9 9.9

Winsorization replaces extreme data values with less extreme values. I have written a small function to perform the winsorization:

winsori <- function(x, width = 2) {
  # check if sorted
  if (is.unsorted(x))
    stop("Values must be sorted!")
  # get number of observations
  n <- length(x)
  # replace lowest
  x[1:width] <- x[width + 1]
  # replace highest
  x[(n - width + 1):n] <- x[n - width]
  x
}

The function takes an ordered vector and replaces the 2 lowest and 2 highest values (this number can be changed via the ‘width’ argument) with their nearest remaining neighbours.

We can apply this function to our data and save the result as a new column:

ALL$SO4_win <- winsori(ALL$SO4)
# display the first and 5 last rows
ALL[c(1:5, 17:21), ]
##    SO4 SO4_win
## 1  1.3     2.6
## 2  2.3     2.6
## 3  2.6     2.6
## 4  3.3     3.3
## 5  3.5     3.5
## 17 6.9     6.9
## 18 7.1     7.1
## 19 7.7     7.7
## 20 7.9     7.7
## 21 9.9     7.7

Worked as expected.
The Winsorized mean and standard deviation are:

# mean
mean(ALL$SO4_win)
## [1] 5.081
# standard deviation
sd(ALL$SO4_win)
## [1] 1.792

For the Winsorized standard deviation we again need a homemade function: the usual standard deviation is inflated by the factor $(n - 1)/(n - 2 \cdot width - 1)$ to account for the $2 \cdot width$ replaced values:

sw <- function(x, width = 2) {
n <- length(x)
sd(x) * (n - 1)/(n - 2 * width - 1)
}
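
With n = 21 and width = 2 the inflation factor is 20/16 = 1.25; as a quick check of the arithmetic (my own addition):

# Winsorized sd by hand: sd of the Winsorized data times the inflation factor
sd(ALL$SO4_win) * 20/16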
sw(ALL$SO4_win)
## [1] 2.24

And lastly we calculate the mean of the trimmed data (trim = 2/21 removes two observations from each tail):

mean(ALL$SO4, trim = 2/21)
## [1] 5.065

Code and data are available at my github-repo under filename ‘p33’.

Quantitative Ecotoxicology with R

There is a fresh book about Quantitative Ecotoxicology:

Newman, Michael C. (2012). Quantitative Ecotoxicology, Second Edition. CRC Press

Unfortunately all examples are written in SAS, to which I have no access (€5,530 for a licence of SAS Analytics Pro).

Since I am a poor student who likes R and ecotoxicology, I will try to reproduce the examples given in the book.

All code and data are available at my github-repository.