Migration to http://edild.github.com/ is now complete. Octopress gives me more flexibility and all the data for the blog is now on my own github-repository (so I have direct acess to it).
I am moving from tumblr.com to my new page at github.
All the content here is also available at http://edild.github.com.
Get the data from here and read it into R:
MERCURY <- read.table("p89.csv", header = TRUE, sep = ";")
head(MERCURY)
## DAY PCI
## 1 3 27400
## 2 6 17601
## 3 10 12950
## 4 14 11157
## 5 22 9410
## 6 31 9150
plot(PCI ~ DAY, data = MERCURY)
abline(v = 22)
We can identify two compartments on this plot: A slow one after day 22 and a fast one before day 22.
First we estimate and
for the slow compartment, using linear regression of ln-transformed activity against day and predict from this slow compartment the activity over the whole period:
# ln-transformation
MERCURY$LPCI <- log(MERCURY$PCI)
# fit linear model for day 31 to 94
mod_slow <- lm(LPCI ~ DAY, data = MERCURY[MERCURY$DAY > 22, ])
sum_mod_slow <- summary(mod_slow)
sum_mod_slow
##
## Call:
## lm(formula = LPCI ~ DAY, data = MERCURY[MERCURY$DAY > 22, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2502 -0.0657 0.0605 0.0822 0.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.43096 0.13987 67.42 2.6e-12 ***
## DAY -0.01240 0.00213 -5.82 0.0004 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.135 on 8 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.785
## F-statistic: 33.9 on 1 and 8 DF, p-value: 0.000395
exp(coef(mod_slow)[1])
## (Intercept)
## 12468
So this gives us the model for the slow component.
We do a bias correction and predict the activity for the whole data-range:
# bias-correction
corr_mod_slow <- exp(sum_mod_slow$sigma^2/2)
# add bias corrected predicts to data.frame predict takes the whole
# data.frame es newdata, so we get predicts for every day.
MERCURY$PRED <- exp(predict(mod_slow, newdata = MERCURY)) * corr_mod_slow
# save C_B and k_B as objects (used later...)
CB <- exp(coef(mod_slow)[1]) * corr_mod_slow
KCB <- abs(coef(mod_slow)[2])
The residuals from these predictions for day 3 to 22 are associated with the fast compartment. And we fit a linear regression to the ln-transformed residuals for the fast compartment.
plot(LPCI ~ DAY, data = MERCURY)
points(MERCURY$DAY[1:4], MERCURY$LPCI[1:4], pch = 16)
abline(a = log(CB), b = -KCB)
for (i in 1:4) {
lines(c(MERCURY$DAY[i], MERCURY$DAY[i]), c(log(MERCURY$PRED[i]), MERCURY$LPCI[i]),
lwd = 2)
}
# extract residuals
MERCURY$ERROR <- MERCURY$PCI - MERCURY$PRED
MERCURY$ERROR[1:4]
## [1] 15276.20 5920.03 1834.41 579.42
# fit linear model to ln(residuals) for Day 3 to 22
mod_fast <- lm(log(ERROR) ~ DAY, data = MERCURY[MERCURY$DAY < 22, ])
sum_mod_fast <- summary(mod_fast)
sum_mod_fast
##
## Call:
## lm(formula = log(ERROR) ~ DAY, data = MERCURY[MERCURY$DAY < 22,
## ])
##
## Residuals:
## 1 2 3 4
## 0.0278 -0.0304 -0.0157 0.0183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.49602 0.03755 279.5 0.000013 ***
## DAY -0.29659 0.00407 -72.9 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 2 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 0.999
## F-statistic: 5.32e+03 on 1 and 2 DF, p-value: 0.000188
exp(coef(mod_fast)[1])
## (Intercept)
## 36171
So the model for the fast component is:
# bias correction
corr_mod_fast <- exp(sum_mod_fast$sigma^2/2)
# save C_A and k_A as objects
CA <- exp(coef(mod_fast)[1]) * corr_mod_fast
KCA <- abs(coef(mod_fast)[2])
Now we have two models: one for the fast component and one for the slow component, and we can make a plot similar to Figure 8.1 in Newman and Clements (2008, pp. 119–120).
plot(LPCI ~ DAY, data = MERCURY)
abline(mod_slow)
abline(mod_fast, lty = "dotted")
legend("topright", c("slow", "backstripped-fast"), lty = c("solid", "dashed"),
cex = 0.8)
# Estimates
c(CA, KCA, CB, KCB)
## (Intercept) DAY (Intercept) DAY
## 3.6192e+04 2.9659e-01 1.2583e+04 1.2403e-02
We can use this estimates as start-values to fit a non-linear Model to the data (therefore we stored them into objects).
We want to fit the following model:
nls_mod1 <- nls(PCI ~ CA * exp(-KCA * DAY) + CB * exp(-KCB * DAY),
data = MERCURY,
algorithm = "port", # to use the bonds
start = list(KCB = KCB, KCA = KCA, CB = CB, CA = CA),
lower = c(0, 0, 5000, 20000),
upper = c(1, 1, 20000, 45000))
sum_nls_mod1 <- summary(nls_mod1)
sum_nls_mod1
##
## Formula: PCI ~ CA * exp(-KCA * DAY) + CB * exp(-KCB * DAY)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## KCB 1.19e-02 1.41e-03 8.45 3.9e-06 ***
## KCA 2.97e-01 4.47e-02 6.66 3.6e-05 ***
## CB 1.22e+04 8.58e+02 14.27 1.9e-08 ***
## CA 3.79e+04 4.82e+03 7.86 7.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 701 on 11 degrees of freedom
##
## Algorithm "port", convergence message: relative convergence (4)
is estimated as
379094824is estimated as
12245858is estimated as
0.2970.045is estimated as
0.0120.001
And finally we plot data and model.
plot(PCI ~ DAY, data = MERCURY, type = "n")
points(MERCURY$DAY, MERCURY$PCI, pch = ifelse(MERCURY$DAY <= 22, 16, 17))
# smooth line
pred_nls_mod1 <- predict(nls_mod1, newdata = data.frame(DAY = seq(0, 100, 1)))
lines(seq(0, 100, 1), pred_nls_mod1)
legend("topright", c("Fast", "slow"), pch = c(16, 17))
Again we get nearly the same results with R, except for some differences in the linear models.
This is probably due to the bias-correction in slow-component-model. We have a MSE of
sum_mod_slow$sigma^2
## [1] 0.018349
which is identical to the book. From the previous example, the bias can be estimated as
:
exp(sum_mod_slow$sigma^2/2)
## [1] 1.0092
which is different to the book (1.002).
I have no SAS at hand, so I cannot check this with SAS. However let me know if there is an error in my calculations.
Refs
Newman, Michael C., and William Henry Clements. Ecotoxicology: A Comprehensive Treatment. Boca Raton: Taylor /& Francis, 2008.
Code and data are available at my github-repo under file name ‘p89’.
Get the data from here and read it into R:
OYSTERZN <- read.table("p85.csv", header = TRUE, sep = ";")
head(OYSTERZN)
## DAY ZINC
## 1 1 700
## 2 1 695
## 3 1 675
## 4 1 630
## 5 1 606
## 6 1 540
plot(ZINC ~ DAY, data = OYSTERZN)
First we fit a nonlinear Regression without weighting.
mod_noweight <- nls(ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY), data = OYSTERZN,
start = list(KE = 0.004, INITACT = 500))
So we fit the model
to our data.
In the R formula ZINC corresponds to ,
INITACT to ,
KE to ,
0.00283 is the decay rate constant for 65-Zn
and DAY to .
Were are going to estimate KE and INITACT and also supplied some start-values for the algorithm.
We can look a the summary to get the estimates and standard error:
summary(mod_noweight)
##
## Formula: ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## KE 2.68e-03 6.68e-04 4.01 0.00015 ***
## INITACT 4.65e+02 2.04e+01 22.86 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98.8 on 71 degrees of freedom
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 5.57e-06
The resulting estimates of and
are
0.00268 0.00067 and 465 20.
We can investigate the residuals, which show a clear pattern:
res_mod_noweight <- resid(mod_noweight)
plot(OYSTERZN$DAY, res_mod_noweight)
abline(h = 0)
Secondly, we run a nonlinear regression with day-squared weighting:
We use day2 as weights and add there a column to our data:
OYSTERZN$WNLIN <- OYSTERZN$DAY^2
We run again nls, but now we supply this new column as weights:
mod_weight <- nls(ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY), data = OYSTERZN,
weights = OYSTERZN$WNLIN, start = list(KE = 0.004, INITACT = 500))
summary(mod_weight)
##
## Formula: ZINC ~ INITACT * exp((-(KE + 0.00283)) * DAY)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## KE 2.44e-03 3.47e-04 7.03 1.1e-09 ***
## INITACT 4.55e+02 3.82e+01 11.89 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7550 on 71 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 5.42e-07
The estimates (0.00244 0.00035 and 455 38 are quite similar to the non weighted regression.
We could plot the two models and the data:
# extract the fitted values
fit_mod_noweight <- fitted(mod_noweight)
fit_mod_weight <- fitted(mod_weight)
# plot data
plot(ZINC ~ DAY, data = OYSTERZN)
# add fitted values
lines(OYSTERZN$DAY, fit_mod_noweight)
lines(OYSTERZN$DAY, fit_mod_weight, lty = "dashed")
# add legend
legend("topright", legend = c("nonweighted", "weighted"), lty = c("solid", "dashed"))
Finally we can also fit a linear model to the transformed Zinc-Concentrations:
First we ln-transform the concentrations:
OYSTERZN$LZINC <- log(OYSTERZN$ZINC)
We see that the data has now linear trend:
plot(LZINC ~ DAY, OYSTERZN)
And fit a linear regression:
mod_lm <- lm(LZINC ~ DAY, data = OYSTERZN)
sum_lm <- summary(mod_lm)
sum_lm
##
## Call:
## lm(formula = LZINC ~ DAY, data = OYSTERZN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8974 -0.2448 -0.0709 0.2958 0.5715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.073833 0.056834 106.9 <2e-16 ***
## DAY -0.005314 0.000243 -21.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.354 on 71 degrees of freedom
## Multiple R-squared: 0.871, Adjusted R-squared: 0.869
## F-statistic: 478 on 1 and 71 DF, p-value: <2e-16
which is fitting the model with a =
-0.0053 and intercept = 6.07
Now plot data and model, as well as the residuals:
# fitted values
fit_mod_lm <- fitted(mod_lm)
# data + fitted
plot(LZINC ~ DAY, OYSTERZN)
lines(OYSTERZN$DAY, fit_mod_lm)
# residuals
plot(mod_lm, which = 1)
The mean square error can be calculated from the summary:
# MSE
sum_lm$sigma^2
## [1] 0.12516
From which we can get an unbiased estimate of $C_0$:
# unbiased estimate for C_0: exp(MSE/2) * exp(Intercept)
exp(sum_lm$sigma^2/2) * exp(sum_lm$coefficients[1, 1])
## [1] 462.39
where
sum_lm$coefficients[1, 1]
extracts the intercept from the summary.
The estimated k in the summary output is -0.00531 0.00024, and .
This result is similar to the weighted and non weighted nonlinear regression.
Again we have the same results as with SAS :) [Small deviations may be due to rounding error]
Code and data are available at my github-repo under file name ‘p85’.
Get the data from here and read it into R:
CADMIUM <- read.table("p45.csv", header = TRUE, sep = ";")
‘Flip’ the data:
CADMIUM$FLIP <- abs(CADMIUM$CD - 100)
CADMIUM
## CD SITE FLAG FLIP
## 1 81.3 A 1 18.7
## 2 4.9 A 1 95.1
## 3 4.6 A 1 95.4
## 4 3.5 A 1 96.5
## 5 3.4 A 1 96.6
## 6 3.0 A 1 97.0
## 7 2.9 A 1 97.1
## 8 1.4 B 1 98.6
## 9 0.8 B 1 99.2
## 10 0.7 B 1 99.3
## 11 0.6 A 1 99.4
## 12 0.6 A 1 99.4
## 13 0.6 B 0 99.4
## 14 0.4 B 1 99.6
## 15 0.4 B 1 99.6
## 16 0.4 B 1 99.6
## 17 0.4 B 0 99.6
## 18 0.3 B 0 99.7
## 19 0.2 A 0 99.8
And test for differences using survdiff from the survival package:
require(survival)
# log-rank
fit <- survdiff(Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 0)
fit
## Call:
## survdiff(formula = Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## SITE=A 10 9 4.99 3.23 5.53
## SITE=B 9 6 10.01 1.61 5.53
##
## Chisq= 5.5 on 1 degrees of freedom, p= 0.0187
# Peto & Peto modification of the Gehan-Wilcoxon test
fit2 <- survdiff(Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 1)
fit2
## Call:
## survdiff(formula = Surv(FLIP, FLAG) ~ SITE, data = CADMIUM, rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## SITE=A 10 6.84 3.55 3.05 7.02
## SITE=B 9 2.84 6.13 1.76 7.02
##
## Chisq= 7 on 1 degrees of freedom, p= 0.00808
Code and data are available at my github-repo under file name ‘p45’.
Get the data from here and read it into R:
SULFATE <- read.table("p42.csv", header = TRUE, sep = ";")
Lets first have a look at the data via a violin plot:
require(ggplot2)
ggplot(SULFATE, aes(x = SITE, y = SO4)) + geom_violin()
It is quite easy to perform a wilcoxon-test with the function wilcox.test:
wilcox.test(SO4 ~ SITE, data = SULFATE, correct = TRUE)
## Warning: cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: SO4 by SITE
## W = 330.5, p-value = 0.00563
## alternative hypothesis: true location shift is not equal to 0
It works with the usual formula-notation, additional I specified the continuity correction.
For a one-sided test we can specify the argument ‘alternative’:
wilcox.test(SO4 ~ SITE, data = SULFATE, correct = TRUE, alternative = "greater")
## Warning: cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: SO4 by SITE
## W = 330.5, p-value = 0.002815
## alternative hypothesis: true location shift is greater than 0
The p-values are the same as with SAS, however we get a warning since we have ties in our data (I don’t know how SAS handles this):
cannot compute exact p-value with ties
If we want to compute exact p-values in the presence of ties, we could use wilcox_test() from the coin package:
require(coin)
wilcox_test(SO4 ~ SITE, SULFATE, distribution = "exact", conf.int = TRUE)
##
## Exact Wilcoxon Mann-Whitney Rank Sum Test
##
## data: SO4 by SITE (A, B)
## Z = 2.781, p-value = 0.004727
## alternative hypothesis: true mu is not equal to 0
## 95 percent confidence interval:
## 0.4 2.8
## sample estimates:
## difference in location
## 1.5
Here I also specified to output the 95%-Confidence-Interval via the ‘conf.int argument.
Code and data are available at my github-repo under file name ‘p42’.
Get the data from here and read it into R:
SULFATE <- read.table("p39.csv", header = TRUE, sep = ";")
Convert left to right censored data:
SULFATE$FLIP <- abs(SULFATE$SO4 - 8)
SULFATE
## SO4 FLAG FLIP
## 1 7.9 1 0.1
## 2 7.7 1 0.3
## 3 7.1 1 0.9
## 4 6.9 1 1.1
## 5 6.5 1 1.5
## 6 6.2 1 1.8
## 7 6.1 1 1.9
## 8 5.7 1 2.3
## 9 5.6 1 2.4
## 10 5.2 1 2.8
## 11 4.5 1 3.5
## 12 4.1 1 3.9
## 13 4.0 1 4.0
## 14 3.6 1 4.4
## 15 3.5 1 4.5
## 16 3.5 1 4.5
## 17 3.3 1 4.7
## 18 2.6 1 5.4
## 19 2.5 0 5.5
## 20 2.5 0 5.5
The Kaplan-Meier estimates can be calculated using survfit() from the survival package:
require(survival)
fit <- survfit(Surv(FLIP, FLAG) ~ 1, data = SULFATE, conf.type = "plain")
fit
## Call: survfit(formula = Surv(FLIP, FLAG) ~ 1, data = SULFATE, conf.type = "plain")
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 20.00 20.00 20.00 18.00 3.15 1.80 4.50
I set conf.type=“plain” to be concordant with ‘CONFTYPE=LINEAR’ from SAS.
The median of 3.15, 95% CI [1.8, 4.5] is the same as with SAS.
Finally a quick plot:
plot(fit)
Code and data are available at my github-repo under filename ‘p39’.
Get the data from here and read it into R:
SO4 <- read.table("p35.csv", header = TRUE, sep = ";")
First we need to convert the column indicating if an observation is censored to TRUE/FALSE:
I store it here as a new colum called ‘rem2’ (you could also overwrite df$rem):
SO4$rem2 <- ifelse(SO4$rem == "<", TRUE, FALSE)
SO4
## value rem rem2
## 1 2.5 < TRUE
## 2 2.5 < TRUE
## 3 2.6 X FALSE
## 4 3.3 X FALSE
## 5 3.5 X FALSE
## 6 3.5 X FALSE
## 7 3.6 X FALSE
## 8 4.0 X FALSE
## 9 4.1 X FALSE
## 10 4.5 X FALSE
## 11 5.2 X FALSE
## 12 5.6 X FALSE
## 13 5.7 X FALSE
## 14 6.1 X FALSE
## 15 6.2 X FALSE
## 16 6.5 X FALSE
## 17 6.9 X FALSE
## 18 7.1 X FALSE
## 19 7.7 X FALSE
## 20 7.9 X FALSE
## 21 9.9 X FALSE
Then we can run the Robust Regression on Order Statistics with the ros() function from the NADA package:
require(NADA)
rs <- ros(SO4$value, SO4$rem2)
print(rs)
## n n.cen median mean sd
## 21.000 2.000 5.200 5.158 2.071
Which gives the same mean and standard deviation as the SAS-Makro (5.16 and 2.07).
Code and data are available at my github-repo under filename ‘p35’.
Get the data (Sulfate Concentrations from Savannah River (South Carolina) in mg / L)) from here and read it into R:
ALL <- read.table("p33.csv", header = TRUE, sep = ";")
So we have a data.frame with one variable and 21 observations:
str(ALL)
## 'data.frame': 21 obs. of 1 variable:
## $ SO4: num 1.3 2.3 2.6 3.3 3.5 3.5 3.6 4 4.1 4.5 ...
ALL$SO4
## [1] 1.3 2.3 2.6 3.3 3.5 3.5 3.6 4.0 4.1 4.5 5.2 5.6 5.7 6.1 6.2 6.5 6.9
## [18] 7.1 7.7 7.9 9.9
Winsorization replaces extreme data values with less extreme values. I have written a small function to run the winsorisation:
winsori <- function(x, width = 2) {
# check if sorted
if (is.unsorted(x))
stop("Values must be sorted!")
# get number of observations
n <- length(x)
# Replace lowest
x[1:width] <- x[width + 1]
# replace highest
x[(n - width + 1):n] <- x[(n - width)]
x
}
The function takes a ordered vector and replaces the 2 highest and 2 lowest values (can be changed by the ‘width’-Argument by their neighbors.
We can apply this function to our data and safe it as new column:
ALL$SO4_win <- winsori(ALL$SO4)
# display the first and 5 last rows
ALL[c(1:5, 17:21), ]
## SO4 SO4_win
## 1 1.3 2.6
## 2 2.3 2.6
## 3 2.6 2.6
## 4 3.3 3.3
## 5 3.5 3.5
## 17 6.9 6.9
## 18 7.1 7.1
## 19 7.7 7.7
## 20 7.9 7.7
## 21 9.9 7.7
Worked as expected.
The Winsorized mean and standard-deviation is:
# mean
mean(ALL$SO4_win)
## [1] 5.081
# standard deviation
sd(ALL$SO4_win)
## [1] 1.792
For the Winsorized Standard Deviation we need again a homemade function:
sw <- function(x, width = 2) {
n <- length(x)
sd(x) * (n - 1)/(n - 2 * width - 1)
}
sw(ALL$SO4_win)
## [1] 2.24
And lastly we calculate the mean for the trimmed data (remove two observation from each tail):
mean(ALL$SO4, trim = 2/21)
## [1] 5.065
Code and data are available at my github-repo under filename ‘p33’.
There is a fresh book about Quantitative Ecotoxicology:
Newman, Michael C. (2012). Quantitative Ecotoxicology, Second Edition. CRC Press
Unfortunately all examples are written in SAS, to which I have no access (5.530€ for a licence of SAS Analytics Pro).
Since I am a poor student who likes R and ecotoxicology, I will try to reproduce the examples given in the book.
All code and data are available at my github-repository.