To illustrate the possibilities of speaq 2.0, the wine dataset is used (available form the University of Copenhagen at models.life.ku.dk). This dataset is also utilized in the paper so with this vignette you will be able to reproduce the results.
Before we start with the example let’s first recap what the new method in speaq 2.0 encompass:
getWaveletPeaks()
PeakGrouper()
PeakFilling()
BuildFeatureMatrix()
library(speaq)
data(Winedata)
Spectra.wine <- as.matrix(Winedata$spectra )
ppm.wine <- as.numeric(Winedata$ppm)
wine.color <- Winedata$wine.color
wine.origin <- Winedata$origin
# all spectra
speaq::drawSpecPPM(Y.spec = Spectra.wine,
X.ppm = ppm.wine,
title = 'Wine data spectra',
groupFactor = wine.color,
legend.extra.x = 1,
legend.extra.y = 1.1)
The drawSpecPPM()
plotting function indicates that there
might be a gap in the data (which is correct as the original authors
deleted the region between 5.0 and 4.5 ppm). This warning can be of
importance for the interpretation of the plot as in this case the
original authors deleted the data, not by setting the values to 0, but
by effectively removing it from the matrix and ppm vector. This produces
a plot that appears continuous but is in fact not.
The drawSpecPPM()
function also indicates that the
groupFactor is not a factor and (successfully) attempts to do the
conversion. The next plot is an excerpt of the wine NMR spectra.
Now that we’ve had a look at the spectra it is time to convert these
to peaks by using the getWaveletPeaks()
function. This
takes about 50 seconds, with nCPU = 4
on a 2.5GHz machine
(nCPU
is set to 1 for the vignette build but should be
changed).
wine.peaks <- speaq::getWaveletPeaks(Y.spec=Spectra.wine,
X.ppm=ppm.wine,
baselineThresh = 10,
SNR.Th = -1,
nCPU = 2,
include_nearbyPeaks = TRUE) # nCPU set to 2 for the vignette build
wine.grouped <- speaq::PeakGrouper(Y.peaks = wine.peaks,
min.samp.grp = 5,
grouping.window.width = 200)
There are two notes to be made on some important parameters: 1. The
include_nearbyPeaks = TRUE
option also detects peaks in the
tails of larger other peaks. At first this might seem the preferred
option as you want to detect as much peaks as possible but we will see
later that often it is better to exclude them as these small peaks can
cause problems later on. 2. The grouping.window.width
parameter is chosen to be twice the size of the default value: it is
advised to choose this larger when working with spectra that exhibit
large between sample shifts. The wine dataset is an extreme example of
large between sample shifts (caused by the substantial pH differences
between wines of different colour)
Now we can plot the detected peaks and the grouped peaks. The dataset
after grouping contains both the original ppm values of every peak (in
the peakPPM
variable) but also the group information (found
in the peakIndex
variable). By calling the
AddPlottingStuff()
function the groupPPM variable is added
so we also have the ppm value of the groups (the link is:
groupPPM <- ppm.wine[peakIndex]
)
# adding labels to the dat a for plotting and the group ppm values
library(ggplot2)
ROI.ppm <- 1.330
roiWidth.ppm <- 0.025
speaq::ROIplot(Y.spec = Spectra.wine,
X.ppm = ppm.wine,
ungrouped.peaks = wine.peaks,
grouped.peaks = wine.grouped,
ROI.ppm = ROI.ppm,
roiWidth.ppm = roiWidth.ppm,
groupLabels = as.factor(wine.color))
The plot above shows clearly what the basis of speaq 2.0 does,
convert spectra accurately to peak data and group these peaks. The
quality of the grouping can be checked by plotting the silhouette
values. this can be done with the SilhouetR()
function.
SilhouetteValues <- speaq::SilhouetR(DataMatrix = wine.grouped$peakPPM,
GroupIndices = wine.grouped$peakIndex)
## DataMatrix is not a matrix, attempting conversion with the assumption of only 1 variable (1 column)
Silh_plot <- ggplot(SilhouetteValues, aes(SilhouetteValues)) +
geom_freqpoly(binwidth = 0.03) +
theme_bw()
Silh_plot
It is clear that the grouping is very good. To be absolutely sure we can
check the mean silhouette value of every group to see if there are
groups that should be regrouped. This regrouping, if needed, can be done
with the regroupR()
function.
groups <- unique(SilhouetteValues$GroupIndices)
Ngroups <- length(groups)
sil_means <- matrix(NA, ncol = 3, nrow = Ngroups)
for(k in 1:Ngroups){
sil_means[k,1] = groups[k]
sil_means[k,2] = mean(SilhouetteValues$SilhouetteValues[SilhouetteValues$GroupIndices==groups[k]])
sil_means[k,3] = mean(wine.grouped$peakSNR[wine.grouped$peakIndex==groups[k]])
}
sil_means <- sil_means[order(sil_means[,2]),]
colnames(sil_means) <- c("groupIndex", "avg_silhouette_val", "avg. SNR")
head(sil_means)
## groupIndex avg_silhouette_val avg. SNR
## [1,] 5313 0.2557844 32.261401
## [2,] 5050 0.4549954 11.669748
## [3,] 2330 0.4848325 31.757387
## [4,] 4711 0.4895859 5.754990
## [5,] 5552 0.5011936 6.933832
## [6,] 3287 0.5099114 3.559727
As it turns out, there is a group with a low average silhouette value (0.25) but a fairly high average signal-to-noise ratio (if this group had a low SNR ratio it could just be a noise group). This indicates that these are non-noise peaks which are grouped incorrectly, the following plot acknowledges this
faulty.groupIndex <- sil_means[1,1]
ROI.ppm <- ppm.wine[faulty.groupIndex]
roiWidth.ppm <- 0.1
speaq::ROIplot(Y.spec = Spectra.wine,
X.ppm = ppm.wine,
ungrouped.peaks = wine.peaks,
grouped.peaks = wine.grouped,
ROI.ppm = ROI.ppm,
roiWidth.ppm = roiWidth.ppm,
groupLabels = as.factor(wine.color))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Clearly the large peaks and small peaks are grouped together because of the large shifts in the wine dataset making the small and large peaks overlap, this messes up the grouping. There are multiple ways to solve this issue:
By setting include_nearbyPeaks = FALSE
in the
getWaveletPeaks()
function these small peaks in the tails
of large peaks will not be included, although some other peaks might
also be missed. Or
by selecting the wrong group and submitting them to the
regroupR()
function. This will use both the ppm values and
the peak signal-to-noise ratio for the initial between peak distance
calculation and group them according to this distance.
We will use option 2 in this case to demonstrate the use of the
regroupR()
function. Since the wrong grouping of these two
overlapping groups also causes a clear misgrouping of the next group we
will submit the 3 peak groups (see right half of bottom plot) to the
regroupR()
function to fix the issues.
wrong.groups <- sort(sil_means[sil_means[,1]>=sil_means[1,1],1])[1:2]
wine.regrouped <- speaq::regroupR(grouped.peaks = wine.grouped,
list.to.regroup = wrong.groups,
min.samp.grp = 5,
max.dupli.prop = 0.1)
## 'list.to.regroup' is not a list. Assuming it is a list with only one element.
The plot below reveals the problem is clearly fixed now, but it requires more user interaction than wanted.
faulty.groupIndex <- sil_means[1,1]
ROI.ppm <- ppm.wine[faulty.groupIndex]
roiWidth.ppm <- 0.1
speaq::ROIplot(Y.spec = Spectra.wine,
X.ppm = ppm.wine,
ungrouped.peaks = wine.peaks,
grouped.peaks = wine.regrouped,
ROI.ppm = ROI.ppm,
roiWidth.ppm = roiWidth.ppm,
groupLabels = as.factor(wine.color))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
With this peaks regrouped we can continue the analysis. first peak filling is applied and secondly the grouped peak data is converted to a so called feature matrix with samples for rows and features (peak groups) for columns whereby 1 matrix element indicates the peakvalue for a specific group and sample combo.
wine.filled <- speaq::PeakFilling(Y.grouped = wine.regrouped,
Y.spec = Spectra.wine,
max.index.shift = 50,
nCPU = 1) # nCPU set to 1 for the vignette build
wine.Features <- speaq::BuildFeatureMatrix(wine.filled)
Now that we have these peak data they can either be incorporated in larger metabolomic experiments, where NMR spectroscopy data is often combined with LC-MS data (which is processed always in peak data format), or it can be analysed on its own, as we will demonstrate here.
Now that we have the feature matrix we can quickly perform a PCA
(principal component analysis) as a way of visualizing potential trends
and groups in the data. Before any PCA it is advised to scale and center
the data, here we will use the pareto scaling but other are available
(see the SCANT()
help file)
common.pca <- prcomp(wine.Features.scaled)
loadings <- common.pca$rotation
scores <- common.pca$x
varExplained <- common.pca$sdev^2
barplot(varExplained/sum(varExplained),
main="Scree Plot",ylab="Proportion of variance explained",
xlab = "Principal comonent",
names.arg = as.character(seq(1,length(varExplained))))
plot.marks <- as.numeric(wine.color)
plot.marks[plot.marks == 1] <- 8
plot.marks[plot.marks == 2] <- 15
plot.marks[plot.marks == 3] <- 1
cp1 <- 1
cp2 <- 2
plot(scores[,cp1]/max(scores[,cp1]), scores[,cp2]/max(scores[,cp2]),
main=paste("score plot, PC",cp1," vs. PC",cp2,sep=""),
xlab=paste("PC",cp1,round(varExplained[cp1]/sum(varExplained),digits=2),""),
ylab=paste("PC",cp2,round(varExplained[cp2]/sum(varExplained),digits=2),""),
pch = plot.marks)
text(scores[,cp1]/max(scores[,cp1]),scores[,cp2]/max(scores[,cp2]), wine.color, cex=0.5, pos=4, col="red")
lines(x = c(-100,100), y = c(0,0))
lines(x = c(0,0), y = c(-100,100))
legend("topleft",
legend = c("red ","rosé ","white "),
pch = c(8,15,1),
y.intersp = 1.9)
From the PCA analysis it is already clear that there is a large difference between on the one side red wine and on the other side white and rosé wine.
Next we can use the relevant.features()
function in
speaq which uses linear models to produce a p-value corresponding to the
null hypothesis that feature x is not related to the outcome vector.
After a p-value correction for multiple testing (as with increasing
numbers of even randomly generated features some will eventually be
significantly related to the output vector) we can identify the relevant
features. As noted in the paper this method is perfectly suited for
outcome vectors containing either 2 classes or more classes with a
numeric relationship. This is unfortunately not the case here as red,
white and rosé are 3 categorical variables. Since there are only 2 rosé
wines we will perform the analysis leaving these ones out. This does
require however to rescale the raw data without the rosé wines, as
otherwise they still have an influence on the dataset ### removing the
rose wine
p.all_bonf <- speaq::relevant.features.p(datamatrix = red.white.scaled,
responsevector = red.white.colors,
p.adj = "bonferroni")
## Warning in speaq::relevant.features.p(datamatrix = red.white.scaled,
## responsevector = red.white.colors, : 'responsevector' is deprecated, please use
## the more general 'response' variable.
## [1] "response was transformed to numeric"
significant.features <- p.all_bonf[p.all_bonf$p.values<=0.05, ]
# order from most significant
significant.features <- significant.features[order(significant.features[,2]),]
head(significant.features)
## index p.values
## 126 5170 2.880545e-18
## 125 5146 1.006329e-12
## 201 7867 3.355267e-07
## 93 4201 8.326499e-07
## 83 3852 2.797353e-06
## 124 5121 7.234151e-06
It is clear that there are features well below the 0.05 p-value threshold. This was to be expected as the difference between the two classes was already apparent in the PCA analysis. Often when the PCA analysis does not reveal obvious results the differential analysis bases on linear models will nonetheless find significant features. We can now easily find which features correspond to the low p-values and consequently plot them.
peak_of_interest <- 5 # change this to the peak you want
interest.groupIndex <- significant.features$index[peak_of_interest]
interest.peakIndex <- as.numeric(rownames(significant.features))[peak_of_interest]
ROI.ppm <- ppm.wine[interest.groupIndex]
roiWidth.ppm <- 0.03
ggplot(p.all_bonf, aes(x=as.numeric(rownames(p.all_bonf)), y= -log10(p.values) )) +
geom_point(data = p.all_bonf[-interest.peakIndex,],
aes(x=as.numeric(rownames(p.all_bonf[-interest.peakIndex,])), y= -log10(p.values) ),
shape = 16) +
geom_point(data = p.all_bonf[interest.peakIndex,], aes(x=interest.peakIndex, y= -log10(p.values) ),
shape = 18,
size = 3,
colour ="#00B0F6" ) +
xlab("feature index") +
ylab("- log10 p-value") +
ggtitle("Bonferroni corrected p-values") +
geom_hline(aes(yintercept= -log10(0.05), color="red"),linetype = 2) +
guides(color=FALSE) +
theme_bw() +
theme(plot.title = element_text(lineheight = 0.8, face="bold", margin = margin(12,0,13,0),hjust = 0.5, size = 15),
text = element_text(size=14))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
speaq::ROIplot(Y.spec = Spectra.wine,
X.ppm = ppm.wine,
ungrouped.peaks = wine.peaks,
grouped.peaks = wine.regrouped,
ROI.ppm = ROI.ppm,
roiWidth.ppm = roiWidth.ppm,
groupLabels = as.factor(wine.color))
Or just plot the spectra of a single feature
peak_of_interest <- 4# change this number to any of the peaks you want to see
drawSpecPPM(Y.spec = Spectra.wine[wine.color != "rose", ],
X.ppm = ppm.wine,
groupFactor = red.white.colors,
title = paste("significant feature, p-value =",
format(significant.features$p.values[peak_of_interest],
scientific = TRUE,
digits=2),
sep=" "),
legend.extra.x = 1.1,
legend.extra.y = 0.9,
ROI = significant.features$index[peak_of_interest],
roiWidth = 100,
legendpos = "topright" )
Now that we have identified relevant features we can use the original
speaq
package to plot the spectra more clearly.
Specifically we can use the dohCluster()
function from
speaq
to align the raw spectra and consequently plot these
with the drawSpecPPM()
function as in the plot above.
peakList <- speaq::detectSpecPeaks(as.matrix(Spectra.wine),
nDivRange = 128,
scales = seq(1, 16, 2),
baselineThresh = 100,
SNR.Th = -1,
verbose=FALSE)
resFindRef <- findRef(peakList)
refInd <- resFindRef$refInd
aligned.spectra <- speaq::dohCluster(as.matrix(Spectra.wine),
peakList = peakList,
refInd = refInd,
maxShift = 200,
acceptLostPeak = TRUE, verbose=FALSE)
speaq::drawSpecPPM(Y.spec = aligned.spectra[wine.color != "rose", ],
X.ppm = ppm.wine,
groupFactor = red.white.colors,
title = paste("significant feature, p-value =",
format(significant.features$p.values[peak_of_interest],
scientific = TRUE,
digits=2),
sep=" "),
legend.extra.x = 1.1,
legend.extra.y = 0.9,
ROI = significant.features$index[peak_of_interest],
roiWidth = 100,
legendpos = "topright" )