How does database fragmentation scoring work?

To score database fragmentation matches, we use an algorithm based on the widely adopted cosine similarity method. MassBank, for example, implements a similar method [pdf].

Cosine similarity method

The dot product of two 2-dimensional vectors, ${\bf x} = x_1 {\bf i} + x_2 {\bf j}$ and ${\bf y} = y_1 {\bf i} + y_2 {\bf j}$, is:

${\bf x} \cdot {\bf y} = x_1 \times y_1 + x_2 \times y_2$

It can also be expressed as:

${\bf x} \cdot {\bf y} = |{\bf x}| |{\bf y}| \cos(\theta)$

where $\theta$ is the angle between the two vectors, and $|{\bf x}| = \sqrt{x_1^2 + x_2^2}$ is the length of ${\bf x}$ (and similarly for ${\bf y}$).

Equating these two formulae and rearranging gives the cosine of the angle between the two vectors, which we use as the "similarity" between them. It has the useful property that it ranges from 0 to 1 when all coefficients are non-negative:

$\mathrm{similarity} = \cos(\theta) = \frac{{\bf x} \cdot {\bf y}}{|{\bf x}||{\bf y}|}$

This method can also be expanded to n-dimensional vectors:

$\mathrm{similarity} = \frac{\sum_{i=1}^{n} x_i \times y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}$

A similarity of 1 means the two vectors point in the same direction (they are identical up to a positive scale factor), and a similarity of 0 means they are orthogonal, i.e. they have no components in common.
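
As an illustration, the n-dimensional formula above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used by the software: the function name `cosine_similarity` is hypothetical, and returning 0 when either vector has zero length is an assumption, since the formula itself is undefined in that case.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat a zero-length vector as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)


# Vectors pointing in the same direction score (very close to) 1.
print(cosine_similarity([1, 2], [2, 4]))
# Orthogonal vectors score 0.
print(cosine_similarity([1, 0], [0, 1]))
```

Because the result depends only on the angle between the vectors, scaling either vector by a positive constant does not change the similarity.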

Cosine similarity method applied to MS/MS scoring

We apply this method to the scoring of MS/MS database matches as follows.

We create two vectors, ${\bf E}$ for the experimental spectrum and ${\bf D}$ for the database spectrum, where each element of a vector is a weighted peak intensity given by:

$W_i = (\mathrm{m/z\ of\ i^{th}\ peak})^2\sqrt{\mathrm{intensity\ of\ i^{th}\ peak}}$
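
A minimal Python sketch of this weighting is shown below. The function name `weighted_intensity` and its parameter names are hypothetical, chosen only for illustration.

```python
import math


def weighted_intensity(mz: float, intensity: float) -> float:
    """Weighted peak intensity: (m/z)^2 * sqrt(intensity)."""
    return mz ** 2 * math.sqrt(intensity)


# A peak at m/z 100 with intensity 60.
print(weighted_intensity(100, 60))
```

Note the effect of the two exponents: squaring the m/z gives more weight to peaks at higher m/z, while taking the square root of the intensity reduces the dominance of the most intense peaks.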

We combine the m/z values of all peaks from the experimental and database spectra, and go through them in ascending m/z order. For each m/z, there are three possibilities:

1. There is an experimental peak at the given m/z, but no matching database peak.
2. There is a database peak at the given m/z, but no matching experimental peak.
3. There is an experimental peak at the given m/z, and a database peak at the same m/z (to within a threshold).

For each of these scenarios, we add elements to the vectors ${\bf E}$ and ${\bf D}$ as follows:

1. We add the weighted experimental peak intensity to ${\bf E}$ and a 0 to ${\bf D}$.
2. We add a 0 to ${\bf E}$ and the weighted database peak intensity to ${\bf D}$.
3. We add the weighted experimental peak intensity to ${\bf E}$ and the weighted database peak intensity to ${\bf D}$.

Finally, we calculate the similarity metric on ${\bf E}$ and ${\bf D}$ as defined above. To obtain a score between 0 and 100, we multiply this result by 100.
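
The whole procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the actual implementation: the names `build_vectors` and `fragmentation_score` are hypothetical, the matching threshold `tol` and its default value are assumptions (the threshold used in practice is not stated here), and peaks are paired with a simple two-pointer merge over the sorted spectra, which is only one possible matching strategy.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def weight(mz: float, intensity: float) -> float:
    """Weighted peak intensity: (m/z)^2 * sqrt(intensity)."""
    return mz ** 2 * math.sqrt(intensity)


def build_vectors(experimental: List[Peak], database: List[Peak],
                  tol: float = 0.01) -> Tuple[List[float], List[float]]:
    """Walk both spectra in ascending m/z order and build E and D."""
    exp, db = sorted(experimental), sorted(database)
    E: List[float] = []
    D: List[float] = []
    i = j = 0
    while i < len(exp) and j < len(db):
        (mz_e, int_e), (mz_d, int_d) = exp[i], db[j]
        if abs(mz_e - mz_d) <= tol:      # peak present in both spectra
            E.append(weight(mz_e, int_e))
            D.append(weight(mz_d, int_d))
            i += 1
            j += 1
        elif mz_e < mz_d:                # experimental peak only
            E.append(weight(mz_e, int_e))
            D.append(0.0)
            i += 1
        else:                            # database peak only
            E.append(0.0)
            D.append(weight(mz_d, int_d))
            j += 1
    for mz_e, int_e in exp[i:]:          # leftover experimental peaks
        E.append(weight(mz_e, int_e))
        D.append(0.0)
    for mz_d, int_d in db[j:]:           # leftover database peaks
        E.append(0.0)
        D.append(weight(mz_d, int_d))
    return E, D


def fragmentation_score(experimental: List[Peak], database: List[Peak],
                        tol: float = 0.01) -> float:
    """Cosine similarity of E and D, scaled to the range 0 to 100."""
    E, D = build_vectors(experimental, database, tol)
    dot = sum(e * d for e, d in zip(E, D))
    norm = math.sqrt(sum(e * e for e in E)) * math.sqrt(sum(d * d for d in D))
    return 100.0 * dot / norm if norm else 0.0


# Two hypothetical spectra that share no peaks score 0.
print(fragmentation_score([(100.0, 10.0)], [(200.0, 10.0)]))
```

Unmatched peaks contribute nothing to the numerator but still increase the length of their own vector, which is how missing or unexpected peaks reduce the score.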

Example

To illustrate this method, suppose we have the following experimental and database spectra:

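Example experimental spectrum: peaks (m/z, intensity) at (40, 40), (100, 60), (300, 80) and (440, 20).

Example database spectrum: peaks (m/z, intensity) at (40, 30), (200, 10), (300, 90), (380, 10) and (440, 20).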

In this case, the two vectors produced are as follows (where $W(m,i) = m^2 \sqrt{i}$ is the weighted intensity function):

${\bf E} = \begin{bmatrix} W(40,40)\\ W(100,60)\\ 0\\ W(300,80)\\ 0\\ W(440,20) \end{bmatrix}, \quad {\bf D} = \begin{bmatrix} W(40,30)\\ 0\\ W(200,10)\\ W(300,90)\\ W(380,10)\\ W(440,20) \end{bmatrix}$

The similarity metric is then:

$\begin{align*} \mathrm{similarity} &= \frac{[W(40,40) \times W(40,30)] + [W(300,80) \times W(300,90)] + [W(440,20) \times W(440,20)]}{\sqrt{W(40,40)^2 + W(100,60)^2 + W(300,80)^2 + W(440,20)^2} \times \sqrt{W(40,30)^2 + W(200,10)^2 + W(300,90)^2 + W(380,10)^2 + W(440,20)^2}}\\ &\approx 0.93 \end{align*}$

So these two spectra will be given a fragmentation score of approximately 93. They are fairly well matched, but a few peaks are either not matched or not expected to be present, which lowers the score.
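
The arithmetic in this example can be checked with a short Python script. It is a sketch that simply evaluates the two vectors given above; the variable names are chosen for illustration only.

```python
import math


def W(m: float, i: float) -> float:
    """Weighted intensity function: m^2 * sqrt(i)."""
    return m ** 2 * math.sqrt(i)


# The vectors E and D from the example, with 0 for unmatched peaks.
E = [W(40, 40), W(100, 60), 0.0, W(300, 80), 0.0, W(440, 20)]
D = [W(40, 30), 0.0, W(200, 10), W(300, 90), W(380, 10), W(440, 20)]

dot = sum(e * d for e, d in zip(E, D))
norm_E = math.sqrt(sum(e * e for e in E))
norm_D = math.sqrt(sum(d * d for d in D))

similarity = dot / (norm_E * norm_D)
score = 100 * similarity
print(f"similarity = {similarity:.2f}, score = {score:.0f}")
# prints: similarity = 0.93, score = 93
```

Only the three positions where both vectors are non-zero (m/z 40, 300 and 440) contribute to the dot product, while all peaks contribute to the vector lengths in the denominator.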