 Progenesis QI

The next generation in LC-MS data analysis software.
Discover the significantly changing compounds in your samples.

How does database fragmentation scoring work?

To score database fragmentation matches, we use an algorithm based on the widely adopted cosine similarity method. MassBank, for example, implements a similar method.

Cosine similarity method

The dot product of two 2-dimensional vectors, ${\bf x} = x_1 {\bf i} + x_2 {\bf j}$ and ${\bf y} = y_1 {\bf i} + y_2 {\bf j}$, is:

$${\bf x} \cdot {\bf y} = x_1 y_1 + x_2 y_2$$

It can also be expressed as:

$${\bf x} \cdot {\bf y} = |{\bf x}| \, |{\bf y}| \cos\theta$$

where $\theta$ is the angle between the two vectors, and $|{\bf x}| = \sqrt{x_1^2 + x_2^2}$.

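By equating these two formulae, the "similarity" between the two vectors is given by the cosine of the angle between them, which has the nice property that it ranges from 0 to 1 when all coefficients are non-negative:

$$\cos\theta = \frac{{\bf x} \cdot {\bf y}}{|{\bf x}| \, |{\bf y}|} = \frac{x_1 y_1 + x_2 y_2}{\sqrt{x_1^2 + x_2^2} \, \sqrt{y_1^2 + y_2^2}}$$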

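This method can also be extended to n-dimensional vectors:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$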

A similarity of 1 means the two vectors point in the same direction (they are identical up to a positive scale factor), and a similarity of 0 means they are orthogonal, with no components in common.
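As a minimal sketch, the n-dimensional cosine similarity could be written in Python as below. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made for illustration, not part of the method as described.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Return cos(theta) between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)


# Parallel vectors give a value of (approximately) 1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors give 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because of floating-point rounding, the parallel case may differ from 1 in the last few digits.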

Cosine similarity method applied to MS/MS scoring

We apply this method to the scoring of MS/MS database matches as follows.

We create two vectors ${\bf E}$ and ${\bf D}$, where each element of the vector is a weighted peak intensity given by:

$$W(m, i) = m^2 \sqrt{i}$$

where $m$ is the m/z of the peak and $i$ is its intensity.
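As a small Python sketch, the weighting $W(m, i) = m^2 \sqrt{i}$ quoted in the example section can be written as a helper function. The name `weighted_intensity` is hypothetical, and reading $m$ as the peak's m/z and $i$ as its intensity is an assumption.

```python
import math


def weighted_intensity(mz: float, intensity: float) -> float:
    """W(m, i) = m^2 * sqrt(i); mz and intensity are assumed non-negative."""
    return mz ** 2 * math.sqrt(intensity)


print(weighted_intensity(100.0, 4.0))  # 100^2 * sqrt(4) = 20000.0
```

The $m^2$ factor gives more weight to peaks at higher m/z, while the square root dampens the influence of very intense peaks.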

We combine the m/z values of all peaks from the experimental and database spectra, and go through them in ascending m/z order. For each m/z, there are three possibilities:

1. There is an experimental peak at the given m/z, but no matching database peak.
2. There is a database peak at the given m/z, but no matching experimental peak.
3. There is an experimental peak at the given m/z, and a database peak at the same m/z (to within a threshold).

For each of these scenarios, we add elements to the vectors ${\bf E}$ and ${\bf D}$ as follows:

1. We add the weighted experimental peak intensity to ${\bf E}$ and a 0 to ${\bf D}$.
2. We add a 0 to ${\bf E}$ and the weighted database peak intensity to ${\bf D}$.
3. We add the weighted experimental peak intensity to ${\bf E}$ and the weighted database peak intensity to ${\bf D}$.
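The three cases above can be sketched as a merge over two m/z-sorted peak lists. This is an illustration under stated assumptions, not the actual implementation: the function names, the absolute `tolerance` of 0.01 used as the matching threshold, the greedy one-to-one matching, and the peak values in the usage example are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def weighted_intensity(mz: float, intensity: float) -> float:
    """W(m, i) = m^2 * sqrt(i)."""
    return mz ** 2 * math.sqrt(intensity)


def build_vectors(
    experimental: Sequence[Peak],
    database: Sequence[Peak],
    tolerance: float = 0.01,  # assumed matching threshold, in m/z units
) -> Tuple[List[float], List[float]]:
    """Build the vectors E and D by walking both spectra in ascending m/z."""
    exp, db = sorted(experimental), sorted(database)
    E: List[float] = []
    D: List[float] = []
    i = j = 0
    while i < len(exp) and j < len(db):
        (e_mz, e_int), (d_mz, d_int) = exp[i], db[j]
        if abs(e_mz - d_mz) <= tolerance:
            # Case 3: peaks match, so both vectors get a weighted intensity.
            E.append(weighted_intensity(e_mz, e_int))
            D.append(weighted_intensity(d_mz, d_int))
            i += 1
            j += 1
        elif e_mz < d_mz:
            # Case 1: experimental peak with no database match.
            E.append(weighted_intensity(e_mz, e_int))
            D.append(0.0)
            i += 1
        else:
            # Case 2: database peak with no experimental match.
            E.append(0.0)
            D.append(weighted_intensity(d_mz, d_int))
            j += 1
    # Any peaks left over in either spectrum are unmatched.
    for e_mz, e_int in exp[i:]:
        E.append(weighted_intensity(e_mz, e_int))
        D.append(0.0)
    for d_mz, d_int in db[j:]:
        E.append(0.0)
        D.append(weighted_intensity(d_mz, d_int))
    return E, D


# Hypothetical peaks, chosen only to exercise all three cases.
E, D = build_vectors([(100.0, 4.0), (150.0, 9.0)], [(100.0, 4.0), (200.0, 1.0)])
print(E)  # [20000.0, 67500.0, 0.0]
print(D)  # [20000.0, 0.0, 40000.0]
```

Both vectors always have the same length, one element per distinct (merged) m/z, which is what the similarity calculation requires.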

Finally, we calculate the similarity metric on ${\bf E}$ and ${\bf D}$ as defined above. To obtain a score between 0 and 100, we multiply this result by 100.
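Written out in full, and taking "the similarity metric as defined above" to mean the n-dimensional cosine similarity, the score would be as follows (the index $k$ over the elements of ${\bf E}$ and ${\bf D}$ is notation introduced here for illustration):

$$\text{score} = 100 \times \frac{\sum_{k} E_k D_k}{\sqrt{\sum_{k} E_k^2} \, \sqrt{\sum_{k} D_k^2}}$$

Unmatched peaks contribute nothing to the numerator, since one of $E_k$ or $D_k$ is 0, but they still enlarge the denominator, which is why they lower the score.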

Example

To illustrate this method, suppose we have the following experimental and database spectra:

In this case, the two vectors produced are as follows (where $W(m,i) = m^2 \sqrt{i}$ is the weighted intensity function):

The similarity metric is then:

So these two spectra are given a fragmentation score of ~93: they are fairly well matched, but a few peaks are either unmatched or not expected to be present, which lowers the score.