Progenesis SameSpots


Normalisation

Normalisation is required in proteomics experiments to calibrate data between different sample runs. This corrects for factors that result in experimental variation when running 2D. Such factors can range from sample quantity to scanner settings to labelling of samples. The effect of these systematic factors can be modelled by a unique gain factor for each sample. The gain factor is represented by a scalar multiple that is applied to each feature abundance measurement. The aim of normalisation is to determine this gain factor for each sample.

This can be modelled as follows:

y'_i = α_k · y_i

where y_i is the measured abundance of feature i on sample k, 1/α_k is the gain factor for sample k, and y'_i is the normalised abundance of feature i on sample k.
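As a small worked illustration of this model, the sketch below applies a single scalar to every feature abundance of one sample. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def apply_gain_correction(abundances, alpha_k):
    """Apply y'_i = alpha_k * y_i to every feature abundance of one sample."""
    return [alpha_k * y for y in abundances]


# Hypothetical sample whose measurements are inflated by a gain factor of 2,
# so alpha_k = 1 / 2.
measured = [200.0, 50.0, 8.0]
normalised = apply_gain_correction(measured, alpha_k=0.5)
print(normalised)
```

Every feature on the sample is scaled by the same value, which is why a single number per sample is enough to describe the correction.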

The usual approach to normalisation is to fix the data of one sample and then calibrate all other sample data to this reference. We can think of this reference sample as having a gain factor of 1. To implement normalisation, we therefore need to choose a reference and then calculate gain factors for all other samples. For DiGE and some other multiplex approaches, the normalisation reference is fixed within each multiplex, e.g. for DiGE it is the internal standard, usually the Cy2 image. For Single Stain, we need to choose a reference sample for normalisation.

The improvement to normalisation in Progenesis SameSpots v3.0 has been in the calculation of the gain factor. In previous versions, the assumption was that total spot volume should be equal across all samples. This assumes that a large majority of proteins are unchanging with respect to the experimental conditions under investigation. Also, this assumes that protein up-regulations and down-regulations would balance out. It should be pointed out that the Total Spot Volume approach has been accepted as a standard normalisation technique for many years. However, in light of the benefits to analysis due to the SameSpots paradigm, we felt it was possible to improve on normalisation. The result is a more robust algorithm that can be successfully applied under weaker assumptions of feature abundance differences between samples.
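To make the older assumption concrete, the following is a minimal sketch of normalising by total spot volume. It is not taken from the software. It assumes each sample is a plain list of spot volumes and that every sample is scaled so that its total matches the total of a chosen reference. The function name and data are hypothetical.

```python
def total_volume_scale_factors(samples, reference_index=0):
    """Return one multiplicative factor per sample so that every sample's
    total spot volume equals that of the reference sample."""
    totals = [sum(sample) for sample in samples]
    reference_total = totals[reference_index]
    return [reference_total / total for total in totals]


samples = [
    [100.0, 50.0, 50.0],    # reference, total 200
    [200.0, 100.0, 100.0],  # same pattern at twice the gain, total 400
    [100.0, 50.0, 250.0],   # one strongly up-regulated spot, total 400
]
factors = total_volume_scale_factors(samples)
print(factors)
```

The third sample receives the same factor as the second even though only one of its spots differs from the reference. This is the kind of case where the balance assumption fails and the unchanged spots end up wrongly scaled down.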

As SameSpots reduces feature abundance variability and results in a complete data set, we can confidently apply this new normalisation technique even where the abundance of many features is changing between conditions. It also applies where 'sparse' images (those with a small population of expressed features) are being compared with images containing a much greater number of features.

Of course, for any proteomics data set the base assumption still remains that enough features should NOT be changing in abundance to allow the use of these features to normalise between images.

As stated earlier, normalisation requires that we determine gain factors for each sample apart from the reference sample whose gain factor is set to 1. For DiGE data sets, the internal standard is chosen as the reference. For Single Stain data sets the sample that gives the most stable normalisation for all other samples is chosen as the reference. In other words, the reference is that sample that is least different from all other samples.
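The exact stability criterion used to pick the Single Stain reference is not described here. As a rough sketch only, one way to read "least different from all other samples" is to score each candidate by how widely the Log abundance ratios of the other samples scatter around it, and to pick the lowest score. The scatter measure (mean absolute deviation from the median), the use of natural logarithms, and all names below are assumptions made for illustration.

```python
import math
import statistics


def log_ratios(sample, reference):
    """Log abundance ratios for features that are positive in both samples."""
    return [math.log(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0]


def scatter(values):
    """Mean absolute deviation from the median (an assumed scatter measure)."""
    centre = statistics.median(values)
    return sum(abs(v - centre) for v in values) / len(values)


def choose_reference(samples):
    """Return the index of the sample with the smallest summed scatter of
    Log ratios against every other sample."""
    scores = []
    for ref_index, reference in enumerate(samples):
        score = sum(
            scatter(log_ratios(sample, reference))
            for index, sample in enumerate(samples)
            if index != ref_index
        )
        scores.append(score)
    return scores.index(min(scores))


samples = [
    [200.0, 100.0, 100.0, 100.0, 100.0],  # first feature up-regulated
    [100.0, 100.0, 100.0, 100.0, 100.0],  # sits between the other two
    [100.0, 100.0, 100.0, 100.0, 50.0],   # last feature down-regulated
]
reference_index = choose_reference(samples)
print(reference_index)
```

In this toy data the middle sample differs from each of the others in one feature, whereas the outer two differ from each other in two, so the middle sample is selected.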

Then, for a given sample, normalisation proceeds as follows. For each feature, we calculate the background-corrected abundance and determine an abundance ratio by dividing the sample abundance by the reference abundance. The assumption that many features are unchanging implies that a scatter plot of the ratios should be distributed about 1, i.e. where abundances are equal the ratio is equal to 1. However, due to the aforementioned gain factor, the values are generally distributed about a value β different to 1. By calculating this value β we can determine the gain factor and normalise the sample values.

This approach to normalisation can be made more robust by working in Log space, i.e. we look at the distribution of the Log of the abundance ratios [L(AR)'s]. This is advantageous because outlier ratios have a reduced effect, and up-regulations can be treated in an equivalent way to down-regulations when determining the gain factor in Log space.

Visually, the effect of normalisation is also more obvious in Log space, where the gain factor corresponds to a shift in the mean L(AR) to zero. So, to calculate the gain factor, we need to determine the constant that must be added to the mean value of the L(AR)'s to shift this mean to zero. To reduce the influence of outliers, the mean value L(AR) is calculated using a recursive median approach.
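The calculation described above can be sketched as follows. The details of the recursive median approach are not given here, so the code substitutes a simple stand-in: it repeatedly takes the median and discards values lying more than k median absolute deviations away until nothing more is removed. That trimming rule, the cut-off k, the natural logarithm, and the function names are all assumptions for illustration and should not be read as the algorithm used by the software.

```python
import math
import statistics


def robust_log_centre(log_ratios, k=3.0, max_iter=20):
    """Iteratively trimmed median of the Log abundance ratios.
    Illustrative stand-in for the recursive median approach."""
    kept = list(log_ratios)
    for _ in range(max_iter):
        centre = statistics.median(kept)
        mad = statistics.median([abs(v - centre) for v in kept])
        if mad == 0:
            break  # more than half the values already coincide with the centre
        trimmed = [v for v in kept if abs(v - centre) <= k * mad]
        if len(trimmed) == len(kept):
            break  # nothing left to discard
        kept = trimmed
    return statistics.median(kept)


def normalise_to_reference(sample, reference):
    """Return (alpha, normalised abundances) for one sample."""
    lars = [math.log(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0]
    centre = robust_log_centre(lars)  # robust mean L(AR), i.e. Log of beta
    alpha = math.exp(-centre)         # adding -centre in Log space
    return alpha, [alpha * s for s in sample]


# Hypothetical data: the sample runs at twice the gain of the reference, and
# its last feature is additionally up-regulated fivefold.
reference = [100.0, 50.0, 20.0, 10.0, 40.0]
sample = [200.0, 100.0, 40.0, 20.0, 400.0]
alpha, normalised = normalise_to_reference(sample, reference)
print(alpha, normalised)
```

Here the robust centre ignores the single up-regulated feature, so the recovered scalar is close to 1/2 and the four unchanged features are brought back to the reference values, while the fifth keeps its fivefold difference.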

As a practical example, Figure 1 shows a scatter plot of abundance ratios in Log space. The dotted blue line at zero indicates the value about which we expect the ratios to be distributed. The red line indicates the robust mean value about which the ratios are actually distributed. In order to normalise, we shift all ratios by a constant so that they are distributed about zero, as shown in Figure 2. This is equivalent to multiplying by a constant factor in the original (i.e. non-Log) space.

Figure 1. A scatter plot of Log(Abundance Ratio) indicating the required shift to normalise the sample data

Figure 2. A scatter plot of Log(Abundance Ratio) after normalisation

For DiGE data, normalisation proceeds in the same general way as described above: a gain factor is determined for each sample. There are two differences. The internal standard is chosen as the reference for each multiplex, and the normalised values are presented as ratios (ratiometric normalisation).
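A minimal sketch of what ratiometric output could look like for one multiplex is shown below. It assumes strictly positive abundances, uses a plain median of the sample-to-standard ratios as a simple stand-in for the robust centre, and uses hypothetical channel names and data. It is not the software's implementation.

```python
import statistics


def ratiometric_normalise(sample, internal_standard):
    """Return per-feature sample/standard ratios, rescaled so that their
    median is 1. Assumes all abundances are strictly positive."""
    ratios = [s / r for s, r in zip(sample, internal_standard)]
    beta = statistics.median(ratios)  # assumed estimate of the ratio centre
    return [ratio / beta for ratio in ratios]


# Hypothetical multiplex: a Cy2 internal standard and one sample channel.
cy2_standard = [100.0, 50.0, 20.0, 10.0, 40.0]
sample_channel = [150.0, 75.0, 30.0, 15.0, 240.0]
normalised_ratios = ratiometric_normalise(sample_channel, cy2_standard)
print(normalised_ratios)
```

Unchanged features come out with a ratio of 1 relative to the internal standard, and the last feature shows a fourfold increase.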