Saturday, March 23, 2013

Step 3. SumMatrix is defined

Is there any meaningful use here?

Previously: Step 1. Digitizing
Previously: Step 2. Visualization

Each element in the matrix represents a sequence of 8 nucleotides.

Thus the point with coordinates (X,Y) = Hex (80,7F) = Quaternary (2000, 1333) represents the sequence ATTTCGGG.
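The decoding of a point into its sequence can be sketched in a few lines of Python. This is only an illustration: the digit-to-nucleotide table (0=T, 1=C, 2=A, 3=G) is inferred from the single example above, and the function names are hypothetical.

```python
# Assumed mapping, inferred only from the example (2000, 1333) -> ATTTCGGG.
DIGIT_TO_BASE = {0: "T", 1: "C", 2: "A", 3: "G"}


def byte_to_bases(value):
    """Decode one coordinate (0..255) into 4 nucleotides, high digit first."""
    if not 0 <= value <= 0xFF:
        raise ValueError("coordinate must be in the range 0..255")
    # Each quaternary digit occupies 2 bits of the byte.
    return "".join(DIGIT_TO_BASE[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def point_to_sequence(x, y):
    """X gives the first 4 nucleotides, Y gives the last 4."""
    return byte_to_bases(x) + byte_to_bases(y)


print(point_to_sequence(0x80, 0x7F))  # ATTTCGGG
```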



Each dot on the matrix grid represents the sum of all occurrences of its sequence in the genome.

That is why I named it SumMatrix.

A SumMatrix is a statistical signature of a genome.
A SumMatrix presents the frequency distribution of 8-nucleotide sequences in a genome.
The bigger the value in a cell of the matrix, the more often that combination occurs in the genome, and the darker the cell looks in the SumMatrix presentation.
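A minimal sketch of how such a matrix could be filled is shown here. Several things are assumptions and not taken from the post: the window slides one nucleotide at a time (overlapping windows), windows containing anything other than A, C, G, T are skipped, the digit values are inferred from the ATTTCGGG example, and the names `encode_quad` and `build_sum_matrix` are hypothetical.

```python
# Assumed digit values, inferred from the example ATTTCGGG = (2000, 1333).
BASE_TO_DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}


def encode_quad(bases):
    """Pack 4 nucleotides into one coordinate 0..255, first base in the high digit."""
    value = 0
    for base in bases:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value


def build_sum_matrix(genome):
    """Count every 8-nucleotide window into a 256 x 256 grid, indexed matrix[y][x]."""
    matrix = [[0] * 256 for _ in range(256)]
    genome = genome.upper()
    for start in range(len(genome) - 7):
        window = genome[start:start + 8]
        # Assumption: skip windows with unknown symbols such as N.
        if any(base not in BASE_TO_DIGIT for base in window):
            continue
        x = encode_quad(window[:4])
        y = encode_quad(window[4:])
        matrix[y][x] += 1
    return matrix


matrix = build_sum_matrix("ATTTCGGGATTTCGGG")
print(matrix[0x7F][0x80])  # 2
```

If non-overlapping windows were intended, only the step of the `range` would change.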


But the visual patterns gave me the feeling that the less frequent sequences must be the more interesting ones.
Some combinations are rare, some of them are extremely rare, and some, probably, do not exist at all.
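One possible way to list such combinations is sketched here. The threshold for "rare" is an arbitrary parameter, the overlapping-window counting is an assumption, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product


def count_sequences(genome, length=8):
    """Count every overlapping window of the given length (assumed step of 1)."""
    return Counter(genome[i:i + length] for i in range(len(genome) - length + 1))


def rare_and_absent(counts, threshold, length=8):
    """Split all possible sequences into rare (1..threshold hits) and absent (0 hits)."""
    rare, absent = [], []
    for letters in product("ACGT", repeat=length):
        sequence = "".join(letters)
        hits = counts.get(sequence, 0)
        if hits == 0:
            absent.append(sequence)
        elif hits <= threshold:
            rare.append(sequence)
    return rare, absent
```

For a real chromosome, `rare_and_absent(count_sequences(genome), threshold=5)` would walk over all 65536 possible 8-letter combinations; the value 5 is only a placeholder.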


What are these rare combinations?
Result of mutations for better organization?
Are they rarely occurring markers that separate important chains of nucleotides?
Or, maybe, they are simply errors in genome sequencing?
Or they are something else?
Or they are all of the above?


Investigation is required: lots of digital analysis, and new ideas are needed as well.


SumMatrix normalization

Each cell of a SumMatrix is a counter of the corresponding 8-nucleotide sequence in a genome.

Absolute numbers are not very suitable for statistical analysis: different chromosomes of different species contain very different numbers of nucleotides. So the absolute counters of a sequence, say, 7F7F are very different for different genomes.


For comparison and other digital processing I need relative numbers that do not depend on the length of a genome. After some experiments I decided to use a "median" approach: I sort all the cell values and then divide them into groups of the same size. Then I use the group number instead of the absolute value of a cell.


I call such a simplified matrix a "normalized SumMatrix".

I decided that the number of groups (and thus the number of distinct values in a normalized SumMatrix) would be a power of 2:
Level 1 SumMatrix contains values 0 and 1 only.
Level 2 SumMatrix contains values 0 through 3.
. . .
Level 8 SumMatrix contains values 0 through 255.
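The normalization described above could be sketched as follows. This is not necessarily the exact procedure used for the pictures in this post: the treatment of equal values (they all get the group of their first position in the sorted list, so groups are only approximately the same size) is an assumption, and `normalize_sum_matrix` is a hypothetical name.

```python
from bisect import bisect_left


def normalize_sum_matrix(matrix, level=8):
    """Replace each cell by the number of its group, 0 .. 2**level - 1."""
    if not 1 <= level <= 8:
        raise ValueError("level must be between 1 and 8")
    groups = 2 ** level
    # Sort all cell values once; a value's rank decides its group.
    ordered = sorted(value for row in matrix for value in row)
    total = len(ordered)

    def group_of(value):
        # Assumption: ties share the rank of their first occurrence.
        rank = bisect_left(ordered, value)
        return rank * groups // total

    return [[group_of(value) for value in row] for row in matrix]


print(normalize_sum_matrix([[10, 20], [30, 40]], level=1))  # [[0, 0], [1, 1]]
```

Because only the order of the values matters, two genomes of very different lengths produce matrices on the same scale.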


The higher the level of a normalized SumMatrix, the more accurately it represents a genome.
I decided that a maximum of 8 levels is enough, simply because in this case a value fits in one byte and I can easily present it as a shade of gray.
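As an illustration of the shade-of-gray idea, here is a small sketch that turns a normalized SumMatrix into a plain-text PGM image. The choice of the PGM format and the function names are assumptions made for the example; the only thing taken from the post is that larger values look darker.

```python
def to_gray(value, level):
    """Map a group number 0 .. 2**level - 1 to a gray shade; higher value = darker."""
    max_value = 2 ** level - 1
    return 255 - value * 255 // max_value


def to_pgm(normalized, level):
    """Render a normalized matrix as an ASCII PGM (P2) image, returned as a string."""
    height = len(normalized)
    width = len(normalized[0])
    lines = ["P2", f"{width} {height}", "255"]
    for row in normalized:
        lines.append(" ".join(str(to_gray(value, level)) for value in row))
    return "\n".join(lines) + "\n"
```

At level 8 the group number is used directly as the 8-bit intensity, just inverted; lower levels are stretched to the same 0 to 255 range.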


Here is a visual comparison of "full" and "normalized" SumMatrix:


6400K fragment of "full" SumMatrix
Normalized SumMatrix level 8

A normalized SumMatrix is definitely less detailed but should be much better for digital processing.

To be continued...
