Sunday, March 24, 2013

Step 4. Video pause




Previously: Step 1. Digitizing
Previously: Step 2. Visualization
Previously: Step 3. SumMatrix is defined


I believe that SumMatrix is the first step toward understanding the #Genome as a #Program.
A SumMatrix is a statistical signature of a genome.

"The Matrix" was a fancy fantasy.
I discovered that all living creatures, from viruses to Homo Sapiens, are based on a matrix with recognizable patterns.

I think that the digital era of genomics starts here.
You can be among the very first witnesses of it.

Watch the SumMatrix for different chromosomes of different species.
Notice that Human is statistically closer to Influenza than to Honey Bee.




Saturday, March 23, 2013

Step 3. SumMatrix is defined

Is there any meaningful use here?

Previously: Step 1. Digitizing
Previously: Step 2. Visualization

Each element in the matrix represents a sequence of 8 nucleotides.

Thus the point with coordinates (X,Y) = Hex (80,7F) = Quaternary (2000,1333) represents the sequence ATTTCGGG.
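The mapping from a matrix cell back to its sequence can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes the T=0, C=1, A=2, G=3 assignment from Step 1, with the first nucleotide stored in the top two bits of each byte.

```python
LETTER = "TCAG"  # digit 0..3 -> nucleotide, per the T=0, C=1, A=2, G=3 assignment


def byte_to_nucleotides(value: int) -> str:
    """Decode one byte into four nucleotides, top two bits first."""
    return "".join(LETTER[(value >> shift) & 3] for shift in (6, 4, 2, 0))


def cell_to_sequence(x: int, y: int) -> str:
    """Sequence of 8 nucleotides represented by matrix cell (x, y)."""
    return byte_to_nucleotides(x) + byte_to_nucleotides(y)


print(cell_to_sequence(0x80, 0x7F))  # ATTTCGGG
```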



Each dot on the matrix grid represents the total number of occurrences of its sequence in the genome.

That is why I named it
SumMatrix

A SumMatrix is a statistical signature of a genome.
A SumMatrix presents the frequency distribution of 8-letter sequences (8-mers) in a genome.
The bigger the value in a cell of the matrix, the more often that combination occurs in the genome, and the darker the cell looks in the SumMatrix presentation.


But the visual patterns gave me a feeling that the less frequent sequences must be more interesting.
Some combinations are rare, some of them are extremely rare, and some, probably, do not exist at all.
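Finding the rare and the missing combinations is easy once the counters exist. The sketch below is not my actual program: the function names are invented for this example, the toy matrix is made-up data, and it assumes the counters are kept as matrix[y][x] with the T=0, C=1, A=2, G=3 assignment.

```python
LETTER = "TCAG"  # digit 0..3 -> nucleotide (T=0, C=1, A=2, G=3)


def cell_to_sequence(x: int, y: int) -> str:
    """8-nucleotide sequence for cell (x, y): four letters from X, then four from Y."""
    digits = [(v >> shift) & 3 for v in (x, y) for shift in (6, 4, 2, 0)]
    return "".join(LETTER[d] for d in digits)


def absent_sequences(matrix):
    """Sequences whose counter is zero. Assumes matrix[y][x] holds the count."""
    return [cell_to_sequence(x, y)
            for y in range(256) for x in range(256)
            if matrix[y][x] == 0]


def rarest(matrix, n: int = 5):
    """The n smallest counters as (count, sequence) pairs."""
    cells = sorted((matrix[y][x], x, y) for y in range(256) for x in range(256))
    return [(count, cell_to_sequence(x, y)) for count, x, y in cells[:n]]


# Toy matrix, invented for illustration only: every counter is 10 except two.
toy = [[10] * 256 for _ in range(256)]
toy[0x7F][0x80] = 0   # ATTTCGGG never occurs
toy[0x00][0x00] = 1   # TTTTTTTT occurs once

print(absent_sequences(toy))
print(rarest(toy, 2))
```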


What are these rare combinations?
Are they the result of mutations toward better organization?
Are they rarely occurring markers that separate important chains of nucleotides?
Or, maybe, they are simply errors in genome sequencing?
Or they are something else?
Or they are all of the above?


Investigation is required: lots of digital analysis, and new ideas are needed as well.


SumMatrix normalization

Each cell of a SumMatrix is a counter of the corresponding 8-letter sequence in a genome.

Absolute numbers are not very suitable for statistical analysis: different chromosomes of different species contain very different numbers of nucleotides. So the absolute counters of a sequence, say, 7F7F are very different for different genomes.


For comparison and other digital processing I need relative numbers that do not depend on the length of a genome. After some experiments I decided to use a "median" approach: I sort all incoming values and then divide them into same-size groups. Then I use the group number instead of the absolute value of a cell.


I call such a simplified matrix a "normalized SumMatrix".

I decided that the number of groups (and thus the number of distinct values in the normalized SumMatrix) will be a power of 2:
Level 1 SumMatrix contains values 0 and 1 only.
Level 2 SumMatrix contains values 0 thru 3.
. . .
Level 8 SumMatrix contains values 0 thru 255


The higher the level of a normalized SumMatrix, the higher the accuracy of the genome presentation.
I decided that a maximum of 8 levels is enough, simply because in this case I can easily present a value as a shade of gray.
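Here is a minimal Python sketch of the normalization described above: sort all counters, cut the sorted list into 2^level same-size groups, and replace each counter with its group number. The names normalize and gray are invented for this example, the toy counts are made up, and the way ties are split (by position in the matrix) is an assumption, since equal counters have to land somewhere when the groups must be the same size.

```python
def normalize(matrix, level: int):
    """Replace each counter with its group number, 0 .. 2**level - 1."""
    groups = 2 ** level
    rows, cols = len(matrix), len(matrix[0])
    cells = [(matrix[r][c], r, c) for r in range(rows) for c in range(cols)]
    cells.sort()  # ascending by counter; ties are ordered by position (assumption)
    total = len(cells)
    result = [[0] * cols for _ in range(rows)]
    for rank, (_, r, c) in enumerate(cells):
        result[r][c] = rank * groups // total  # same-size groups by rank
    return result


def gray(value: int, level: int = 8) -> int:
    """Gray shade from 255 (white) to 0 (black); a bigger value is darker."""
    return 255 - value * 255 // (2 ** level - 1)


# Toy 4 x 4 counters, invented for illustration only.
toy = [
    [5, 900, 12, 0],
    [40, 3, 77, 1],
    [8, 250, 19, 2],
    [60, 30, 100, 500],
]
for row in normalize(toy, 2):
    print(row)
```

For a real 256 x 256 SumMatrix at level 8, this puts 65536 / 256 = 256 cells into each group, and each group number maps directly to one shade of gray.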


Here is a visual comparison of "full" and "normalized" SumMatrix:


6400K fragment of "full" SumMatrix
Normalized SumMatrix level 8


A normalized SumMatrix is definitely less detailed, but it should be much better suited to digital processing.

To be continued...

Friday, March 22, 2013

Step 2. Visualization

How to show 15 million numbers?

Previously: Step 1. Digitizing
Now I have my binary file chrY.fa.tcag.n-t.hex.
Some sort of presentation of this whole bunch of numbers would be very useful.
Do you have any idea? I did not.

So, I tried the following: take 2 consecutive bytes and pretend that they are X and Y coordinates on a surface. 
This way I would have a 256 x 256 square, with both coordinates running from 0 to 255 (00 to FF in hex).
I took the first 64K hex numbers (bytes) from the file and put a dot at the position defined by each pair of coordinates. The more dots are drawn at the same coordinates, the darker the spot looks.
64K hex numbers represent a quarter of a million nucleotides.
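The dot counting can be sketched in Python like this. It is only an outline, not my program: the names add_dots and new_grid are invented, and it assumes the pairs do not overlap (bytes 0-1, then 2-3, and so on). Passing step=1 would count overlapping pairs instead.

```python
def new_grid():
    """An empty 256 x 256 grid of dot counters, addressed as grid[y][x]."""
    return [[0] * 256 for _ in range(256)]


def add_dots(grid, data: bytes, step: int = 2) -> None:
    """Add one dot per byte pair: the first byte is X, the second byte is Y."""
    for i in range(0, len(data) - 1, step):
        x, y = data[i], data[i + 1]
        grid[y][x] += 1


CHUNK = 64 * 1024  # 64K bytes = 262,144 nucleotides per button click

grid = new_grid()
# The short cow sample from Step 1 stands in for a real genome file here.
data = bytes.fromhex("B322FA36BF804CCC")
for start in range(0, len(data), CHUNK):
    add_dots(grid, data[start:start + CHUNK])

print(grid[0x22][0xB3], grid[0x36][0xFA], grid[0x80][0xBF], grid[0xCC][0x4C])
```

To draw the picture, each counter then only needs to be turned into a shade: the bigger the counter, the darker the dot.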

Frankly, I expected to see some randomly scattered dots.


Here was the result:

OK. This was kind of expected. I clicked a button in my program a few more times, each time adding another 64K numbers.

256K: is there some pattern here, or is this just my imagination?

I started to add more and more data.
704K:
1344K:

1984K:

4416K:

Wow! This is incredible! This is the human matrix.
Have I discovered something amazing?
It even feels scary.
I stopped. I need to catch my breath. I need to comprehend the result somehow.
Human X chromosome

First I decided to compare the picture with other binary files. Maybe there is something I do not know about my newly created way of presenting a file. Maybe they all look the same.

Let's take, say, a zip file.
Looks pretty random to me. I had better try an exe or a dll. They are programs too, after all.
Looks cool, but very far from the human matrix.

I decided to try different species. 
Is this matrix uniquely human or what?

First I found a mouse.
No, we are not much different from a mouse.

What about, say, a lizard?
Wow! The lizard has the same pattern too.
I tried cow, dog, lancelet (primitive fish-like creature), blowfish, wild boar, and orangutan. 
They all have the same pattern!

I found a different pattern in worms:
which resembles the pattern of the mosquito:
and the honey bee:

I'm hooked. I want to try everything.

Bacteria. E. coli:


Influenza:
It looks closer to human than to bee or E. coli.

Let me try plants.
Glycine (the soybean genus):

I believe that you've got the idea. I have.

To be continued...

Thursday, March 21, 2013

Step 1: Digitizing.

The genome encoded in DNA is a program.

This is a well-known fact, at least to me.
But what is its instruction set?
First, I need to find out what digits are used for the instructions.

I know nothing about biology and chemistry.
I forgot most of my Math.

I just like to play with numbers.

That is why I was very intrigued when I found out that a genome contains only four nucleotides: A, C, G, T.
This is amazing! Not 3, not 7 or 10, or 255.
Just 4!

This is perfect for 4 digits in Quaternary (base-4) system.            


Note for the less digitally literate audience.
Widely used numeral systems are:
- Decimal (base-10) has 10 digits: 0,1,2,3,4,5,6,7,8,9
- Binary (base-2) has 2 digits: 0,1
- Octal (base-8) has 8 digits: 0,1,2,3,4,5,6,7
- Hexadecimal (base-16) has 16 digits:   0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F

So for the quaternary (base-4) system I just need to assign the digits 0 to 3 to the known nucleotides.
I decided that the most natural way of arranging the nucleotides is by their weight.

Thus, here is the assignment:

T - 0
C - 1
A - 2
G - 3
I can use 2 bits to encode one nucleotide.

Now sequence
AGTGTATAGGAATGCAAGGGATTTCTGTGTGT
I can write in Quaternary system as
23030202332203122333200010303030
or in Binary system
1011001100100010111110100011011010111111100000000100110011001100
or in Hexadecimal system
B322FA36BF804CCC
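The same conversion can be checked with a short Python sketch. The function names are invented for this example, and packing the first nucleotide into the top two bits of each byte is an assumption, though it is the one that reproduces the hexadecimal string above.

```python
# Digit assignment from this post: T=0, C=1, A=2, G=3 (the "tcag" order).
DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}


def to_quaternary(seq: str) -> str:
    """Write each nucleotide as one base-4 digit."""
    return "".join(str(DIGIT[n]) for n in seq)


def to_bytes(seq: str) -> bytes:
    """Pack four nucleotides into each byte, first nucleotide in the top two bits."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for n in seq[i:i + 4]:
            value = (value << 2) | DIGIT[n]
        out.append(value)
    return bytes(out)


seq = "AGTGTATAGGAATGCAAGGGATTTCTGTGTGT"
packed = to_bytes(seq)
print(to_quaternary(seq))                    # base-4 digits
print("".join(f"{b:08b}" for b in packed))   # binary, 8 bits per byte
print(packed.hex().upper())                  # hexadecimal
```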

Side note
By the way, this is the beginning sequence of the X chromosome of a cow
(taken from Bos_taurus.UMD3.1.68.dna.chromosome.X.fa marked as
   "X dna:chromosome chromosome:UMD3.1:X:1:148823899:1 REF" whatever it means)


I downloaded a few genomes in FASTA format and transformed them into HEX format.
As the sources I used the following FTP sites:

ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/currentGenomes/
ftp://ftp.ensembl.org/pub/mnt2/release-70/fasta/
ftp://ftp.ncbi.nlm.nih.gov/genomes/

Naturally, I started with the most intriguing one, the human Y chromosome. You can pick up the result of the transformation from this folder on my Google Drive.
The original text file chrY.fa was about 60MB.
The resulting binary file chrY.fa.tcag.n-t.hex is about 15MB.
Side note
"hex" in the file extension indicates that this is a binary file.
"n-t" indicates that I replaced every N (unknown) with T. I needed to make this replacement: there is no N nucleotide, after all.
"tcag" indicates how I translated symbols to digits, in this case T=0, C=1, A=2, G=3.

The resulting file is 4 times smaller than the original one.
This is because each symbol (one byte = 8 bits) in the original file now occupies 2 bits only,
and each byte of the result represents 4 original symbols.
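The whole transformation from FASTA text to the packed binary form can be outlined in Python as follows. This is a sketch under stated assumptions, not the converter I actually used: fasta_to_packed is an invented name, lower-case letters are treated the same as upper-case ones, and a trailing group of fewer than 4 nucleotides is padded with zeros (that is, with T). Any symbol other than T, C, A, G or N would raise a KeyError here.

```python
import io

DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}


def fasta_to_packed(lines) -> bytes:
    """Convert FASTA text lines to bytes holding 4 nucleotides each."""
    out = bytearray()
    value = 0
    count = 0
    for line in lines:
        if line.startswith(">"):           # FASTA header line, not sequence data
            continue
        for ch in line.strip().upper():    # assumption: case is ignored
            ch = "T" if ch == "N" else ch  # the "n-t" step: unknown N becomes T
            value = (value << 2) | DIGIT[ch]
            count += 1
            if count == 4:
                out.append(value)
                value, count = 0, 0
    if count:                              # assumption: pad the last byte with T (zeros)
        out.append(value << (2 * (4 - count)))
    return bytes(out)


# A tiny made-up FASTA text standing in for a real chromosome file.
sample = io.StringIO(">demo\nAGTGNNTA\nggaa\n")
packed = fasta_to_packed(sample)
print(packed.hex().upper(), len(packed))
```

Twelve letters of sequence come out as 3 bytes, which is the same 4-to-1 ratio as 60MB of text turning into 15MB of binary.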

Now that I have 15 million hexadecimal numbers, I can analyze them digitally. I have no idea what to do with them.

Now that I know the coding base of The Program, I can try to look for something interesting in these numbers.
Maybe there are some fundamental constants hidden there?
Say, π or e or the most important of all "The golden ratio φ"?
There is one problem: I do not know how to encode real (non-integer) numbers in hexadecimal format.
I'll think about this later...

Brief summary of the Step 1

Life has the quaternary (base-4) system as its fundamental base.
Let's not create some artificial "genome coding" with a bunch of silly symbols like S=CG or D=AGT.

I saw that there are dozens of formats for genome storage.
Let's store all genetic information in one unified format: the original format in which it was created. Look out of the window: it's the digital age out there!

There are pros and cons to using a binary format for genomic storage. A binary genomic format is precise. This is its strength, and this is its weakness.

Inside the file there is no place for comments, for unknown pieces, for lacunae, for results of explorations, for grouping, etc.
All this additional information must be stored separately, with indexes/pointers into the original binary file. Creators and lovers of new formats have a broad field of possibilities, as broad as their imagination.

But binary files have much more power for digital processing and analyzing. 

So,
Let's store all genetic information in one unified format - binary, as dictated by the fundamental base.

To be continued...