Thursday, March 21, 2013

Step 1: Digitizing.

The genome encoded in DNA is a program.

This is a well-known fact, at least to me.
But what is its instruction set? 
First, I need to find out what digits are used for the instructions.

I know nothing about biology and chemistry.
I forgot most of my Math.

I just like to play with numbers.

That is why I was very intrigued when I found out that a genome contains only four nucleotides: A, C, G, T.
This is amazing! Not 3, not 7 or 10, or 255.
Just 4!

This is a perfect fit for the 4 digits of the quaternary (base-4) system.


A note for the less digitally literate audience.
Widely used numeral systems are:
- Decimal (base-10) has 10 digits: 0,1,2,3,4,5,6,7,8,9
- Binary (base-2) has 2 digits: 0,1
- Octal (base-8) has 8 digits: 0,1,2,3,4,5,6,7
- Hexadecimal (base-16) has 16 digits:   0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F
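
To make the list above concrete, here is a minimal Python sketch that writes one and the same number in several bases, including base 4. The function name to_base and the sample number 179 are chosen only for illustration.

```python
def to_base(n, base):
    """Return the digits of a non-negative integer n in the given base (2..16)."""
    symbols = "0123456789ABCDEF"
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest digit
        digits.append(symbols[remainder])
    return "".join(reversed(digits))    # digits were collected lowest first


# The same value, 179, written in five different bases.
for base in (10, 2, 4, 8, 16):
    print(f"base-{base}: {to_base(179, base)}")
# base-10: 179
# base-2: 10110011
# base-4: 2303
# base-8: 263
# base-16: B3
```

Note that one hexadecimal digit corresponds to exactly two base-4 digits, and one base-4 digit to exactly two binary digits.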

So for the quaternary (base-4) system I just need to assign the digits 0 to 3 to the known nucleotides.
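I decided that the most natural way of arranging the nucleotides is by their weight: the lighter T and C first, then the heavier A and G. (Strictly by molecular weight, C is lighter than T.)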

Thus, here is the assignment:

T - 0
C - 1
A - 2
G - 3
I can use 2 bits to encode one nucleotide.

Now the sequence
AGTGTATAGGAATGCAAGGGATTTCTGTGTGT
can be written in the quaternary system as
23030202332203122333200010303030
or in the binary system as
1011001100100010111110100011011010111111100000000100110011001100
or in the hexadecimal system as
B322FA36BF804CCC
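
The three representations above can be reproduced with a short Python sketch. It uses the T=0, C=1, A=2, G=3 assignment from this post; the function names are made up for illustration, and the hexadecimal helper assumes an even number of nucleotides (two per hex digit).

```python
# Digit assignment from this post: T=0, C=1, A=2, G=3.
DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}


def to_quaternary(seq):
    """One base-4 digit per nucleotide."""
    return "".join(str(DIGIT[ch]) for ch in seq)


def to_binary(seq):
    """Two bits per nucleotide."""
    return "".join(format(DIGIT[ch], "02b") for ch in seq)


def to_hex(seq):
    """One hex digit per pair of nucleotides (length must be even)."""
    if len(seq) % 2:
        raise ValueError("need an even number of nucleotides")
    bits = to_binary(seq)
    return "".join(format(int(bits[i:i + 4], 2), "X")
                   for i in range(0, len(bits), 4))


seq = "AGTGTATAGGAATGCAAGGGATTTCTGTGTGT"
print(to_quaternary(seq))  # 23030202332203122333200010303030
print(to_binary(seq))
print(to_hex(seq))         # B322FA36BF804CCC
```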

Side note
By the way, this is the beginning of the X chromosome of a cow
(taken from Bos_taurus.UMD3.1.68.dna.chromosome.X.fa, marked as
   "X dna:chromosome chromosome:UMD3.1:X:1:148823899:1 REF", whatever that means)


I downloaded a few genomes in FASTA format and transformed them into HEX format.
As sources I used the following FTP sites:

ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/currentGenomes/
ftp://ftp.ensembl.org/pub/mnt2/release-70/fasta/
ftp://ftp.ncbi.nlm.nih.gov/genomes/

Naturally, I started with the most intriguing one: the human Y chromosome. You can pick up the result of the transformation from this folder on my Google Drive.
The original text file chrY.fa was about 60MB.
The resulting binary file chrY.fa.tcag.n-t.hex is about 15MB.
Side note
"hex" in the extension of the file name indicates that this is a binary file.
"n-t" indicates that I replaced every N (unknown) with T. I needed to make this replacement: there is no N nucleotide, after all.
"tcag" indicates how I translated symbols to digits, in this case T=0,C=1,A=2,G=3
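
As a sketch of how such a file could be read back, the following Python function turns packed bytes into nucleotide letters again. It assumes that the first nucleotide of each group of four sits in the highest two bits of the byte, which is consistent with the B322FA36BF804CCC example above; the function name unpack and its order parameter are hypothetical.

```python
def unpack(data, order="TCAG"):
    """Decode packed bytes back into nucleotide letters.

    `order` lists the letters for digits 0, 1, 2, 3, in the same
    spirit as the "tcag" part of the file name.
    Assumption: the first nucleotide occupies the two highest bits.
    """
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # highest bit pair first
            letters.append(order[(byte >> shift) & 0b11])
    return "".join(letters)


print(unpack(bytes.fromhex("B322FA36BF804CCC")))
# AGTGTATAGGAATGCAAGGGATTTCTGTGTGT
```

The N positions cannot be recovered this way: after the "n-t" replacement they are indistinguishable from real T.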

The resulting file is 4 times smaller than the original one.
This is because each symbol, which took one byte (8 bits) in the original file, now occupies only 2 bits,
so each byte of the result represents 4 original symbols.
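
The packing itself might look roughly like the Python sketch below. It is not the exact tool used for the files mentioned here, only an illustration under several assumptions: FASTA header lines start with ">", lowercase letters are treated like uppercase, N is replaced with T, and an incomplete last byte is padded with T (zero bits). Any other symbol raises a KeyError. The name pack_fasta is made up.

```python
# T=0, C=1, A=2, G=3; N is mapped to 0 as well (the "n-t" replacement).
DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 0}


def pack_fasta(lines):
    """Pack the nucleotides of FASTA text lines into bytes, 4 per byte."""
    out = bytearray()
    byte = 0
    count = 0
    for line in lines:
        if line.startswith(">"):         # header line, not sequence data
            continue
        for ch in line.strip().upper():  # assumption: ignore letter case
            byte = (byte << 2) | DIGIT[ch]
            count += 1
            if count % 4 == 0:           # four nucleotides fill one byte
                out.append(byte)
                byte = 0
    if count % 4:                        # pad the last byte with T (zeros)
        out.append(byte << (2 * (4 - count % 4)))
    return bytes(out)


fasta = [
    ">X dna:chromosome (shortened header, for illustration only)\n",
    "AGTGTATAGGAATGCA\n",
    "AGGGATTTCTGTGTGT\n",
]
packed = pack_fasta(fasta)
print(packed.hex().upper())  # B322FA36BF804CCC
print(len(packed))           # 8 bytes for 32 nucleotides
```

For a real file, an open text file object can be passed in place of the list, since iterating over it yields lines in the same way.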

Now that I have about 15 million bytes (each one a two-digit hexadecimal number), I can analyze them digitally. I have no idea what to do with them.

Now that I know the coding base of The Program, I can try to look for something interesting in these numbers.
Maybe there are some fundamental constants hidden there?
Say, π or e or, the most important of all, the golden ratio φ?
There is one problem: I do not know how to encode real (non-integer) numbers in hexadecimal format.
I'll think about this later...
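
One possible approach, sketched here in Python and by no means the only one, is to write the fractional part of a number digit by digit in base 16, the same way decimal fractions work in base 10: multiply by 16, take the integer part as the next digit, repeat. The function name is made up, and a Python float carries only about 13 meaningful hexadecimal digits after the point, so longer searches would need a higher-precision source of digits.

```python
import math
from fractions import Fraction


def fractional_hex_digits(x, count):
    """Return the first `count` hex digits after the point of x (x >= 0)."""
    frac = Fraction(x) % 1      # exact value of the float's fractional part
    digits = []
    for _ in range(count):
        frac *= 16
        digit = int(frac)       # integer part is the next hex digit
        digits.append("0123456789ABCDEF"[digit])
        frac -= digit
    return "".join(digits)


phi = (1 + math.sqrt(5)) / 2
print("pi  = 3." + fractional_hex_digits(math.pi, 8))  # pi  = 3.243F6A88
print("phi = 1." + fractional_hex_digits(phi, 8))      # phi = 1.9E3779B9
print("e   = 2." + fractional_hex_digits(math.e, 8))
```

A digit string like 243F6A88 could then be searched for in the packed genome bytes, keeping in mind that short patterns will turn up by pure chance in 15 million bytes.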

Brief summary of Step 1

Life has the quaternary (base-4) system as its fundamental base.
Let's not create some artificial "genome coding" with a bunch of silly symbols like S=CG or D=AGT.

I saw that there are dozens of formats for genome storage.
Let's store all genetic information in one unified format - the original format it was created in. Look out of the window: it's the digital age out there!

There are pros and cons to using a binary format for genomic storage. The binary genomic format is precise. This is its strength; this is also its weakness.

Inside the file there is no place for comments, for unknown pieces and lacunae, for results of explorations, for grouping, etc.
All this additional information must be stored separately, with indexes/pointers into the original binary file. Creators and lovers of new formats have a broad field of possibilities here, as broad as their imagination.
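
As one hypothetical example of such separately stored information, the Python sketch below records where the unknown (N) stretches were in the original sequence, so that this knowledge survives the N-to-T replacement. The function names and the (start, length) layout are assumptions made for illustration, not an existing format; locate assumes 4 nucleotides per byte with the first one in the highest two bits.

```python
import re


def unknown_runs(seq):
    """Return (start, length) for every run of N, positions counted from 0."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"N+", seq.upper())]


def locate(position):
    """Map a nucleotide position to (byte offset, bit shift) in the packed file."""
    return position // 4, 6 - 2 * (position % 4)


runs = unknown_runs("NNNACGTNNAC")
print(runs)       # [(0, 3), (7, 2)]
print(locate(7))  # (1, 0): second byte, lowest two bits
```

The list of runs could be saved next to the binary file in any convenient form; the binary file itself stays untouched.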

But binary files are far better suited to digital processing and analysis.

So,
Let's store all genetic information in one unified format - binary, as dictated by the fundamental base.

To be continued...
