SumMatrix of life: Genome as a program : 2013

Friday, May 31, 2013

Life before Earth

An interesting article I bumped into while writing "Genome as a program":
Life before Earth

By definition, a glossary is an alphabetical list of definitions.
Here I'd like to keep a list of genetic terms not in alphabetical order but in the order of my learning.
That is why I called it "Unglossy".

This post will be modified often.

DNA. Along with RNA and proteins, DNA is one of the three major macromolecules essential for all known forms of life. Genetic information is encoded as a sequence of nucleotides (guanine, adenine, thymine, and cytosine) recorded using the letters G, A, T, and C.

Digital Genetics (playlist). A 3-part presentation on "Genome as a program"

Monday, April 22, 2013

10. Digital Genetics is born

4/19/2013. Mediterranean sea, near coast of Spain.

"Digital Genetics" is a preliminary name for a new scientific discipline that would use Digital/Informational Technology methods and approaches, IT way of thinking, to explore genetic tasks from informational point of view.

The SumMatrix patterns discovered in genomes have yet to be explained. The biological sense of the patterns needs to be found. This may lead to a re-classification of living species by their patterns.

Genomic entities, like chromosome, gene, junk DNA, mRNA join, etc., need to be defined digitally. Tools for digital search of genes (at least) should be created.

SumMatrix analysis is needed for
- comparing genomes of different species
- comparing chromosomes of one genome
- comparing different iterations of a DNA/RNA sequencing
- adding / subtracting, say, a virus genome or multiple virus genomes to / from other genomes
- many more things I can't even imagine right now

The ultimate goal is to define the instruction set and architecture of the biological processor which we call "life".

* * *

The immediate things that could be done right now are

- creating tools for re-digitizing genetic databases into hex format
- creating tools for calculating a SumMatrix and normalizing it
- creating tools for comparing SumMatrices and presenting the differences
- creating tools for analysing a SumMatrix by patterns
- classifying SumMatrix patterns and trying to find how many patterns exist
- creating a centralized online resource for communication, coordination of actions, and presenting research results
- re-thinking the genetic code table of amino acid production, probably re-classifying the resulting proteomes
and many more...
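As a starting point for the comparison tool, here is a minimal sketch in Python. The post does not define how the difference between two SumMatrices should be measured, so the metric here (mean absolute difference between cells of two normalized matrices) is purely my assumption, and the function names are made up for illustration.

```python
def matrix_difference(a, b):
    """Cell-by-cell absolute difference of two normalized SumMatrices.

    Both matrices are given as flat lists of the same length
    (65536 cells for a full 256 x 256 matrix).
    """
    if len(a) != len(b):
        raise ValueError("matrices must have the same number of cells")
    return [abs(x - y) for x, y in zip(a, b)]


def distance(a, b):
    """Assumed metric: mean absolute difference per cell (0 = identical)."""
    diff = matrix_difference(a, b)
    return sum(diff) / len(diff)


# Toy 4-cell "matrices" just to show the calls.
same = distance([0, 1, 2, 3], [0, 1, 2, 3])
far = distance([0, 0], [1, 3])
```

The difference list itself can be drawn as a picture in the same way as a SumMatrix, which would cover the "presenting the differences" part.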

From us, software/IT professionals, to them, geneticists and biologists:

Let us help you to help all of us.

April 22, 2013. Barcelona

Saturday, April 20, 2013

9. This reminds me of something

8. How to explore an unknown program for an unknown operating system running an unknown instruction set on an unknown PC

4/14/2013 mid-Atlantic

I am trying to find an analogy for exploring a genome as a program.
Like any other analogy, this one is as good as it is bad.

Imagine a highly advanced non-biological alien race, say a "Crystal entity", or The Crystallites, that came to Earth unnoticed to explore local life.

They discovered that life on Earth appeared approximately 2.2e21 picoseconds ago (about 70 of our years), though paleontological researchers discovered very early signs of life as far back as 4.7e21 ps ago. The Crystallite scientists still argue whether this mechanical garbage (known to us as Babbage garbage) can be considered life, especially because it never really came to life.

The evolution of this life came a long way, starting from elementary monster-like electronic-bulb species with just a few codes in their instruction set and arriving at the sophisticated all-in-one creatures that formed the current biosphere on Earth (the internet, in our terms).

Despite all the wide differences in size, shape, color, architecture, and technology, they had one thing in common: all signals, commands, and whatever else were organized in bytes of 8 bits. Though there were some died-out branches of the evolution that used 3-state bits or 6-bit bytes, they ... died out in the fight for survival. Thus the 8-bit byte is the minimal unit of any physical signal, command, or data. Sure, a smaller portion of a byte can have some meaning, say a half-byte (hex) or even 1 bit (binary), but the minimal stored/processed unit is still the byte. Consecutive bytes can be combined into larger but still elementary units of 2, 4, 8, even 16 bytes (what we call "a word"). The evolution of life knows attempts to extend a word with an extra byte (a tag for describing the word and for error tolerance) or even 1-2 extra bits to distinguish commands from data and for error checking. But it looks like none of them survived this fight. It looks like life prefers to be more economical than error-free.

The Crystallites (TCs for short) started to explore different species (which we call mainframes, PCs, tablets, smart phones, printers, scanners, TVs, players, etc.), the whole zoo of life forms. They found out that every living organism contains big blobs of digits which they decided to call, say, chromosomes (and what we call programs). They assigned a number to each chromosome known to them and started to explore each of them separately.

First they sequenced as many chromosomes as possible to look at the hex content.

Then TCs discovered that every chromosome has parts common to different species, which they decided to call, say, chromosome regions (and what we could imagine as statically embedded DLLs). They discovered that each region contains coding fragments, each of which they decided to call, say, a gene (a function in our world).

TCs decided to find out what each gene is responsible for. So, they cut out a single gene and ran it. They noticed that if they start execution from a particular offset, in 85% of the experiments the speakers start to produce a sound of random frequency (they are not aware of parameters). They even found that the function has multiple entry points and multiple exit points. So they classified the gene as belonging to a sound phenotype and conveniently named it RFSP aka random-frequency-sound-producer. Tremendous work was done trying to find the functional responsibility of as many genes as possible.

Isn't it a silly way of exploring the unknown?
Probably not.
How else can you explore a black box you have no idea about?

So far, so good.

There is a catch. It is very easy to damage a gene so that it doesn't work properly.
But it is very hard to create a meaningful new gene or to modify an existing one to produce the result they need. Say, modify the RFSP gene so it would produce a 440 Hz sound. It is possible. They can try changing the value in each byte of the gene, eventually replacing a parameter with the constant 440. The gene does what they wanted. But the living species starts beeping instead of talking.

So, genetic modification is extremely dangerous and may lead to unpredictable results if their knowledge is based on experimental data only. Knowledge of the instruction set and the common structure (architecture) is required to get any meaningful result. Understanding in this area would open absolutely astonishing possibilities. But who knows when these breakthroughs may happen? You can spend many e20 ps and still not have any meaningful result.

So, The Crystallites decided to explore the damage to genes that happens, though rarely, when chromosome regions are copied from one species to another.
They noticed that if the gene CNSS aka create-new-SPREADSHEET loses one bit at position 284, only half of the expected spreadsheet appears on screen. They named the mutation 1-284-0, classified the genetic disorder as SPREADSHITTING DISEASE, put descriptions in all their databases, and continued to explore other damage.

Because The Crystallites do not recognize our biological life, just don't see it, are unable to see it, they probably consider the evolution of machine life as a series of random changes, mutations, which over zillions of generations were perfected in the fight for survival. They may theorize about different directions in this evolution and discuss why some branches of species died out. They may find proof of an energy crisis which hit the Earth in pre-historic times and wiped out the less energy-efficient monstrous species.
They witnessed the birth of tiny 8-bit species which took over the world in the end.

They may argue that such an evolution cannot be random, that there must be a master programmer, a god behind all this. They may discuss evolutionism vs. creationism. Uh-h-h. It's getting scary. I'd rather not elaborate on this further.

I think you've got the idea.
Use your imagination.

But please remember what I started from:
This analogy is just as good as it is bad. :-)

Even with all the oversimplification, somehow I feel that genetic science went down a similar path.

6. I am not afraid to sound silly

4/13/2013 mid-Atlantic

As an amateur explorer without deep knowledge of the subject, I do not have any boundaries of well-established knowledge. My only boundaries are those of my imagination. I am not afraid of saying silly things (silly from the official science point of view) if I find them reasonable. Sure, I should not rely on changing fundamental physical constants though.

When I started my readings on genomics I frankly was very surprised, to put it mildly.

The last 50-60 years transformed digital technology into a mature science with a well-defined and well-established organization in all facets of the industry, from chemical and physical achievements to data processing and distribution and graphical interfaces.
I am happy that I witnessed most of this timeframe myself and was part of these developments too.
No doubt there were lots of quarrels as well as mutual agreements on coding, protocols, and other subjects of mutual interest.

Side notes.
1. I remember the existence of 9 different code tables for the Cyrillic alphabet.
2. When the first prototypes of the Soviet supercomputer Elbrus were built, someone decided to make it super-Russian as well. So they renamed the hexadecimal digits
A B C D E F into Cyrillic
А Б В Г Д Е
and we had two consoles: one with Latin hexadecimals and another with Cyrillic.
Thus, say, 7BDF on the Latin console looked like 7БГЕ on the Cyrillic one.
In the end the international standard was adopted. :)

What has helped? RFCs? Maybe information technology itself? I do not have an answer, but it worked.

Now I decided to find out what "gene" means, to try to apply its definition to digital processing of the hexadecimal presentation of a genome. The definition turned out to be very broad: some piece of DNA sequence that encodes instructions. With such a definition it is no wonder that they do not even know how many genes there are in a genome. They say that the human genome has 20000-25000 genes. Nice range for something that calls itself a "science"!

Then I found that the situation is even worse. They classify genes mostly not by how they work but by how they are broken, and cannot even agree on how to name them.

So we can find gene definition like this:
CSF2RA, ID 1438, updated on 30-Mar-2013,
Also known as: GMR; CD116; CSF2R; SMDP4; CDw116; CSF2RX; CSF2RY; GMCSFR; CSF2RAX; CSF2RAY; GM-CSF-R-alpha.

I pity them, I feel sorry for them.

Tremendous, absolutely astonishing work on DNA sequencing was done in the past 20 years! But it all ended up in such poorly organized structures. They try to find the function of every gene experimentally, but have no clearer definition of a gene than a "piece of DNA" that codes proteins. No wonder that even with the whole DNA sequence of a genome they do not know how many genes are there. Each gene can be found experimentally only. Their way of exploration is comparing DNA sequences letter by letter in search of something broken, classifying the broken part as a genetic disorder, and putting this knowledge into all kinds of databases with all possible comments on the research.

To define the location of a gene in a genome they created a very artificial creature named Gene Map Locus. Say, the HFE (aka, aka, aka...) human gene has Locus 6p21.3, which means that it is located in the short arm (p) of chromosome 6 in a place inconveniently defined by someone as region 21.3. I am glad that they at least have standard chromosome numbering. Everything else looks like total nonsense from a digital processing point of view. To find the data for a gene I need to know the name of the genetic disorder it causes!

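They classify mutations like this: C282Y aka CYS282TYR, which describes the substitution of the amino acid cysteine (C) with tyrosine (Y) at position 282 of the protein the gene encodes. At the level of bases, a substitution of C with T would, from the hex point of view, simply be losing a 1 at that position, transforming 1 (base 4) = 01 (base 2) = C into 0 (base 4) = 00 (base 2) = T.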

They defined 3-letter "bytes" they call "codons" to form the table of the genetic code:

It looks like they chose the number 3 just because the resulting table nicely fits on 1 sheet of paper.

I feel that it will not be that easy to find a common language with people who believe in the magic of the numbers 3 and 20 instead of the nice and round 4 (or better 8, because it gives the discovered pattern) and 32 or even 256 for amino acids, 236 of which it would be reasonable to add to the 20 already known. But what do I know about all this? Practically nothing. I am just playing with digits. :)

I realized that contemporary genetics has a noble goal, but one different from data processing: its main goal is to fight genetic disorders, not to explore how it should work, how it was originally programmed.

7. The greatest achievement of genetics

April 20, Barcelona vacation
Please ignore the text formatting: I'm not very comfortable with tablets.

Genetics has many issues that I can ridicule, like multiple names for a gene.

Look at their 3-letter "byte" they call a "codon", which codes 20 amino acids. AAs are execution state commands that go straight for processing to our internal "CPUs", thus being a kind of hardware-side unit from our software point of view, and not that interesting, at least for a while.

There is a fishy thing about this: different codons can generate the same AA. If I had known this fact before I started my journey into genetics, I would have had to explain this experimental fact somehow.

Luckily for me, my abstract approach came first, so I did not have these boundaries to limit my imagination. So I would rather assume that they simply have not considered 4-letter commands (a perfect 8-bit byte), which would give 256 possible combinations. This fantasy requires more thinking later...

But, besides the unprecedented, tremendous efforts on sequencing different genomes, modern genetics has one ultimate, unparalleled achievement:

They made all genetic databases public and free to access.

This gave amateurs like me the ability to explore their cumulative knowledge from any angle, the angle they couldn't possibly predict.

Tuesday, April 2, 2013

Step 5. The Pattern

Previously: Step 1. Digitizing
Previously: Step 2. Visualization
Previously: Step 3. SumMatrix is defined
Previously: Step 4. Video pause

To better recognize the pattern I drew SumMatrix Level1 for cow chromosome 1:

SumMatrix Level1 has only 0 and 1 values. I took this file and put it into an Excel spreadsheet.

It was fairly easy to analyze.

Here is a picture of the resulting pattern as I was able to extract it from the SumMatrix.

The pattern definitely has 4 well-distinguished objects inside (again four! Number 4 is everywhere!):

3 "square-shapes" and one "fence". The square-shapes have border thickness 16, 4, and 1. The fence-shape has thickness 1. That is why I decided to call them S16, S4, S1, and F1.

Let's describe parameters of each of them. HEX numbers are prefixed with "0x"

Pattern S16

There are 4 squares separated by 16-number "borders".
There are 0x80 x 0x80 squares offset 0 with border width 0x10

1 stripe (width=16): 0x70 - 0x7F

Formula:

(0x70 <= X < 0x80) OR (0x70 <= Y < 0x80)

Pattern S4

There are 4 in a row 0x40 x 0x40 squares, offset 0x20 with border width 4

4 stripes (width=4): 0x1C-0x1F, 0x5C-0x5F, 0x9C-0x9F, 0xDC-0xDF

Pattern S1

There are 16 in a row 0x10 x 0x10 squares, offset 0x8 with border width 1

16 stripes (width=1):
0x07, 0x17, 0x27, 0x37, 0x47, 0x57, 0x67, 0x77, 0x87, 0x97, 0xA7, 0xB7, 0xC7, 0xD7, 0xE7, 0xF7

Pattern F1

There are 64 stripes, vertical only, running from 0xC0 down to 0xFF, offset 0x2, with border width 1.

I noticed that 0xC0 is 1/4 from the edge.

Maybe I'm wrong, but for symmetry's sake I would add the same 1/4 from the edge to the horizontal side:

In this case the resulting pattern would be more symmetrical too:

Now everything is perfect and a multiple of 4!

I put pattern parameters in the table:

Pattern	Stripe width	Num of stripes per dimension	Offset	Gap	Stripe range
S16	16	1	0=128	128	0 - 255
S4	4	4	32	64	0 - 255
S1	1	16	8	16	0 - 255
F1	1	64	2	4	192 - 255

or, in HEX:

Pattern	Stripe width	Num of stripes per dimension	Offset	Gap	Stripe range
S16	0x10	0x1	0=0x80	0x80	0x0 - 0xFF
S4	0x4	0x4	0x20	0x40	0x0 - 0xFF
S1	0x1	0x10	0x8	0x10	0x0 - 0xFF
F1	0x1	0x40	0x2	0x4	0xC0 - 0xFF
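The table can be turned back into stripe positions with a few lines of Python. This is only a sketch: the rule "stripe number k ends just before offset + k * gap" is my reading of the ranges listed above for S16, S4, and S1, and for S16 I use the 0x80 reading of the offset. F1 is left out because the post does not list its exact stripe positions.

```python
def stripe_cells(width, count, offset, gap):
    """Coordinates (0..255) covered by the stripes of one pattern.

    Assumed rule, inferred from the listed ranges: stripe k covers
    [offset + k*gap - width, offset + k*gap - 1].
    """
    cells = set()
    for k in range(count):
        end = offset + k * gap
        cells.update(range(end - width, end))
    return cells


# Parameters taken from the table above.
S16 = stripe_cells(width=0x10, count=0x1, offset=0x80, gap=0x80)
S4 = stripe_cells(width=0x4, count=0x4, offset=0x20, gap=0x40)
S1 = stripe_cells(width=0x1, count=0x10, offset=0x8, gap=0x10)


def in_square_pattern(stripes, x, y):
    """True if cell (x, y) lies on a vertical or a horizontal stripe."""
    return x in stripes or y in stripes
```

For S16 this reproduces the formula (0x70 <= X < 0x80) OR (0x70 <= Y < 0x80), so in_square_pattern(S16, 0x75, 0x00) is true and in_square_pattern(S16, 0x60, 0x60) is false.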

Now I need to come up with an explanation of the patterns in the real world.

Any ideas / hypotheses are welcome!

Sunday, March 24, 2013

Step 4. Video pause

Previously: Step 1. Digitizing
Previously: Step 2. Visualization
Previously: Step 3. SumMatrix is defined

I believe that SumMatrix is the first step towards understanding #Genome as a #Program.
SumMatrix is a statistical signature of a genome.

"The Matrix" was a fancy fantasy.
I discovered that all living creatures from viruses to Homo Sapiens are based on the matrix with recognizable patterns.

I think that the digital era of genomics starts here.
You can be the very first witnesses of this.

Watch SumMatrix for different chromosomes of different species.
Notice that Human is statistically closer to Influenza than to Honey Bee.

Saturday, March 23, 2013

Step 3. SumMatrix is defined

Is there any meaningful use here?

Previously: Step 1. Digitizing
Previously: Step 2. Visualization

Each element in the matrix represents a sequence of 8 nucleotides.

Thus the point with coordinates (X,Y) = Hex (80,7F) = Quaternary (2000, 1333) represents the sequence ATTTCGGG.

Each dot on the matrix grid represents the sum of all occurrences of the sequence in the genome.

That is why I named it

SumMatrix

A SumMatrix is a statistical signature of a genome.
A SumMatrix presents the frequency distribution of 8-letter codons in a genome.
The bigger the value in a cell of the matrix, the more often the combination occurs in a genome, and the darker the cell looks in the SumMatrix presentation.
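In Python the definition can be sketched roughly like this. I assume here that the packed bytes are taken in non-overlapping pairs (first byte = X, second byte = Y), which matches the Step 2 description but is not the only possible reading; the function names are invented for the example.

```python
from collections import Counter

LETTERS = "TCAG"  # digit -> nucleotide: T=0, C=1, A=2, G=3


def byte_to_seq(b):
    """Decode one packed byte back into its 4 nucleotides."""
    return "".join(LETTERS[(b >> shift) & 3] for shift in (6, 4, 2, 0))


def cell_to_seq(x, y):
    """The 8-nucleotide sequence that cell (X, Y) stands for."""
    return byte_to_seq(x) + byte_to_seq(y)


def sum_matrix(data):
    """Count how many times each (X, Y) byte pair occurs in the data.

    Assumption: pairs do not overlap, i.e. (b0, b1), (b2, b3), ...
    """
    counts = Counter()
    for i in range(0, len(data) - 1, 2):
        counts[(data[i], data[i + 1])] += 1
    return counts


matrix = sum_matrix(bytes.fromhex("807F807F4CCC"))
```

With this toy input the cell (0x80, 0x7F), which stands for ATTTCGGG, gets the value 2.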

But the visual patterns gave me a feeling that the less frequent sequences must be more interesting.
Some combinations are rare, some of them are extremely rare, and some, probably, do not exist at all.
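Finding such cells is a simple scan over all 65536 coordinates. A small sketch, assuming the counts are kept in a dictionary keyed by (X, Y); the threshold for "rare" is an arbitrary parameter that I made up for the example.

```python
def rare_cells(counts, limit):
    """Split cells into absent (count 0) and rare (count 1..limit).

    counts: dict mapping (x, y) -> number of occurrences.
    limit:  hypothetical threshold below which a cell counts as rare.
    """
    absent, rare = [], []
    for x in range(256):
        for y in range(256):
            n = counts.get((x, y), 0)
            if n == 0:
                absent.append((x, y))
            elif n <= limit:
                rare.append((x, y))
    return absent, rare


# Toy counts: one frequent cell, one rare cell, everything else absent.
counts = {(0x80, 0x7F): 5, (0x01, 0x02): 1}
absent, rare = rare_cells(counts, limit=1)
```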

What are these rare combinations?
The result of mutations for better organization?
Are they rarely occurring markers which separate important chains of nucleotides?
Or, maybe, they are simply errors in genome sequencing?
Or they are something else?
Or they are all of the above?

Investigation is required, lots of digital analysis, and new ideas are needed as well.

SumMatrix normalization

Each cell of a SumMatrix is a counter of the corresponding 8-letter codon in a genome.

Absolute numbers are not very suitable for statistical analysis: different chromosomes of different species contain very different numbers of nucleotides. So the absolute counts of a sequence, say, 7F7F, are very different for different genomes.

For comparison and other digital processing I need relative numbers which do not depend on the length of a genome. After some experiments I decided to use a "median" approach: I sort all incoming values and then divide them into same-size groups. Then I use the group number instead of the absolute value of a cell.

I call such a simplified matrix a "normalized SumMatrix".
I decided that the number of groups (and thus of values in the normalized SumMatrix) will be a power of 2:
Level 1 SumMatrix contains values 0 and 1 only.
Level 2 SumMatrix contains values 0 thru 3.
. . .
Level 8 SumMatrix contains values 0 thru 255

The higher the level of a normalized SumMatrix, the higher the accuracy of the genome presentation.
I decided that a maximum of 8 levels is enough, simply because in this case I can easily present a value as a shade of gray.
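The "sort, split into same-size groups, use the group number" idea can be sketched in Python as below. This is my reading of the description, not the original program: in particular, cells with equal counts may land in different groups here, and how ties were really handled is not stated in the post.

```python
def normalize(values, level):
    """Replace each cell value by its rank group, 0 .. 2**level - 1.

    values: flat list of cell counts (65536 of them for a full matrix).
    level:  1..8; level 1 gives only 0 and 1, level 8 gives 0..255.
    Assumption: ties are broken by position, so equal counts can end up
    in neighbouring groups.
    """
    groups = 2 ** level
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank * groups // len(values)
    return result


cells = [10, 50, 20, 40, 30, 60, 80, 70]
level1 = normalize(cells, 1)  # lower half -> 0, upper half -> 1
level2 = normalize(cells, 2)  # four groups of two cells each
```

For a full 256 x 256 matrix at level 8 this gives 256 groups of 256 cells each, which maps directly onto 256 shades of gray.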

Here is a visual comparison of "full" and "normalized" SumMatrix:

6400K fragment of "full" SumMatrix

Normalized SumMatrix level 8

Normalized SumMatrix is definitely less detailed but should be much better for digital processing.

To be continued...

Friday, March 22, 2013

Step 2. Visualization

How to show 15 million numbers?

Previously: Step 1. Digitizing
Now I have my binary file chrY.fa.tcag.n-t.hex.
Some sort of presentation of this whole bunch of numbers would be very useful.

Do you have any idea? I did not.

So, I tried the following: take 2 consecutive bytes and pretend that they are X and Y coordinates on a surface.

This way I would have a 256 x 256 square (0x100 x 0x100 in hex) with both coordinates ranging from 0 to FF.

I took the first 64K hex numbers from the file and put a dot in the position defined by each pair of coordinates. The more dots were drawn at the same coordinates, the darker they look.
64K hex numbers represent a quarter of a million nucleotides.
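A rough Python sketch of this drawing step, using only the standard library. The details are my assumptions: bytes are paired without overlap, darkness is scaled linearly to the busiest cell, and the picture is written in the simple PGM grayscale format rather than whatever the original program used.

```python
def add_chunk(grid, data):
    """Add one chunk of bytes to a 256 x 256 grid of hit counts."""
    # Assumption: bytes are paired without overlap: (b0, b1), (b2, b3), ...
    for i in range(0, len(data) - 1, 2):
        x, y = data[i], data[i + 1]
        grid[y][x] += 1


def to_pgm(grid):
    """Render the grid as a binary PGM image: more hits -> darker pixel."""
    peak = max(max(row) for row in grid) or 1
    pixels = bytearray()
    for row in grid:
        for count in row:
            pixels.append(255 - (255 * count) // peak)
    return b"P5\n256 256\n255\n" + bytes(pixels)


grid = [[0] * 256 for _ in range(256)]
add_chunk(grid, bytes.fromhex("B322FA36BF804CCC"))  # 4 dots
image = to_pgm(grid)
```

Reading the real file in pieces of 65536 bytes and calling add_chunk for each piece would correspond to clicking the button again and again; the bytes returned by to_pgm can be written to a .pgm file and opened in most image viewers.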

Frankly, I expected to see some randomly spread dots.

Here was the result:

OK. This was kind of expected. I clicked a button in my program a few times more, each time adding another 64K numbers.

256K: is there some pattern there, or is this just my imagination?

I started to add more and more data.
704K:

1344K:

1984K:

4416K:

Wow! This is incredible! This is the human matrix.
Have I discovered something amazing?
Feels even scary.

I stopped. I need to catch my breath. I need somehow to comprehend the result.

First I decided to compare the picture with other binary files. Maybe I do not know something about my newly created way of presenting a file. Maybe they all look the same way.

Let's take, say, a zip file.

Looks pretty random to me. Better I try an exe or a dll. They are programs too, after all.

Looks cool but very far from the human matrix.

I decided to try different species.
Is this matrix uniquely human or what?

First I found a mouse.

No, we are not much different from a mouse.

What about, say, a lizard?

Wow! Lizard has the same pattern too.
I tried cow, dog, lancelet (primitive fish-like creature), blowfish, wild boar, and orangutan.
They all have the same pattern!

I found a different pattern in worms:

which resembles the pattern of a mosquito:

and honey bee:

I'm hooked. I want to try everything.

Bacteria. E.Coli:

Influenza:

It looks closer to human than to bee or E.Coli.

Let me try plants.
Glycine:

I believe that you've got the idea. I have.

To be continued...

Thursday, March 21, 2013

Step 1: Digitizing.

A genome encoded in DNA is a program.

This is a well-known fact, at least to me.
But what is its instruction set?
First, I need to find out what digits are used for the instructions.

I know nothing about biology and chemistry.
I forgot most of my Math.

I just like to play with numbers.

That is why I was very intrigued when I found out that a genome contains only four nucleotides: A, C, G, T.
This is amazing! Not 3, not 7 or 10, or 255.
Just 4!

This is perfect for the 4 digits of the quaternary (base-4) system.

Note for a less digitally literate audience.

Widely used numeral systems are:

- Decimal (base-10) has 10 digits: 0,1,2,3,4,5,6,7,8,9

- Binary (base-2) has 2 digits: 0,1

- Octal (base-8) has 8 digits: 0,1,2,3,4,5,6,7

- Hexadecimal (base-16) has 16 digits: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F

So for the quaternary (base-4) system I just need to assign the digits 0 to 3 to the known nucleotides.
I decided that the most natural way of arranging the nucleotides is by their weight.

Thus, here is the assignment:

T - 0
C - 1
A - 2
G - 3
I can use 2 bits to encode one nucleotide.

Now sequence
AGTGTATAGGAATGCAAGGGATTTCTGTGTGT
I can write in Quaternary system as
23030202332203122333200010303030
or in Binary system
1011001100100010111110100011011010111111100000000100110011001100
or in Hexadecimal system
B322FA36BF804CCC
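The same conversion as a short Python sketch. The digit assignment is the one from this post; the function names are mine.

```python
# Digit assignment from this post: T=0, C=1, A=2, G=3.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}


def to_quaternary(seq):
    """One base-4 digit per nucleotide."""
    return "".join(str(CODE[ch]) for ch in seq)


def to_binary(seq):
    """Two bits per nucleotide."""
    return "".join(format(CODE[ch], "02b") for ch in seq)


def to_hex(seq):
    """One hex digit per 2 nucleotides (expects an even-length sequence)."""
    bits = to_binary(seq)
    return "".join(format(int(bits[i:i + 4], 2), "X")
                   for i in range(0, len(bits), 4))


seq = "AGTGTATAGGAATGCAAGGGATTTCTGTGTGT"
quaternary = to_quaternary(seq)
hexadecimal = to_hex(seq)
```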

Side note
By the way, this is the beginning sequence of the X chromosome of a cow
(taken from Bos_taurus.UMD3.1.68.dna.chromosome.X.fa marked as
"X dna:chromosome chromosome:UMD3.1:X:1:148823899:1 REF" whatever it means)

I downloaded a few genomes in FASTA format and transformed them into HEX format.
As the sources I used the following FTP sites:

ftp://hgdownload.cse.ucsc.edu/apache/htdocs/goldenPath/currentGenomes/
ftp://ftp.ensembl.org/pub/mnt2/release-70/fasta/
ftp://ftp.ncbi.nlm.nih.gov/genomes/

Naturally, I started with the most intriguing one, the human Y chromosome. You can pick up the result of the transformation from this folder on my Google Drive.

The original text file chrY.fa was about 60MB.

The resulting binary file chrY.fa.tcag.n-t.hex is about 15MB.

Side note

"hex" in the extension of the file name indicates that this is a binary file.

"n-t" indicates that I replaced all N (unknown) with T. I needed to make this replacement: there is no N nucleotide after all.

"tcag" indicates how I translated symbols to digits, in this case T=0,C=1,A=2,G=3

The size of the resulting file is 4 times smaller than the original one.

This is because each symbol (one byte = 8 bits) in the original file now occupies only 2 bits, and each byte of the result represents 4 original symbols.
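The packing itself can be sketched in Python as follows. It is not the original converter: skipping the ">" header lines, treating lower-case letters like upper-case ones, and padding an incomplete last byte with T are my assumptions. Any other symbol (the FASTA ambiguity codes, for example) would raise an error in this sketch.

```python
# T=0, C=1, A=2, G=3 as in the "tcag" scheme; N is replaced with T ("n-t").
CODE = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 0}


def pack_fasta(lines):
    """Pack FASTA text lines into bytes, 4 nucleotides per byte."""
    digits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip empty lines and the FASTA header
        digits.extend(CODE[ch] for ch in line.upper())
    # Assumption: an incomplete final group is padded with T (0).
    while len(digits) % 4:
        digits.append(0)
    out = bytearray()
    for i in range(0, len(digits), 4):
        a, b, c, d = digits[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)


packed = pack_fasta([">X example header",
                     "AGTGTATAGGAATGCA",
                     "agggatttctgtgtgt"])
```

Here 32 nucleotides become 8 bytes, the same 4-to-1 ratio as the 60MB text file shrinking to about 15MB.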

Now that I have 15 million hexadecimal numbers, I can analyze them digitally. I have no idea what to do with them.

Now that I know the coding base of The Program, I can try to look for something interesting in these numbers.

Maybe there are some fundamental constants hidden there?

Say, π or e or the most important of all "The golden ratio φ"?

There is one problem: I do not know how to encode real (non-integer) numbers in hexadecimal format.

I'll think about this later...
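One possible way, given here only as a sketch: write the fractional part of the number digit by digit in base 16, just as decimal fractions are written in base 10. The function name is made up, and since it starts from an ordinary floating-point value, only about the first dozen hex digits can be trusted.

```python
from fractions import Fraction
import math


def fraction_hex_digits(value, n):
    """First n hexadecimal digits of the fractional part of value."""
    frac = Fraction(value) % 1  # exact value of the float, fraction only
    digits = []
    for _ in range(n):
        frac *= 16
        digit = int(frac)  # next hex digit is the integer part
        digits.append("0123456789ABCDEF"[digit])
        frac -= digit
    return "".join(digits)


pi_digits = fraction_hex_digits(math.pi, 8)
pattern = bytes.fromhex(pi_digits)  # 4 bytes to look for in a packed genome
```

Whether a constant would be stored in a genome in this positional form is, of course, pure speculation; the sketch only shows that such a byte pattern can be built and then searched for with the ordinary "pattern in data" test.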

Brief summary of the Step 1

Life has the quaternary (base-4) system as its fundamental base.

Let's not create some artificial "genome coding" with a bunch of silly symbols like S=CG or D=AGT.

I saw that there are dozens of formats for genome storage.

Let's store all genetic information in one unified format - the original format it was created in. Look out of the window: it's the digital age out there!

There are pros and cons to using a binary format for genomic storage. A binary genomic format is precise. This is its strength, this is its weakness.

Inside the file there is no place for comments, for unknown pieces, lacunae, for results of explorations, for grouping, etc.

All this additional information must be stored separately, with indexes/pointers to the original binary file. Creators and lovers of new formats have a broad field of possibilities, as broad as their imagination.

But binary files have much more power for digital processing and analyzing.

So,

Let's store all genetic information in one unified format - binary, as dictated by the fundamental base.

To be continued...
