Monday, April 22, 2013

10. Digital Genetics is born


4/19/2013. Mediterranean Sea, near the coast of Spain.

"Digital Genetics" is a preliminary name for a new scientific discipline that would use Digital/Information Technology methods and approaches, the IT way of thinking, to explore genetic tasks from an informational point of view.

The SumMatrix patterns discovered in genomes are to be explained. The biological sense of the patterns needs to be found. This may lead to re-classification of living species by the patterns.

Genomic entities, like chromosome, gene, junk DNA, mRNA join, etc., need to be defined digitally. Tools for digital search of genes (at least) should be created.

SumMatrix analysis is needed for
- comparing genomes of different species
- comparing chromosomes of one genome
- comparing different iterations of a DNA/RNA sequencing
- adding / subtracting, say, the genome of a virus or of multiple viruses to / from other genomes
- many more that I can't even imagine right now

The ultimate goal is to define the instruction set and architecture of the biological processor which we call "life".


* * *

The immediate things that could be done right now are

- creating tools for re-digitizing genetic databases into hex format
- creating tools for calculating SumMatrix and its normalization
- creating tools for SumMatrix comparison and for presenting the resulting difference
- creating tools for analysing SumMatrix by patterns
- classifying SumMatrix patterns, trying to find how many patterns exist
- creating a centralized online resource for communication, coordination of actions and presenting the results of research
- re-thinking the genetic code table of amino acid production, probably re-classifying the resulting proteomes
and many more...
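
As a rough illustration of the comparison tool from the list above, here is a minimal Python sketch. It assumes a SumMatrix can be held as a square list of lists of numbers (the 0/1 values of a Level1 matrix are one case). The function name `diff_cells` and the toy 4 x 4 data are made up for the example, and the SumMatrix calculation itself is not reproduced here.

```python
def diff_cells(a, b):
    """Return (x, y, a_value, b_value) for every cell where two matrices differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [
        (x, y, va, vb)
        for y, (ra, rb) in enumerate(zip(a, b))
        for x, (va, vb) in enumerate(zip(ra, rb))
        if va != vb
    ]


# Toy 4 x 4 stand-ins for two Level1-style matrices (hypothetical data).
first = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
]
second = [
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]

for x, y, va, vb in diff_cells(first, second):
    print(f"cell (0x{x:02X}, 0x{y:02X}): {va} -> {vb}")
```

A list of differing cells is only one possible presentation; a picture of the difference would probably be easier on the eyes for a full-size matrix.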

From us, software/IT professionals, to them, geneticists and biologists:

Let us help you to help all of us.


April 22, 2013. Barcelona

Saturday, April 20, 2013

9. This reminds me of something


8. How to explore an unknown program for an unknown operating system running an unknown instruction set on an unknown PC


4/14/2013 mid-Atlantic


I am trying to find an analogy for exploring a genome as a program.
Like any other analogy this one is as good as it is bad.

Imagine that a highly advanced non-biological alien race, say a "Crystal entity", or The Crystallites, came to Earth unnoticed to explore local life.

They discovered that life on Earth appeared approximately 2.2e21 picoseconds ago (about 70 of our years), though paleontological researchers discovered very early signs of life as far back as 4.7e21 ps ago. The Crystallite scientists still argue whether this mechanical garbage (known to us as Babbage garbage) can be considered life, especially because it never really came to life.

The evolution of life came a long way, starting from elementary monster-like electronic bulb species with just a few codes in their instruction set and ending with the sophisticated all-in-one creatures that formed the current biosphere on Earth (the internet in our terms).

Despite all the wide differences in size, shape, color, architecture, and technology, they had one thing in common: all signals, commands and whatever else were organized in bytes of 8 bits. Though there were some died-out branches of the evolution that used 3-state bits or 6-bit bytes, they ... died out in the fight for survival. Thus the 8-bit byte is the minimal unit of any physical signal, command or data. Sure, a smaller portion of a byte could have some meaning, say a half-byte (hex) or even 1 bit (binary), but the minimal stored/processed unit is still the byte. Consecutive bytes can be combined into larger but still elementary units of 2, 4, 8, even 16 bytes (what we call "a word"). The evolution of life has seen attempts to extend a word with an extra byte (a tag for describing the word and for error tolerance) or even 1-2 extra bits to distinguish commands from data and for error checking. But it looks like none of them survived this fight. It looks like life prefers to be more economical than error-free.

The Crystallites (TCs for short) started to explore different species (which we call mainframes, PCs, tablets, smart phones, printers, scanners, TVs, players, etc.), the whole zoo of life forms. They found out that every living organism contains big blobs of digits which they decided to call, say, chromosomes (and what we call programs). They assigned a number to each chromosome known to them and started to explore each of them separately.

First they sequenced as many chromosomes as possible to look at the hex content.

Then TCs discovered that every chromosome has parts common to different species, which they decided to call, say, chromosome regions (and what we could imagine as statically embedded DLLs). They discovered that each region contains coding fragments, each of which they decided to call, say, a gene (a function in our world).

TCs decided to find out what each gene is responsible for. So, they cut out a single gene and submitted it for processing. They noticed that if they start execution from a particular offset, in 85% of the experiments the speakers start to produce a sound of random frequency (they are not aware of parameters). They even found that the function has multiple entry points and multiple exit points. So they classified the gene as belonging to a sound phenotype and conveniently named it RFSP aka random-frequency-sound-producer. Tremendous work was done trying to find the functional responsibility of as many genes as possible.

Isn't it a silly way of exploring the unknown?
Probably not.
How else can you explore a black box you have no idea about?

So far, so good.

There is a catch. It is very easy to damage a gene so that it doesn't work properly.
But it is very hard to create a meaningful new gene, or to modify an existing one, to produce the result they need. Say, modify the RFSP gene so that it produces a 440 Hz sound. It is possible. They can try changing the value of each byte of the gene and eventually replace a parameter with the constant 440. The gene does what they wanted. But the living species starts beeping instead of talking.

So, genetic modification is extremely dangerous and may lead to unpredictable results if their knowledge is based on experimental data only. Knowledge of the instruction set and common structure (architecture) is required to get any meaningful result. Understanding in this area would open absolutely astonishing possibilities. But who knows when these breakthroughs may happen? You can spend many e20 ps and still not have any meaningful result.

So, The Crystallites decided to explore the damage in genes that happens, though rarely, during copying of chromosome regions from one species to another.
They noticed that if the gene CNSS aka create-new-SPREADSHEET loses one bit at position 284, only half of the expected spreadsheet appears on screen. They named the mutation 1-284-0, classified the genetic disorder as SPREADSHITTING DISEASE, put descriptions in all their databases, and continued to explore other damage.

Because The Crystallites do not recognize our biological life, just don't see it, are unable to see it, they probably consider the evolution of machine life as a series of random changes, mutations, which over zillions of generations were perfected in the fight for survival. They may theorize about different directions in this evolution and discuss why some branches of species died out. They may find proof of an energy crisis which hit the Earth in pre-historic time and wiped out the less energy-efficient monstrous species.
They witnessed the birth of tiny 8-bit species which took over the world in the end.

They may argue that such an evolution cannot be random, that there should be a master programmer, a god behind all this. They may discuss evolutionism vs. creationism. Uh-h-h. It's getting scary. I'd rather not elaborate on this further.

I think you've got the idea.
Use your imagination.

But please remember what I started from:
This analogy is just as good as it is bad. :-)



Even with all the oversimplification, somehow, I feel that genetic science went down a similar path.

6. I am not afraid to sound silly




4/13/2013 mid-Atlantic


As an amateur explorer without deep knowledge of the subject I do not have any boundaries of well established knowledge. My only boundaries are my imagination. I am not afraid of saying silly (from the official science point of view) things if I find them reasonable. Sure, I should not rely on changing fundamental physical constants though.

When I started my readings on genomics I frankly was very surprised, to put it mildly.

The last 50-60 years have transformed digital technology into a mature science with well defined and well established organization in all facets of the industry, from chemical and physical achievements to data processing and distribution and graphical interfaces.
I am happy that I witnessed most of this timeframe myself and was part of these developments too.
No doubt there were lots of quarrels as well as mutual agreements on coding, protocols, and other subjects of mutual interest.

Side notes.
1. I remember the existence of 9 different code tables for the Cyrillic alphabet.
2. When the first prototypes of the Soviet supercomputer Elbrus were built, someone decided to make it super-Russian as well. So they renamed the hexadecimal digits
A B C D E F into Cyrillic
А Б В Г Д Е
and we had two consoles: one with Latin hexadecimals and another with Cyrillic.
Thus, say, 7BDF on the Latin console looked like 7БГЕ on the Cyrillic one.
In the end the international standard was adopted. :)


What has helped? RFCs? Maybe information technology itself? I do not have an answer, but it worked.

Now I decided to find out what "gene" means, to try to apply its definition to digital processing of the hexadecimal presentation of a genome. The definition turned out to be very broad: some piece of a DNA sequence that encodes instructions. With such a definition it is no wonder that they do not even know how many genes there are in a genome. They say that the human genome has 20000-25000 genes. Nice range for something calling itself a "science"!

Then I found that the situation is even worse. They classify genes mostly not by how they work but by how they are broken, and cannot even agree on how to name them.

So we can find a gene definition like this:
CSF2RA, ID 1438, updated on 30-Mar-2013,
Also known as: GMR; CD116; CSF2R; SMDP4; CDw116; CSF2RX; CSF2RY; GMCSFR; CSF2RAX; CSF2RAY; GM-CSF-R-alpha.

I pity them, I feel sorry for them.

Tremendous, absolutely astonishing work on DNA sequencing was done in the past 20 years! But it all ended up in such poorly organized structures. They try to find the function of every gene experimentally, but do not have a clearer definition of a gene than a "piece of DNA" that codes proteins. No wonder that even having the whole DNA sequence of a genome they do not know how many genes are there. Each gene can be found experimentally only. Their way of exploration is comparing DNA sequences letter by letter in search of something broken, classifying the broken part as a genetic disorder, and putting this knowledge into all kinds of databases with all possible comments on the research.

To define the location of a gene in a genome they created a very artificial creature named Gene Map Locus. Say, the HFE (aka, aka, aka...) human gene has Locus 6p21.3, which means that it is located in the short arm (p) of chromosome 6 in a place inconveniently defined by someone as region 21.3. I am glad that they at least have standard chromosome numbering. Everything else looks like total nonsense from a digital processing point of view. To find a gene's data I need to know the name of the genetic disorder it causes!

They classify mutations like this: C282Y aka CYS282TYR, which describes substitution of cysteine (C) with tyrosine (Y) at position 282 of the protein the gene encodes. At the DNA level, a substitution of a C base with a T base is, from the hex point of view, simply losing a 1 at the position, transforming 1(4) = 01(2) = C to 0(4) = 00(2) = T.
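
The bit arithmetic of a C-to-T substitution can be sketched in Python. Only T = 0 and C = 1 come from the text above; the values for A and G, and the names `BASE_CODE` and `bit_change`, are assumptions made for the example.

```python
# 2-bit code per base. T = 0b00 and C = 0b01 follow the text;
# A = 0b10 and G = 0b11 are ASSUMED here only to complete the table.
BASE_CODE = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}


def bit_change(before, after):
    """Return the old 2-bit value, the new one, and the bits that flipped."""
    old, new = BASE_CODE[before], BASE_CODE[after]
    return format(old, "02b"), format(new, "02b"), format(old ^ new, "02b")


print(bit_change("C", "T"))  # ('01', '00', '01'): only the low bit is lost
```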


They defined 3-letter bytes they call "codons" to form the Table of genetic codes:

It looks like they chose the number 3 just because the resulting table nicely fits on 1 sheet of paper.

I feel that it will not be that easy to find a common language with people who believe in the magic of the numbers 3 and 20 instead of the nice and round 4 (or better 8, because it gives the discovered pattern) and 32 or even 256 for amino acids, 236 of which it would be reasonable to add to the 20 already known. But what do I know about all this? Practically nothing. I am just playing with digits. :)

I realized that contemporary genetics has a noble goal, but one different from data processing: its main goal is to fight genetic disorders, not to explore how it should work, how it was originally programmed.

7. The greatest achievement of genetics

April 20, Barcelona vacation
Please ignore the text formatting: I'm not very comfortable with tablets.

Genetics has many issues that I can ridicule, like multiple names for a gene.

Look at their 3-letter "byte" they call a "codon", which codes 20 amino acids. AAs are execution-state commands that go straight for processing to our internal "CPUs", thus being a kind of hardware-side unit from our software point of view, and not that interesting, at least for now.

There is a fishy thing about this: different codons can generate the same AA. If I had known this fact before I started my journey into genetics, I would have had to explain this experimental fact somehow.

Luckily for me my abstract approach came first, so I did not have these boundaries which would limit my imagination. So I would rather assume that they simply have not considered 4-letter commands (a perfect 8-bit byte), which would generate 256 possible combinations. This fantasy requires more thinking later...
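
To put numbers on this: a 3-letter word over 4 bases gives 4^3 = 64 combinations, while a 4-letter word gives 4^4 = 256, exactly one 8-bit byte. A minimal Python sketch of packing four bases into a byte follows. The 2-bit values in `CODE` and the name `pack4` are assumptions for illustration, not a fixed encoding.

```python
# ASSUMED 2-bit values per base; any one-to-one assignment works the same way.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}


def pack4(letters):
    """Pack a 4-letter DNA word into one byte, first letter in the highest bits."""
    if len(letters) != 4:
        raise ValueError("expected exactly 4 letters")
    value = 0
    for base in letters:
        value = (value << 2) | CODE[base]
    return value


print(4 ** 3, 4 ** 4)      # 64 256
print(hex(pack4("GATC")))  # 0xe1
```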

But, besides the unprecedented, tremendous efforts on sequencing different genomes, modern genetics has one ultimate, unparalleled achievement:


They made all genetic databases public and free to access.

This gave amateurs like me the ability to explore their cumulative knowledge from any angle, even an angle they couldn't possibly predict.

Tuesday, April 2, 2013

Step 5. The Pattern


Previously: Step 1. Digitizing
Previously: Step 2. Visualization
Previously: Step 3. SumMatrix is defined
Previously: Step 4. Video pause

To better recognize the pattern I drew SumMatrix Level1 for cow chromosome 1:

SumMatrix Level1 has only 0 and 1 values. I took this file and put it into an Excel spreadsheet.
It was fairly easy to analyze.
Here is a picture of the resulting pattern as I was able to extract it from the SumMatrix.

Definitely the pattern has 4 well distinguished objects inside (again four! Number 4 is everywhere!):
3 "square-shapes" and one "fence". The square-shapes have border thickness 16, 4, and 1. The fence-shape has thickness 1. That is why I decided to call them S16, S4, S1, and F1.

Let's describe the parameters of each of them. HEX numbers are prefixed with "0x".

Pattern S16

There are 4 squares separated by 16-number "borders".
The squares are 0x80 x 0x80, offset 0, with border width 0x10.


1 stripe (width=16): 0x70 - 0x7F 

Formula:
(0x70 <= X < 0x80) OR (0x70 <= Y < 0x80)
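
The formula can be written as a small Python predicate over the grid. This is only a sketch: the function name `in_s16` is made up for the example, and the grid size of 256 x 256 is taken from the 0 - 255 ranges used in this post.

```python
def in_s16(x, y):
    """True if cell (x, y) lies on the S16 border: a 16-wide band in X or in Y."""
    return (0x70 <= x < 0x80) or (0x70 <= y < 0x80)


# Count border cells on a 256 x 256 grid: two bands of 16 * 256 cells
# that share a 16 * 16 crossing, so 4096 + 4096 - 256 = 7936.
border = sum(in_s16(x, y) for x in range(256) for y in range(256))
print(border)  # 7936
```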

Pattern S4

There are 4 squares of 0x40 x 0x40 in a row, offset 0x20, with border width 4




4 stripes (width=4): 0x1C-0x1F, 0x5C-0x5F, 0x9C-0x9F, 0xDC-0xDF

Pattern S1

There are 16 squares of 0x10 x 0x10 in a row, offset 0x8, with border width 1


16 stripes (width=1):
0x07, 0x17, 0x27, 0x37, 0x47, 0x57, 0x67, 0x77, 0x87, 0x97, 0xA7, 0xB7, 0xC7, 0xD7, 0xE7, 0xF7


Pattern F1

There are 64 stripes, vertical only, running from 0xC0 down to 0xFF, offset 0x2, with border width 1.
I noticed that 0xC0 is 1/4 from the edge.

Maybe I'm wrong, but for symmetry's sake I would add the same 1/4 from the edge to the horizontal side.
In this case the resulting pattern would be more symmetrical too:

Now everything is perfect and a multiple of 4!

I put pattern parameters in the table:
Pattern | Stripe width | Num of stripes per dimension | Offset | Gap | Stripe range
S16 | 16 | 1 | 0=128 | 128 | 0 - 255
S4 | 4 | 4 | 32 | 64 | 0 - 255
S1 | 1 | 16 | 8 | 16 | 0 - 255
F1 | 1 | 64 | 2 | 4 | 192 - 255

or, in HEX:

Pattern | Stripe width | Num of stripes per dimension | Offset | Gap | Stripe range
S16 | 0x10 | 0x1 | 0=0x80 | 0x80 | 0x0 - 0xFF
S4 | 0x4 | 0x4 | 0x20 | 0x40 | 0x0 - 0xFF
S1 | 0x1 | 0x10 | 0x8 | 0x10 | 0x0 - 0xFF
F1 | 0x1 | 0x40 | 0x2 | 0x4 | 0xC0 - 0xFF
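
The listed stripe positions can be reproduced from the table parameters. In this Python sketch the rule "stripe k ends just before offset + k * gap" is one reading of the S16, S4 and S1 stripe lists above, not a stated definition. The names `PATTERNS` and `stripes` are made up, and the F1 positions it yields are an extrapolation that has not been checked against the picture.

```python
# (stripe width, number of stripes, offset, gap) per pattern, from the table.
# For S16 the table gives the offset as "0=128"; 128 is used here.
PATTERNS = {
    "S16": (16, 1, 128, 128),
    "S4": (4, 4, 32, 64),
    "S1": (1, 16, 8, 16),
    "F1": (1, 64, 2, 4),
}


def stripes(width, count, offset, gap):
    """Return (first, last) positions of each stripe, ASSUMING stripe k
    occupies the `width` positions just before offset + k * gap."""
    return [(offset + k * gap - width, offset + k * gap - 1) for k in range(count)]


for name, params in PATTERNS.items():
    found = stripes(*params)
    # Show only the first four stripes of each pattern to keep the output short.
    print(name, [f"0x{a:02X}-0x{b:02X}" for a, b in found[:4]])
```

For S16, S4 and S1 this gives back the stripe lists above. The stripe range column (0xC0 - 0xFF for F1) is not modelled in the sketch.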

Now I need to come up with an explanation of the patterns in the real world.

Any ideas / hypotheses are welcome!