We thought long and hard about the signals and interactions between proteins as they move over each nucleotide. We came up with several ideas that might represent encoding, but limited any novelty so that it stayed consistent with Watson and Crick. In the end we settled on a subtle but alternative way of counting nucleotides for protein coding or other signaling. We tested it, confirmed that our results were not random, and eventually used it to direct our research.
First, we iteratively developed a sequence amplification tool that computes and segments every possible subsequence (k-mer/n-gram) of any single-stranded DNA sequence, so that each subsequence can be represented and analyzed independently. In one example, a TP53 intron transcript of 400 nucleotides (letters) produced more than 77,000 k-mers containing a total of 10,735,712 letters. From this we made a striking observation of sequence symmetry that could not have been made without these k-mers.
Pattern(S_i) = S_{i-7}...S_i | S_{i-8}...S_i | S_{i-9}...S_i | ... | S_0 S_1...S_i
For example: G0 T1 G2 G3 G4 C5 C6 C7 A8 G9 A10 C11 (the sequence GTGGGCCCAGAC, with each letter followed by its 0-based position)
Pattern(G0) = Empty (the shortest k-mer has 8 letters, so the first pattern ends at position 7)
Pattern(C7) = GTGGGCCC
Pattern(A8) = TGGGCCCA | GTGGGCCCA
Pattern(G9) = GGGCCCAG | TGGGCCCAG | GTGGGCCCAG
Pattern(C11) = GCCCAGAC | GGCCCAGAC | GGGCCCAGAC | TGGGCCCAGAC | GTGGGCCCAGAC
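The pattern definition above can be sketched in a few lines of Python. This is a minimal illustration and not the tool described in this document: the function names `patterns_ending_at` and `kmer_totals` and the `min_len` parameter are assumptions made for this example, and the minimum length of 8 letters is inferred from the shortest pattern shown.

```python
def patterns_ending_at(seq, i, min_len=8):
    """Return every k-mer of at least min_len letters that ends at
    0-based position i, shortest first (hypothetical helper)."""
    # The shortest k-mer starts at i - min_len + 1; the longest starts at 0.
    return [seq[start:i + 1] for start in range(i - min_len + 1, -1, -1)]


def kmer_totals(n, min_len=8):
    """Return (number of k-mers, total letters) for a sequence of n letters,
    counting every k-mer of length min_len..n (hypothetical helper)."""
    # A sequence of n letters contains n - k + 1 k-mers of length k.
    count = sum(n - k + 1 for k in range(min_len, n + 1))
    letters = sum(k * (n - k + 1) for k in range(min_len, n + 1))
    return count, letters


seq = "GTGGGCCCAGAC"
for i in (0, 7, 8, 9, 11):
    kmers = patterns_ending_at(seq, i)
    print(f"Pattern({seq[i]}{i}) =", " | ".join(kmers) or "Empty")

print(kmer_totals(len(seq)))  # (15, 140)
print(kmer_totals(400))       # (77421, 10735712)
```

Under these assumptions a 400-letter sequence yields 77,421 k-mers and 10,735,712 letters, which agrees with the totals quoted for the TP53 example.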
Most discussion of symmetry concerns the recurrence of letters in different regions of a sequence, as in inverted repeats. Here we demonstrate an added dimension of symmetry by counting the recurrence of k-mers (subsequences) among the letters of all k-mers. One would expect that a k-mer extended by one letter will not differ significantly from the previous k-mer, and for the most part that is true. However, in multiple gene transcripts we found that approximately 15-20% of k-mers shared a unique recurrence (an identical count) with a specific k-mer pair.
To clarify what we mean by a pair: we count the recurrence of a k-mer in all k-mers of greater length (the bigger-length count), and then repeat the count ignoring length (the ignore-length count), so that smaller k-mers are also counted if they recur within the k-mer. With this information, we can subtract the bigger-length recurrence from the ignore-length recurrence to count forward, toward the 5' end, or behind, toward the 3' end of the sequence. The k-mer pairs we refer to in this document have equal recurrences for both the bigger-length count and the ignore-length count. This can be seen in the TP53 example and other transcripts on our website.
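One possible reading of the two counts can be sketched as follows. This is an interpretation made for illustration only, not the counting procedure used by our tool: the names `all_kmers`, `occurrences`, `recurrence_counts` and `shared_counts` are hypothetical, the minimum k-mer length of 8 is assumed, and the "ignore length" count is assumed to be the bigger-length count plus the occurrences of smaller k-mers inside the k-mer.

```python
from collections import defaultdict


def all_kmers(seq, min_len=8):
    """Every k-mer of at least min_len letters (assumed minimum length)."""
    n = len(seq)
    return [seq[s:e] for s in range(n) for e in range(s + min_len, n + 1)]


def occurrences(needle, haystack):
    """Count occurrences of needle in haystack, overlaps included."""
    last_start = len(haystack) - len(needle)
    return sum(haystack.startswith(needle, p) for p in range(last_start + 1))


def recurrence_counts(kmers):
    """Map each k-mer to (bigger-length count, ignore-length count).
    Assumed reading: ignore-length = bigger-length + smaller k-mers inside."""
    counts = {}
    for x in set(kmers):
        bigger = sum(occurrences(x, y) for y in kmers if len(y) > len(x))
        smaller = sum(occurrences(z, x) for z in kmers if len(z) < len(x))
        counts[x] = (bigger, bigger + smaller)
    return counts


def shared_counts(counts):
    """Group k-mers whose two counts are both identical."""
    groups = defaultdict(list)
    for kmer, pair in counts.items():
        groups[pair].append(kmer)
    return {pair: sorted(ks) for pair, ks in groups.items() if len(ks) > 1}


counts = recurrence_counts(all_kmers("GTGGGCCCAGAC"))
print(counts["GTGGGCCC"])      # (4, 4)
print(counts["GCCCAGAC"])      # (4, 4)
print(counts["GTGGGCCCAGAC"])  # (0, 14)
print(shared_counts(counts)[(4, 4)])  # ['GCCCAGAC', 'GTGGGCCC']
```

Subtracting the first value from the second gives the number of smaller k-mers found inside a k-mer. In this 12-letter toy sequence every k-mer occurs only once.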