Tuesday, December 6, 2016

k-mer wobble for TP53 signatures

We analyzed data from 12 TP53 transcripts by following 2765 theoretical k-mers, selected because they started with the same two letters in every transcript. We sorted these by their Sequence Start (in the origin DNA), then by iScore (col C), then by k-mer recurrence in the transcripts' bigger-length k-mers (col J) [column references are to the linked file].

In the chart below, transcript k-mers whose Position (in) iScore Vector (PiV) stayed stable from one k-mer to the next produced uninterrupted lines. k-mer stability or instability is a function of iScore (discussed above) and of the plotted sequence of k-mer PiV values.

We plotted each k-mer PiV for each transcript. For each transcript, each k-mer can occupy one of twelve positions, and each PiV position, taken across all transcripts, must be occupied 2765 times (see last chart below).
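
The occupancy rule above can be illustrated with a minimal sketch. It assumes that, for each k-mer, transcripts are ranked by descending iScore with ties broken by transcript name; the post does not state the sort direction or tie rule here, and the iScore values and the names `piv_positions` and `kmer_iscores` are made up for illustration.

```python
from collections import Counter

def piv_positions(iscores):
    """Map each transcript to its 1-based position for a single k-mer."""
    # Assumption: higher iScore ranks first; ties fall back to transcript name.
    ordered = sorted(iscores, key=lambda t: (-iscores[t], t))
    return {t: pos for pos, t in enumerate(ordered, start=1)}

# Hypothetical iScores: three k-mers, three transcripts.
kmer_iscores = [
    {"T1": 0.9, "T2": 0.5, "T3": 0.7},
    {"T1": 0.2, "T2": 0.8, "T3": 0.4},
    {"T1": 0.6, "T2": 0.3, "T3": 0.1},
]

occupancy = Counter()
for iscores in kmer_iscores:
    occupancy.update(piv_positions(iscores).values())

# Every position is filled exactly once per k-mer,
# so each position's total equals the number of k-mers.
assert all(occupancy[pos] == len(kmer_iscores) for pos in (1, 2, 3))
```

With 12 transcripts and 2765 k-mers the same reasoning gives 2765 occupants for each of the twelve positions.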




At row 2459 above (top left), there is a single-line perturbation (not visible in this image). We traced it to k-mer (#939134), where we discovered that the addition of "G" at nucleotide 1377 caused Transcript T793 to change order with T597 and T465. The "G" produced an iScore change that caused the order of the transcripts to flip.




The next image demonstrates the total PiV (row 1) distributions for all 2765 k-mers of each transcript (col N).



Thursday, December 1, 2016

Predicting Reverse Complements, Inverted Repeats and Mites

Reverse complements and inverted repeats are among the most compelling pieces of evidence supporting a Codondex variable k-mer computation. The table below, for men1 transcript 702, highlights some of our findings, which have been repeated.

Two main points to keep in mind: data on the left and right of the table were sorted on different column parameters. Columns B-K were sorted according to column L, the count of the Reverse Complement ("RCk-mer") discovered in the k-mers listed in column E. Columns N-X were sorted according to column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.






Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:

@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end. From this you can see that this last k-mer includes all the previous ones, making them redundant for this purpose.
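
The redundancy noted above is a simple containment check on start and end positions. A small sketch, using the six spans listed for row 31 (the function name `redundant_spans` is only illustrative):

```python
def redundant_spans(spans):
    """Return the spans that lie fully inside some other span in the list."""
    return [
        a for a in spans
        if any(b != a and b[0] <= a[0] and a[1] <= b[1] for b in spans)
    ]

# (start, end) pairs taken from row 31 of the table.
spans = [(0, 7), (1, 8), (0, 8), (2, 9), (1, 9), (0, 9)]

covered = redundant_spans(spans)
# Only (0, 9) is not contained in any other span.
assert (0, 9) not in covered and len(covered) == 5
```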

The Position (in) iScore Vector (PiV) (Row 25 and 30) of each of these k-mers = 11. The PiV for each RCk-mer = 9 or 10. There is no sequence overlap in the k-mers of columns P-X.

We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats). The total exceeds the k-mer count because multiple, shorter-length RCk-mers may be found in the same longer-length k-mer. We developed an algorithm to establish a training set that was more than 80% accurate and used Random Forest to predict the counts for the balance.
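
To make the counting concrete, here is a hedged sketch of how RCk-mers inside one k-mer might be tallied directly from sequence text. It is not the prediction algorithm described above (which uses iScore data only), and the minimum window length of 7 is borrowed from the variable k-mer description elsewhere on this blog as an assumption. Function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def count_rc_kmers(kmer, min_len=7):
    """Count windows of `kmer` whose reverse complement also occurs in `kmer`."""
    n = len(kmer)
    count = 0
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            window = kmer[start:start + length]
            if reverse_complement(window) in kmer:
                count += 1
    return count

# A 12-letter sequence that is its own reverse complement, so every window
# has its reverse complement at the mirrored position.
palindrome = "AAACCCGGGTTT"
assert reverse_complement(palindrome) == palindrome
print(count_rc_kmers(palindrome, min_len=4))  # 45 windows of length 4..12
```

Several short windows of one long k-mer can each contribute, which is why the RCk-mer total can be far larger than the number of k-mers.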

Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.















Tuesday, November 29, 2016

Position (in) iScore Vector exposes Intron - Protein Relationships?

16,904,172 theoretical k-mers were computed for 12 transcripts of TP53 downloaded from Ensembl. Each transcript's k-mer set contains 1,408,681 entries computed in the same nucleotide order. k-mers were associated with a signature of the transcript's protein. The protein signatures were computed using the same process.
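
As a quick arithmetic check of the figures above: 12 sets of 1,408,681 entries give the stated total, and 1,408,681 happens to be a triangular number, which is what one would expect when every contiguous window over a run of 1678 window sizes is enumerated. The sequence length and minimum k-mer length behind that 1678 are not stated in the post, so the second assertion is only an observation.

```python
TRANSCRIPTS = 12
KMERS_PER_TRANSCRIPT = 1_408_681

total = TRANSCRIPTS * KMERS_PER_TRANSCRIPT
assert total == 16_904_172

def window_count(n_sizes):
    """Windows produced when n_sizes consecutive window sizes are enumerated,
    the largest size fitting the sequence exactly once."""
    return n_sizes * (n_sizes + 1) // 2

assert window_count(1678) == KMERS_PER_TRANSCRIPT
```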

As previously described, a ranking based on recurrence was computed for each k-mer using iScore algorithms. We used a vector, which we call PiV, to assign each k-mer to a position in the order of the 12 k-mers, one for each transcript.

The chart represents, for each transcript, the distribution of its k-mers to positions 1-12 based on the iScore and protein signatures. For the 1,408,681 k-mers, at least one position in the vector is dominant in each transcript.
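
A dominant position can be read off a transcript's list of PiV values with a frequency count. The sketch below takes "dominant" to mean the most frequent position, which is an assumption, and uses made-up PiV values; `dominant_position` is an illustrative name.

```python
from collections import Counter

def dominant_position(piv_values):
    """Return (position, share) for the most frequent PiV position."""
    counts = Counter(piv_values)
    # On equal counts, prefer the lower position number.
    position, hits = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return position, hits / len(piv_values)

# Hypothetical PiV values of one transcript across six k-mers.
position, share = dominant_position([3, 3, 1, 3, 2, 3])
print(position, round(share, 2))  # 3 0.67
```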




Sunday, November 27, 2016

k-mers, Vectors, ncDNA (Intron) and Protein Relativity


Following a particular theory of distributed computing, we developed a method using vectors to expose relative differences in the sort order of subsequences (k-mers) of transcripts of non-coding DNA. Our expectation is that vectors, which have been used successfully to partition data and determine redundancy, can also be used to isolate ncDNA k-mers that played a role in the protein that the gene's coding DNA encoded.

We first associated each k-mer with the transcript's protein/mRNA sequence signature. Then we computed the vector at each [k-mer/protein signature] for all transcripts. We referenced each at the same start or end position in their base sequence. This provided a novel comparison for ranking k-mers of transcripts.

There are two ways to derive a vector from a k-mer. Transcripts in the image below labeled T1-T4 represent two different ordering methods. The first (left) is a constant, because every k-mer of each transcript will have the same protein signature. The second (right) is established using the iScore of the k-mer. A max-order rule is applied once to re-order transcripts with equal iScores, provided the re-order agrees with the protein signature order.



The image above represents a k-mer of four different transcripts. On the left, the protein signature sort yields a k-mer iScore vector (T3, T2, T4, T1) that differs from the protein signature vector (T2, T3, T4, T1). We refer to each transcript k-mer's position in this vector as the Protein hash Vector (PhV). On the right is the iScore vector (T1, T2, T3, T4), based on the k-mer sort and adjusted for max-order to the protein signature. We refer to the transcript's position here as Position (in) iScore Vector (PiV). Using these vectors and their differences, we rank k-mer/protein entries for each transcript in the set.
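
The two orderings can be sketched as two sorts, with the protein signature order acting as the tie-breaker for equal iScores. This is only one reading of the max-order rule; the sort directions, the numeric signatures, the iScores, and the per-transcript position difference computed at the end are all assumptions made for illustration.

```python
def protein_vector(signatures):
    """Order transcripts by protein signature (the PhV ordering)."""
    return sorted(signatures, key=lambda t: signatures[t])

def iscore_vector(iscores, phv):
    """Order transcripts by descending iScore; equal iScores follow PhV order."""
    phv_rank = {t: i for i, t in enumerate(phv)}
    return sorted(iscores, key=lambda t: (-iscores[t], phv_rank[t]))

# Hypothetical values for one k-mer position in four transcripts.
signatures = {"T1": 40, "T2": 10, "T3": 20, "T4": 30}
iscores = {"T1": 0.5, "T2": 0.9, "T3": 0.9, "T4": 0.95}

phv = protein_vector(signatures)    # ['T2', 'T3', 'T4', 'T1']
piv = iscore_vector(iscores, phv)   # ['T4', 'T2', 'T3', 'T1']

# Position difference per transcript (PhV position minus PiV position).
diff = {t: phv.index(t) - piv.index(t) for t in signatures}
print(diff)  # {'T1': 0, 'T2': -1, 'T3': -1, 'T4': 2}
```

T2 and T3 share an iScore, so their relative order is taken from the protein signature vector rather than left arbitrary.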

The actual result for the 1000th k-mer of 15 men1 transcripts can be seen here. Each k-mer can be queried independently. 





A new powerful visualization of DNA intertranscript Sequences

Visualizations are very important in interpreting and understanding data sets. In bioinformatics it is often challenging to visualize certain elements in DNA sequences. Our methods offer a powerful technique, tunable to one's needs, for visualizing DNA sequences.

Viewing and analyzing differences in the transcripts of the same gene can be an important way to compare different DNA sequences. To support this, advances in image processing and machine learning can further be used to gain insight and find patterns at scale.

To visualize our data, all one needs to do is find the center of each sequence, which indicates that sequence's middle or average position.

For example, if a sequence starts at position 310 and ends at 350, then the middle is (310 + 350)/2 = 330. For each transcript one can plot the Position (in) iScore Vector (PiV) value relative to the sequence's center value. One can also add another dimension, such as Diff, as the colour variable of each point.
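
The plotting coordinates described above reduce to a midpoint calculation. A minimal sketch, with hypothetical rows and field order (start, end, PiV, Diff); any plotting library can then draw the resulting points:

```python
def center(start, end):
    """Middle position of a subsequence in the base sequence."""
    return (start + end) / 2

def plot_points(rows):
    """Turn (start, end, piv, diff) rows into (x, y, colour) points."""
    return [(center(start, end), piv, diff) for start, end, piv, diff in rows]

# Hypothetical rows for one transcript.
rows = [(310, 350, 4, 1), (312, 340, 7, -2)]
print(plot_points(rows))  # [(330.0, 4, 1), (326.0, 7, -2)]
```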

The images below show the results for the 6 SET Ensembl transcripts and the 15 MEN1 Ensembl transcripts. In these images the maximum length was 50.

SET Transcripts



MEN1 Transcripts


In the image above one can see the MEN1 transcripts. Certain groupings were made based on image similarity. The numbers indicate transcripts with the same protein amino acid sequence: there are five ones (1), three twos (2) and two threes (3). One can see that transcripts sharing the same protein sequence produce similar images.





Subsequence Recurrence and Inherent Length Bias



As reported, using our sequence amplification tool, every possible subsequence of a single-strand DNA sequence is represented and independently analyzed. One example, a TP53 intron transcript of 400 nucleotides (letters), produced more than 77,000 subsequences containing a total of 10,735,712 letters.

We queried and counted the recurrence of every subsequence among the letters of every subsequence using bigger-length and ignore-length parameters. To normalize length bias we introduced iScore, which divides the subsequence recurrence count by the subsequence length according to the following formula:

iScore(Subsequence_i) = (Offset_Count_Ignore_Length(Subsequence_i) - Offset_Count_Bigger_Length(Subsequence_i)) ÷ Length(Subsequence_i)
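
The formula translates directly into a small function. The two recurrence counts are taken as inputs here, because the post does not spell out how they are produced; the example numbers are made up.

```python
def iscore(count_ignore_length, count_bigger_length, length):
    """Length-normalized recurrence, following the formula above."""
    if length <= 0:
        raise ValueError("subsequence length must be positive")
    return (count_ignore_length - count_bigger_length) / length

# Hypothetical counts for an 11-letter subsequence.
print(iscore(120, 87, 11))  # 3.0
```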

To our surprise, when we sorted all subsequences by iScore, more than 95% maintained their length groupings. Moreover, each subsequence has a unique recurrence, inherent in its letters and relative to the letters of every other subsequence from the sequence.

men1 iScore subsequences
We further confirmed that iScore can accurately identify functional and known RNA sequences reported in previous research.

For instance, using the men1 gene's iScore report for Transcript 313, which as stated has normalized length, we identified that at the 10-11 length change junction a group of subsequences that retained their length group offered up meaningful information.

For men1, two short subsequences in the junction were functional. When our further analysis revealed that both short subsequences, AGCCTTGTGAG and GTGGAATCTT, were located in a single longer subsequence, we were able to discover the RNA strand compilation in the image on the right.
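
Locating a single longer subsequence that holds both short ones is a plain text search. A sketch follows, using the two subsequences named above embedded in a made-up host sequence (the real men1 transcript is not reproduced here); `find_all` and `shortest_span` are illustrative names.

```python
def find_all(sequence, motif):
    """Start positions of every (possibly overlapping) occurrence of motif."""
    hits, start = [], sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)
    return hits

def shortest_span(sequence, motif_a, motif_b):
    """Shortest (start, end) window containing both motifs, or None."""
    best = None
    for a in find_all(sequence, motif_a):
        for b in find_all(sequence, motif_b):
            start = min(a, b)
            end = max(a + len(motif_a), b + len(motif_b))
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
    return best

# Hypothetical host sequence built around the two reported subsequences.
host = "TTT" + "AGCCTTGTGAG" + "CCCC" + "GTGGAATCTT" + "AA"
span = shortest_span(host, "AGCCTTGTGAG", "GTGGAATCTT")
print(span, host[span[0]:span[1]])
```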









Inclusive DNA sequences - new metrics with Codondex

Sequence Text Search using Codondex

At Codondex, one of the central metrics we have developed is known as iScore. Please see iScore Documentation to view how we calculate iScore.

Our subsequences are like kmers, but of variable length: we start from (L)ength 7. Assuming a sequence of L=400, we compute kmers of length 399, 398, 397 ... 7. Then, for every kmer, we query the kmer subsequence in all kmers and count its recurrence.
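
The enumeration and the recurrence query can be sketched in a few lines. This is a brute-force illustration on a short made-up sequence, not the production algorithm; it assumes the full-length sequence is included as the longest window and that a kmer counts itself as one recurrence.

```python
def variable_kmers(sequence, min_len=7):
    """Every contiguous window of length min_len up to the full sequence."""
    n = len(sequence)
    return [sequence[s:s + k] for k in range(min_len, n + 1) for s in range(n - k + 1)]

def kmer_total(seq_len, min_len=7):
    """Closed-form count of the windows produced by variable_kmers."""
    m = seq_len - min_len + 1
    return m * (m + 1) // 2 if m > 0 else 0

def recurrence(kmer, kmers):
    """Number of kmers (itself included) that contain `kmer` as text."""
    return sum(1 for other in kmers if kmer in other)

kmers = variable_kmers("ACGTACGTAC")      # hypothetical 10-letter sequence
assert len(kmers) == kmer_total(10) == 10
print(recurrence("ACGTACG", kmers))       # 4
print(kmer_total(400))                    # 77815
```

Under these assumptions a 400-letter sequence yields 77,815 windows, in line with the "more than 77,000" figure reported for the 400-nucleotide TP53 intron example on this blog.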



The red text in the image above represents the variable length kmer, analyzed per offset/window.

The iScore algorithm will count all of these kmers as well as their recurrence in other kmers generated from other parts of the base sequence.

We also use a metric called the theoretical iScore, which is the recurrence due to the patterning algorithm. In this post we are not going to go into the mechanics of how it is calculated. We can then subtract the real iScore from this theoretical iScore to find how many recurrences come from other parts of the base transcript.
Using this system, we can then isolate the parts of the sequence with the highest recurrence in other parts of the base transcript, and select and filter by the length of the subsequence (kmer) and by start and end positions in the base sequence.
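
Since the mechanics are not given here, the following is only one possible reading of the idea, not the Codondex calculation: treat the "theoretical" recurrence of a kmer as the number of enumerated windows that contain it simply because they span its own position, count the windows that contain its text anywhere, and take the difference as recurrence contributed by other parts of the sequence. The sketch works on raw counts (no length normalization), takes observed minus positional so the result is non-negative, and uses a made-up sequence and a short minimum length for readability.

```python
def enumerate_windows(sequence, min_len):
    """(start, end) of every contiguous window of at least min_len letters."""
    n = len(sequence)
    return [(s, s + k) for k in range(min_len, n + 1) for s in range(n - k + 1)]

def positional_count(start, length, seq_len):
    """Windows that span the kmer's own position (assumes length >= min_len)."""
    return (start + 1) * (seq_len - start - length + 1)

def observed_count(sequence, start, length, min_len):
    """Windows whose text contains the kmer anywhere."""
    text = sequence[start:start + length]
    return sum(1 for s, e in enumerate_windows(sequence, min_len) if text in sequence[s:e])

sequence = "ACGTTACG"            # hypothetical; "ACG" occurs at 0 and at 5
start, length, min_len = 0, 3, 3

positional = positional_count(start, length, len(sequence))
observed = observed_count(sequence, start, length, min_len)
print(positional, observed, observed - positional)  # 6 11 5
```

The five extra windows are exactly those that reach the second copy of "ACG" without covering the first.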

Here is a sample table which was developed using these metrics with some filtering.