Tuesday, December 6, 2016

k-mer wobble for TP53 signatures

We analyzed data from 12 P53 transcripts by following 2765 theoretical k-mers, selected because they started with the same two letters in every transcript. We sorted these by their Sequence Start (in the origin DNA), then by iScore (col C), then by k-mer recurrence in the transcripts' longer k-mers (col J). Column references are to the linked file.

In the chart below, transcript k-mers whose position in iScore vector (PiV) stayed stable from one k-mer to the next produced uninterrupted lines. k-mer stability or instability is a function of iScore (discussed above) and shows up in the plot of the sequence of k-mer PiV values.

We plotted each k-mer PiV for each transcript. For each transcript, each k-mer can occupy one of twelve positions, and across all transcripts each PiV must be occupied 2765 times (see last chart below).
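The occupancy rule above can be checked with a small sketch. It assumes that a k-mer's PiV for a transcript is simply that transcript's rank when all transcripts are sorted by iScore for that k-mer (ascending here, ties broken by transcript index). The function names and the toy scores are hypothetical and are not taken from the linked file.

```python
from collections import Counter


def piv_table(iscores):
    """Map each k-mer id to a list of 1-based PiV values, one per transcript.

    iscores: dict of k-mer id -> list of iScore values, one per transcript.
    Assumption: PiV is the rank of the transcript when sorted by iScore.
    """
    table = {}
    for kmer_id, scores in iscores.items():
        # Order transcript indices by score; ties fall back to the index.
        order = sorted(range(len(scores)), key=lambda t: (scores[t], t))
        piv = [0] * len(scores)
        for position, transcript in enumerate(order, start=1):
            piv[transcript] = position
        table[kmer_id] = piv
    return table


def occupancy(table):
    """Count how many times each PiV is occupied over all k-mers."""
    counts = Counter()
    for piv in table.values():
        counts.update(piv)
    return counts


# Toy data: 3 k-mers scored in 3 transcripts (made-up numbers).
toy = {"k1": [0.3, 0.1, 0.2], "k2": [0.5, 0.9, 0.7], "k3": [0.2, 0.2, 0.1]}
table = piv_table(toy)
counts = occupancy(table)
print(table)
print(counts)
```

Because every k-mer assigns each position to exactly one transcript, each position is occupied once per k-mer. With 12 transcripts and 2765 k-mers, that gives 2765 occupancies for each of the twelve positions.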

At row 2459 above (top left), there is a single-line perturbation (not visible in this image). We traced it to k-mer (#939134), where we discovered that the addition of "G" at nucleotide 1377 caused Transcript T793 to change order with T597 and T465. The result of "G" was an iScore change that caused the order of transcripts to flip.

The next image shows the total PiV (row 1) distributions for all 2765 k-mers of each transcript (col N).

Thursday, December 1, 2016

Predicting Reverse Complements, Inverted Repeats and Mites

Reverse complements and inverted repeats are among the most compelling pieces of evidence supporting a Codondex variable k-mer computation. The table below, for men1 transcript 702, highlights some of our findings, which have been repeated.

Two main points to keep in mind: the data on the left and right of the table were sorted on different column parameters. Columns B-K were sorted according to column L, the count of the Reverse Complement ("RCk-mer") discovered in the k-mers listed in column E. Columns N-X were sorted according to column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.
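To make the two sort keys concrete, here is a minimal sketch of how such counts could be computed. It assumes "found in" means plain substring containment and that sequences use the DNA alphabet A, C, G, T. The helper names and the toy k-mer list are hypothetical.

```python
# Translation table for the DNA complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_containing(query, kmers):
    """Number of k-mers in `kmers` that contain `query` as a substring."""
    return sum(1 for kmer in kmers if query in kmer)


def kmer_counts(kmers):
    """Analogue of the column V count: each k-mer found in all k-mers."""
    return {k: count_containing(k, kmers) for k in kmers}


def rc_counts(kmers):
    """Analogue of the column L count: each k-mer's RC found in all k-mers."""
    return {k: count_containing(reverse_complement(k), kmers) for k in kmers}


# Toy list standing in for column E (made-up sequences).
kmers = ["ACGT", "AAC", "GTT", "AACGTT"]
print(kmer_counts(kmers))
print(rc_counts(kmers))
```

Sorting one copy of the table by the first count and another copy by the second reproduces the idea of the left and right halves being ordered on different columns.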

Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:

@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end

From this you can see that the k-mer at N31 includes all of the previous k-mers, making them redundant for this purpose.
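The redundancy can be verified mechanically by treating each entry as a (start, end) span and checking containment. This is only a sketch: the function name is hypothetical, and whether the end position is inclusive does not affect the containment test.

```python
def redundant_spans(spans):
    """Return the spans that lie fully inside some other, different span."""
    return [
        (start, end)
        for (start, end) in spans
        if any(
            other_start <= start and end <= other_end
            and (other_start, other_end) != (start, end)
            for (other_start, other_end) in spans
        )
    ]


# Start and end positions listed for D31, F31, H31, J31, L31 and N31.
spans = [(0, 7), (1, 8), (0, 8), (2, 9), (1, 9), (0, 9)]
covered = redundant_spans(spans)
maximal = [span for span in spans if span not in covered]
print(covered)
print(maximal)
```

Only the 0...9 span survives as maximal; the other five are covered by it, directly or through an intermediate span.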

The Position (in) iScore Vector (PiV) (rows 25 and 30) of each of these k-mers is 11. The PiV for each RCk-mer is 9 or 10. There is no sequence overlap in the k-mers of columns P-X.

We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats). The total is larger than the k-mer count because multiple, shorter RCk-mers may be found in the same longer k-mer. We developed an algorithm to establish a training set that was more than 80% accurate and used Random Forest to predict the counts for the balance.
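As a rough illustration of why the RCk-mer total exceeds the k-mer count, the sketch below enumerates every contiguous k-mer of a sequence and counts, inside one k-mer, the shorter pieces whose reverse complement also occurs in that same k-mer. Both the counting rule and the minimum length of 2 are assumptions made for illustration, not the method used to build the table. One arithmetic observation: a sequence of length n has n(n+1)/2 contiguous k-mers, and 227,475 equals that value for n = 674.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def all_kmers(seq, min_k=1):
    """Yield (start, k-mer) for every contiguous k-mer of length >= min_k."""
    n = len(seq)
    for k in range(min_k, n + 1):
        for start in range(n - k + 1):
            yield start, seq[start:start + k]


def kmer_total(n):
    """Number of contiguous k-mers (all lengths) in a sequence of length n."""
    return n * (n + 1) // 2


def rc_hits(kmer, min_k=2):
    """Count sub-k-mers of `kmer` whose reverse complement is also inside it.

    Assumption: each occurrence counts once, and a piece that is its own
    reverse complement counts as a hit.
    """
    return sum(
        1
        for _, piece in all_kmers(kmer, min_k)
        if reverse_complement(piece) in kmer
    )


print(kmer_total(674))    # 227475
print(rc_hits("AACGTT"))  # 15
print(rc_hits("AAAA"))    # 0
```

A single 6-base k-mer such as AACGTT already yields 15 hits under this rule, which shows how nested shorter repeats can multiply the total well beyond the number of k-mers.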

Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.