## Thursday, December 1, 2016

### Predicting Reverse Complements, Inverted Repeats and MITEs

Reverse complements and inverted repeats are among the most compelling pieces of evidence supporting a Codondex variable k-mer computation. The table below, for men1 transcript 702, highlights some of our findings, which have been repeated.

Two main points to keep in mind: the data on the left and right of the table were sorted on different column parameters. Columns B-K were sorted according to column L, the count of the Reverse Complement ("RCk-mer") discovered in the k-mers listed in column E. Columns N-X were sorted according to column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.
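To make the two counts concrete, here is a minimal Python sketch of a reverse-complement lookup over a list of k-mers. The function names and the toy k-mers are illustrative assumptions, not Codondex code, and the count shown is only one plausible reading of the column L figure.

```python
# Translation table for DNA base complements (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def rc_count(kmer: str, kmers: list) -> int:
    """Count how many k-mers in `kmers` contain the reverse complement of `kmer`.

    Assumed definition, for illustration only.
    """
    rc = reverse_complement(kmer)
    return sum(1 for other in kmers if rc in other)


# Toy data, not taken from the men1 table.
kmers = ["GAATTC", "AACG", "CGTT", "TTTAACG"]

# Rank k-mers by how often their reverse complement is found, highest first.
ranked = sorted(kmers, key=lambda k: rc_count(k, kmers), reverse=True)
```

Sorting the same rows by a different key, such as a plain occurrence count, would give the second ordering described above.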

Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:

@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end

From this you can see that the 0...9 k-mer at N31 includes all of the previous k-mers, making them redundant for this purpose.
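The containment can be checked mechanically. The sketch below uses the start and end positions listed for row 31; the dictionary layout and helper names are assumptions made for illustration.

```python
# Start/end positions from row 31, keyed by cell label.
spans = {
    "D31": (0, 7),
    "F31": (1, 8),
    "H31": (0, 8),
    "J31": (2, 9),
    "L31": (1, 9),
    "N31": (0, 9),
}


def contains(outer, inner):
    """True if the inner (start, end) span lies within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def redundant_spans(spans):
    """Return labels whose span lies inside some other, different span."""
    return sorted(
        label
        for label, span in spans.items()
        if any(
            other_span != span and contains(other_span, span)
            for other_span in spans.values()
        )
    )


def covering_spans(spans):
    """Return labels whose span contains every span in the table."""
    return [
        label
        for label, span in spans.items()
        if all(contains(span, other) for other in spans.values())
    ]
```

With these positions, `covering_spans(spans)` returns only `N31`, and `redundant_spans(spans)` returns the other five labels.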

The Position (in) iScore Vector (PiV) (rows 25 and 30) of each of these k-mers is 11. The PiV for each RCk-mer is 9 or 10. There is no sequence overlap in the k-mers of columns P-X.

We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats). The total exceeds the k-mer count because multiple shorter RCk-mers may be found in the same longer k-mer. We developed an algorithm to establish a training set that is more than 80% accurate and used Random Forest to predict the counts for the balance.
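As a rough illustration of why the RCk-mer total can exceed the k-mer total, the Python sketch below enumerates every contiguous substring of a sequence and counts reverse-complement matches inside each one. The counting rule, the `min_len` cutoff and all function names are assumptions, not the Codondex algorithm. One observation, also an assumption about how the k-mers were generated: 227,475 equals 674 × 675 / 2, the number of contiguous substrings of a 674-base sequence.

```python
# Translation table for DNA base complements (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def all_kmers(seq):
    """Yield every contiguous substring (variable-length k-mer) of seq."""
    n = len(seq)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield seq[start:end]


def kmer_total(n):
    """Number of contiguous substrings of a sequence of length n."""
    return n * (n + 1) // 2


def rc_kmer_count(kmer, min_len=4):
    """Count sub-k-mers of at least min_len bases whose reverse complement
    also occurs somewhere in the same k-mer (assumed definition)."""
    return sum(
        1
        for sub in all_kmers(kmer)
        if len(sub) >= min_len and reverse_complement(sub) in kmer
    )


# Toy example: a 6-base palindromic k-mer holds several nested RC matches.
nested = rc_kmer_count("AACGTT")
```

Because a longer k-mer can hold several shorter matches at once, summing this count over all k-mers can give a total well above the number of k-mers.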

Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.