Two main points to keep in mind: The data on the left and right of the table were sorted on different columns. Columns B-K were sorted according to column L, the count of Reverse Complements ("RCk-mers") discovered in the k-mers listed in column E. Columns N-X were sorted according to column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.
Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:
@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end
From this you can see that all of the previous k-mers are contained in this last one, making them redundant for this purpose.
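The containment argument can be checked mechanically. The following is a minimal sketch, assuming the row 31 values are inclusive start and end positions; the dictionary `spans` and the helper `is_contained` are illustrative names, not part of the original spreadsheet.

```python
# Start/end positions as read from row 31 (cells D31 through N31).
# Assumption: both positions are inclusive base indices.
spans = {
    "D31": (0, 7),
    "F31": (1, 8),
    "H31": (0, 8),
    "J31": (2, 9),
    "L31": (1, 9),
    "N31": (0, 9),
}


def is_contained(inner, outer):
    """Return True if the inner span lies entirely within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Every shorter k-mer whose span falls inside the N31 span is redundant.
longest = spans["N31"]
redundant = [
    cell
    for cell, span in spans.items()
    if cell != "N31" and is_contained(span, longest)
]
print(redundant)  # ['D31', 'F31', 'H31', 'J31', 'L31']
```

All five shorter spans fall inside 0...9, which is why only the N31 k-mer needs to be kept.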
The Position (in) iScore Vector (PiV) (rows 25 and 30) of each of these k-mers is 11. The PiV for each RCk-mer is 9 or 10. There is no sequence overlap in the k-mers of columns P-X.
We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats). The total is larger than the number of k-mers because multiple shorter RCk-mers may be found in the same longer k-mer. We developed an algorithm to establish a training set that is more than 80% accurate and used Random Forest to predict the counts for the balance.
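To illustrate why one k-mer can hold several RCk-mers, here is a minimal sketch of one possible counting rule. It is an assumption, not the counting method used for the figures above: it counts every window of at least `min_len` bases whose reverse complement also occurs somewhere in the same k-mer. The function names, the DNA alphabet (A, C, G, T), and the `min_len=2` default are all hypothetical.

```python
# Translation table for the DNA complement (assumes upper-case A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_rc_windows(kmer, min_len=2):
    """Count windows of kmer whose reverse complement also occurs in kmer.

    Assumed rule for illustration: each (start, length) window with
    length >= min_len counts once if its reverse complement is a
    substring of the same k-mer.
    """
    count = 0
    n = len(kmer)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            window = kmer[start:start + length]
            if reverse_complement(window) in kmer:
                count += 1
    return count


print(reverse_complement("AACG"))  # CGTT
print(count_rc_windows("ACGT"))    # 6
print(count_rc_windows("AAAA"))    # 0
```

Under this rule the four-base k-mer ACGT already yields six qualifying windows (AC, CG, GT, ACG, CGT, ACGT), which shows how the RCk-mer total can exceed the number of k-mers several times over.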
Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.