Sunday, November 27, 2016

Subsequence Recurrence and Inherent Length Bias

As reported, our sequence amplification tool represents and independently analyzes every possible subsequence of a single-strand DNA sequence. In one example, a TP53 intron transcript of 400 nucleotides (letters) produced more than 77,000 subsequences containing a total of 10,735,712 letters.
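The totals above can be checked with a short calculation. The sketch below assumes that "subsequence" means a contiguous run of letters and counts one subsequence per start position and length. The `min_len` parameter is a hypothetical knob, not a documented setting of the tool: the post does not state a minimum length, but the reported total of 10,735,712 letters is what the arithmetic gives when only lengths 8 through 400 are kept.

```python
def subsequence_totals(n, min_len=1):
    """Count contiguous subsequences of a length-n sequence and their total letters.

    min_len is a hypothetical parameter (an assumption, not from the post):
    subsequences shorter than min_len are skipped.
    """
    count = 0
    letters = 0
    for length in range(min_len, n + 1):
        occurrences = n - length + 1  # number of start positions for this length
        count += occurrences
        letters += occurrences * length
    return count, letters


print(subsequence_totals(400))     # all lengths 1..400 -> (80200, 10746800)
print(subsequence_totals(400, 8))  # lengths 8..400     -> (77421, 10735712)
```

With no minimum length, a 400-letter sequence yields 80,200 subsequences; with an assumed minimum of 8 it yields 77,421, which agrees with "more than 77,000" and reproduces the letter total exactly.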

We queried and counted the recurrence of every subsequence among the letters of every subsequence, using the bigger-length and ignore-length parameters. To normalize length bias we introduced iScore, which divides the subsequence recurrence count by the subsequence length according to the following formula:

iScore(Subsequence_i) = (Offset_Count_Ignore_Length(Subsequence_i) - Offset_Count_Bigger_Length(Subsequence_i)) ÷ Length(Subsequence_i)
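As a minimal sketch, the formula can be written as a small function. It takes the two recurrence counts as inputs because the post does not define exactly how the tool computes them. The function name `i_score`, its parameter names, and the numbers in the usage line are illustrative assumptions, not output from the tool.

```python
def i_score(ignore_length_count, bigger_length_count, length):
    """Length-normalized recurrence score for one subsequence.

    ignore_length_count: recurrence count with the ignore-length parameter
    bigger_length_count: recurrence count with the bigger-length parameter
    length:              number of letters in the subsequence
    (All three names are hypothetical; the counts come from the tool.)
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return (ignore_length_count - bigger_length_count) / length


# Made-up counts for a 10-letter subsequence, for illustration only.
print(i_score(120, 90, 10))  # -> 3.0
```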

To our surprise, when we sorted all subsequences by iScore, more than 95% maintained their length groupings. Moreover, each subsequence has a unique recurrence inherent in its letters and relative to the letters of every other subsequence from the sequence.

men1 iScore subsequences
We further confirmed that iScore can accurately identify functional RNA sequences known from previous research.

For instance, we used the men1 gene's iScore report for Transcript 313, which as stated is normalized for length. At the junction where the subsequence length changes from 10 to 11, a group of subsequences that retained their length grouping offered meaningful information.

For men1, two short subsequences in the junction were functional. When further analysis revealed that both short subsequences, AGCCTTGTGAG and GTGGAATCTT, were located within a single longer subsequence, we were able to discover the RNA strand compilation shown in the image on the right.
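The containment check described above can be sketched as a search for the shortest stretch of a sequence that holds both short subsequences. The helper names and the toy sequence below are assumptions made for illustration; the toy string is not the men1 transcript, and this is not necessarily how our tool locates the longer subsequence.

```python
def find_all(seq, motif):
    """Return every start index of motif in seq, including overlapping hits."""
    starts = []
    i = seq.find(motif)
    while i != -1:
        starts.append(i)
        i = seq.find(motif, i + 1)
    return starts


def shortest_span(seq, a, b):
    """Return the shortest contiguous piece of seq containing both a and b, or None."""
    best = None
    for i in find_all(seq, a):
        for j in find_all(seq, b):
            start = min(i, j)
            end = max(i + len(a), j + len(b))
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
    return None if best is None else seq[best[0]:best[1]]


# Toy sequence (made up) that embeds the two subsequences from the post.
toy = "TTAGCCTTGTGAGCCGTGGAATCTTAA"
print(shortest_span(toy, "AGCCTTGTGAG", "GTGGAATCTT"))
# -> AGCCTTGTGAGCCGTGGAATCTT
```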
