Monday, August 13, 2018

Deep k-mer - new dimension analysis

Some time ago we published a paper that expanded on the TP53 intron1 k-mer relationships we have been investigating using our algorithms. We described how TP53 response elements bind p53 monomers, and how more complex response elements bind and form p53 tetramers that support known transcription events.

Since then we have been conducting numerous laboratory tests in conjunction with Professor Noam Shomron at Tel Aviv University to confirm that the sequences we identified from our TP53 bioinformatic analysis produce predictable, precisely directed results in cells.

From our initial results, it appears we can elicit an important relationship between intron1 and the sequenced proteins of the same transcripts. Further, these relationships appear to be non-random, and they can be used to identify the highly specific intron1 DNA sequences that drive this non-randomness.

We previously published the chart below, indicating that MEN1 k-mers ordered into a 15-variant transcript vector were producing length bias despite our algorithm being length agnostic. The scatter graph plots k-mers (15 variants) by intra-transcript repeats:length (horizontal axis); charted by their repeats, the k-mers gather into vector color bands.

On the basis that repeats for (length)ATCG(count) would be expressed as (4)ATCG(3), (3)ATC(1), (3)TCG(1), (2)TC(3) and (2)CG(2), the count for (4)ATCG(3) equates with (2)TC(3).


15-variant MEN1 intron1 transcript - k-mer repeats

Relying on the unique ordering of each variant's k-mers in the transcript vector, we selected TP53 k-mers where the variant order in a vector changed most significantly compared to the previous vector. We discovered that the most disrupted vectors were caused by k-mers of very low lengths. By comparison, almost all vector positions in most vectors remained stable.

12 TP53 Intron1 Transcripts

Ordering in our vectors is a way to represent transcript k-mers where computation is sequential from the first oligo of intron1. Each k-mer contributes its vector ordering based on its relationship to the transcript's constant, the protein or mRNA signature, for each variant.





Sunday, January 15, 2017

Influential ncRNA Concentrations

Concentration dynamics influence DNA-Protein-RNA-Protein interactions in both the nucleus and cytoplasm and are a significant influencer of activity. It is becoming increasingly evident that extremely small changes, often in a small fraction of the concentrations that result from dosage-sensitive genes, can be immediately significant.
Subcellular concentrations are affected by elements including DNA, RNA, protein, metabolites, molecules and water. This concentrated environment can determine binding of particular elements in the pressure of condensed space. The effect on intrinsic protein disorder, including that of p53, is also concentration dependent and dosage sensitive. Double-stranded DNA association and dissociation, as influenced by epigenetic factors and induced by proteins, may play a role in organizing DNA's subsequence concentrations to be more or less favorable to protein binding, transcription efficiency and translation.


Within a chromosome, any subsequence of the DNA molecule can be thought of as a concentration. Each gene is then a particular pattern of inherent purine-pyrimidine concentrations, defined proximally on the DNA of the chromosome. Our interest in DNA concentration specifics therefore led us to review every adjacent combination of the four letters A, G, C or T within a gene of interest. Our DNA pattern algorithm computes every possible subsequence of >7 letters of a DNA sequence. It is sensitive to the smallest single nucleotide change in any subsequence >7, for all possible subsequences of a transcript.

From the transcribed DNA sequence GAGCTTCGAG (6) there can momentarily exist, as RNA derivative sub-sequences or sub-concentrations of >7 letters: GAGCTTCG (3) | AGCTTCGA (4) | GAGCTTCGA (4) | GCTTCGAG (3) | AGCTTCGAG (4). The number in parentheses is the maximum frequency with which the subsequence can be discovered in any >7 derivation of GAGCTTCGAG. Since the order of letters in DNA is fixed, our model represents the complete, potential set of weighted RNA segments that can be transcribed into the nucleus and cytoplasm.
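The enumeration in this example can be sketched in a few lines of Python. This is a minimal sketch, not the actual DNA pattern algorithm: the function names are hypothetical, and the weighting rule is an assumption. One reading that happens to reproduce the frequencies quoted for GAGCTTCGAG is to count, for each subsequence, how many members of the >7 set either contain it or are contained in it.

```python
def kmers_longer_than_7(seq, min_len=8):
    """Return every contiguous subsequence of seq with at least min_len letters."""
    n = len(seq)
    return [seq[i:i + k] for k in range(min_len, n + 1) for i in range(n - k + 1)]


def related_counts(kmers):
    """For each k-mer, count the k-mers in the set that contain it or that it contains.

    ASSUMPTION: this containment rule is only one reading of the "maximum
    frequency" described in the post; it is used here because it matches the
    numbers given for GAGCTTCGAG. It also assumes the k-mers are all distinct.
    """
    return {x: sum(1 for y in kmers if x in y or y in x) for x in kmers}


seq = "GAGCTTCGAG"
kmers = kmers_longer_than_7(seq)
weights = related_counts(kmers)
for kmer in kmers:
    print(kmer, weights[kmer])
```

For a sequence of n letters this produces (n - 7)(n - 6)/2 subsequences, so the set grows quadratically with transcript length.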

Because various concentrations of repeatedly transcribed DNA sequences will exist as RNA in the cell at various times, this is an exhaustive model against which a single gene-cell potential state can be computed. Measuring and sequencing gene-cell cytoplasmic RNA concentrations and comparing them to the gene-cell potential state provides a point of comparison. Used as a diagnostic, this may be a sensitive measure for associating non-coding DNA-RNA states with diseases.

Tuesday, December 6, 2016

k-mer wobble for TP53 signatures

We analyzed data from 12 TP53 transcripts by following 2765 theoretical k-mers, selected because they started with the same two letters in every transcript. We sorted these by their Sequence Start (in the origin DNA), then by iScore (col C), then by k-mer recurrence in the transcript's longer k-mers (col J) [column references are to the linked file].

In the chart below, transcript k-mers whose sequential Position in iScore Vector (PiV) was stable produced uninterrupted lines. k-mer stability or instability is a function of iScore (discussed above) and of the plotted sequence of k-mer PiV values.

We plotted each k-mer PiV for each transcript. For each transcript, each k-mer can occupy one of twelve positions, and each PiV position, taken across all transcripts, must be occupied 2765 times (see last chart below).
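The occupancy constraint can be checked with a short Python sketch. Everything here is illustrative: the scores are random stand-ins because the iScore computation is not described in this post, the transcript labels are hypothetical, and ranking from highest to lowest score (ties broken by label) is an assumption.

```python
import random
from collections import Counter


def positions_in_vector(scores):
    """Rank transcripts for one k-mer; position 1 is the highest score.

    ASSUMPTION: descending order with ties broken by transcript label.
    """
    ordered = sorted(scores, key=lambda t: (-scores[t], t))
    return {t: pos for pos, t in enumerate(ordered, start=1)}


random.seed(0)
transcripts = [f"T{i}" for i in range(1, 13)]  # 12 hypothetical labels
n_kmers = 2765

occupancy = Counter()        # (transcript, position) -> number of k-mers
position_totals = Counter()  # position -> number of k-mers, all transcripts

for _ in range(n_kmers):
    # Random stand-in for the 12 per-transcript iScores of one k-mer.
    scores = {t: random.random() for t in transcripts}
    for t, pos in positions_in_vector(scores).items():
        occupancy[(t, pos)] += 1
        position_totals[pos] += 1
```

Because every k-mer places exactly one transcript in each of the twelve positions, `position_totals` is 2765 for every position regardless of the scores. What varies between transcripts is how those 2765 placements are distributed over positions, which is what the distribution chart shows.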




At row 2459 above (top left), there is a single line perturbation (not visible in this image). We traced it to k-mer (#939134), where we discovered that the addition of "G" at nucleotide 1377 caused transcript T793 to change order with T597 and T465. The "G" produced an iScore change that caused the order of the transcripts to flip.




The next image shows the total PiV (row 1) distributions for all 2765 k-mers of each transcript (col N).



Thursday, December 1, 2016

Predicting Reverse Complements, Inverted Repeats and Mites

Reverse complements and inverted repeats are among the most compelling pieces of evidence supporting a Codondex variable k-mer computation. The table below, for MEN1 transcript 702, highlights some of our findings, which have been repeated.

Two main points to keep in mind: data on the left and right of the table were sorted on different column parameters. Columns B-K were sorted by column L, the count of the Reverse Complement ("RCk-mer") discovered in the k-mers listed in column E. Columns N-X were sorted by column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.






Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:

@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end

From this you can see that all of the previous k-mers are included in the last one, making them redundant for this purpose.

The Position (in) iScore Vector (PiV) (rows 25 and 30) of each of these k-mers is 11. The PiV for each RCk-mer is 9 or 10. There is no sequence overlap in the k-mers of columns P-X.

We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats). The total is larger because multiple, shorter RCk-mers may be found in the same longer k-mer. We developed an algorithm to establish a training set that was more than 80% accurate and used Random Forest to predict the counts for the balance.
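To make the counted quantity concrete, here is a minimal Python sketch of a reverse complement count inside a single k-mer. It is not the Codondex implementation: the function names are hypothetical, and the counting rule (every sub-k-mer occurrence of at least `min_len` letters whose reverse complement also occurs somewhere in the same k-mer) is an assumption about what an RCk-mer count means. The default of 8 letters follows the shortest k-mers listed above (start 0, end 7).

```python
# Translation table for DNA complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_rc_kmers(kmer, min_len=8):
    """Count sub-k-mer occurrences whose reverse complement is also in kmer.

    ASSUMPTION: this counting rule is illustrative only; the post does not
    define the RCk-mer count precisely.
    """
    n = len(kmer)
    count = 0
    for k in range(min_len, n + 1):
        for i in range(n - k + 1):
            if reverse_complement(kmer[i:i + k]) in kmer:
                count += 1
    return count


# Short palindromic example with a smaller minimum length for readability.
print(count_rc_kmers("GAATTC", min_len=4))
```

Under this rule a longer k-mer accumulates one count for each shorter reverse complement it holds, which is consistent with the total number of RCk-mers exceeding the number of k-mers.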

Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.















Tuesday, November 29, 2016

Position (in) iScore Vector exposes Intron - Protein Relationships?

16,904,172 theoretical k-mers were computed for 12 transcripts of TP53 downloaded from Ensembl. Each transcript's k-mer set contains 1,408,681 entries computed in the same nucleotide order. k-mers were associated with a signature of the transcript's protein. The protein signatures were computed using the same process.
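The two counts are consistent with each other, which a few lines of Python confirm. The helper below is hypothetical and the second observation is only arithmetic: 1,408,681 is a triangular number, which is the count you get when every contiguous subsequence above a minimum length is enumerated. The post does not state the sequence length or minimum length used, so the (length, minimum) pairs shown are examples that fit, not reported values.

```python
def substring_count(n, min_len=1):
    """Number of contiguous subsequences of length >= min_len in n letters."""
    m = n - min_len + 1
    return m * (m + 1) // 2 if m > 0 else 0


transcripts = 12
kmers_per_transcript = 1_408_681
total_kmers = transcripts * kmers_per_transcript
print(total_kmers)

# ASSUMPTION: illustrative pairs only; both give the per-transcript count.
print(substring_count(1678, min_len=1))
print(substring_count(1685, min_len=8))
```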

As previously described, a ranking based on recurrence was computed for each k-mer using iScore algorithms. We used a vector, which we call PiV, to assign each k-mer a position in the ordering of the 12 k-mers, one from each transcript.

The chart represents, for each transcript, the distribution of its k-mers to positions 1-12 based on the iScore and protein signatures. For the 1,408,681 k-mers, at least one position in the vector is dominant in each transcript.




Sunday, November 27, 2016

k-mers, Vectors, ncDNA (Intron) and Protein Relativity


Following a particular theory of distributed computing, we developed a method using vectors to expose relative differences in the sort order of subsequences (k-mers) of transcripts of non-coding DNA. Our expectation is that vectors, which have been used successfully to partition data and determine redundancy, can also be used to isolate ncDNA k-mers that played a role in the protein that the gene's coding DNA encoded.

We first associated each k-mer with the transcript's protein/mRNA sequence signature. Then we computed the vector at each [k-mer/protein signature] for all transcripts. We referenced each at the same start or end position in their base sequence. This provided a novel comparison for ranking k-mers of transcripts.

There are two ways to derive a vector from a k-mer. The transcripts labeled T1-T4 in the image below represent the two ordering methods. The first (left) is a constant, because every k-mer of a transcript has the same protein signature. The second (right) is established using the iScore of the k-mer. A max-order rule is applied once to re-order transcripts with equal iScores, provided the re-order agrees with the protein signature order.



The image above represents a k-mer of four different transcripts. On the left, the protein signature sort yields a k-mer iScore vector (T3, T2, T4, T1) that differs from the protein signature vector (T2, T3, T4, T1). We refer to each transcript k-mer's position in this as Protein hash Vector (PhV). On the right is the iScore vector (T1, T2, T3, T4), based on the k-mer sort and adjusted for max-order to the protein signature. We refer to the transcript position here as Position (in) iScore Vector (PiV). Using these vectors and their differences, we rank k-mer/protein entries for each transcript in the set.
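The iScore ordering with its tie rule can be illustrated with a small Python sketch. This is a sketch under stated assumptions, not the method itself: the iScore values are invented, the function name is hypothetical, descending order is assumed, and the max-order rule is read here as "transcripts with equal iScores fall back to their protein signature order".

```python
def iscore_vector(iscores, protein_order):
    """Order transcripts by iScore, breaking ties by protein signature order.

    ASSUMPTION: descending iScore, with the tie-break standing in for the
    max-order rule described in the post.
    """
    protein_rank = {t: i for i, t in enumerate(protein_order)}
    return sorted(iscores, key=lambda t: (-iscores[t], protein_rank[t]))


# Protein signature vector taken from the example in the text.
protein_order = ["T2", "T3", "T4", "T1"]
# Hypothetical iScores for one k-mer; T2 and T3 are deliberately tied.
iscores = {"T1": 0.2, "T2": 0.7, "T3": 0.7, "T4": 0.5}

vector = iscore_vector(iscores, protein_order)
piv = {t: pos for pos, t in enumerate(vector, start=1)}
print(vector, piv)
```

Comparing each transcript's position in `vector` with its position in `protein_order` gives a per-k-mer difference that could then feed a ranking of k-mer/protein entries.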

The actual result for the 1000th k-mer of 15 MEN1 transcripts can be seen here. Each k-mer can be queried independently.





A new powerful visualization of DNA intertranscript Sequences

Visualizations are very important in interpreting and understanding data sets. In bioinformatics it is often challenging to visualize certain elements of DNA sequences. Our methods offer a powerful technique for visualizing DNA sequences that can be tuned to one's needs.

Viewing and analyzing differences between transcripts of the same gene can be an important way to compare different DNA sequences. To support this, advances in image processing and machine learning can further be used to gain insight and find patterns in a scalable fashion.

To visualize our data, all one needs to do is find the center of each sequence, which indicates that sequence's middle, or average, position.

For example, if a sequence starts at position 310 and ends at 350, then the middle is (310 + 350)/2 = 330. For each transcript one can plot the Position in iVector value against the sequence's center value. One can also add another dimension, such as Diff, as the colour variable of each point.
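The preparation of plot points can be sketched as follows. The rows, field order and function names are hypothetical; only the midpoint formula and the choice of axes come from the text above.

```python
def center(start, end):
    """Middle (average) position of a sequence, e.g. (310 + 350) / 2 = 330."""
    return (start + end) / 2


def plot_points(rows):
    """Group (center, position_in_vector, diff) points by transcript."""
    points = {}
    for transcript, start, end, position, diff in rows:
        points.setdefault(transcript, []).append((center(start, end), position, diff))
    return points


# Hypothetical rows: (transcript, start, end, position in vector, diff).
rows = [
    ("T1", 310, 350, 3, 0),
    ("T1", 400, 420, 1, 2),
    ("T2", 310, 350, 2, 1),
]
points = plot_points(rows)
print(points["T1"])
```

Each transcript's list can then be passed to any scatter plot routine, with the center on the horizontal axis, the position on the vertical axis and Diff as the colour value.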

The images below show the results for the 6 SET Ensembl transcripts and the 15 MEN1 Ensembl transcripts. In these images the maximum length was 50.

SET Transcripts



MEN1 Transcripts


In the image above one can see the MEN1 transcripts. Certain groupings were made based on image similarity. The numbers indicate transcripts with the same protein amino acid sequence: there are five ones (1), three twos (2) and two threes (3). One can see that sequences with the same protein sequence produce similar images.