Monday, November 26, 2018

Mathematical vectors in biology

We built a model to determine whether a random or non-random relationship existed between introns and proteins of transcripts. We determined the relationship was overwhelmingly non-random and progressed to study particular genes in more detail.

Based on our studies we suggested that TP53 readily encodes specific isoform concentrations that alter next-generation transcription, and that introns play a significant role. Here we validate our selection logic and describe its proposed use in immunotherapy. From intron1 we computed more than 400,000 k-mers, from which we selected 8 short k-mers out of 12 TP53 and 29 BRCA1 transcripts. We synthesized the sequences, and in subsequent transfection experiments 3/3 TP53 and 3/5 BRCA1 sequences significantly (p<0.05) reduced the rate of proliferating HeLa cells.

Selection Background

For each transcript we first computed intron1 k-mers longer than 7 nucleotides (see image below; each k-mer has an Offset#). For every k-mer we computed a signature and associated it with a signature of the transcript's protein. In Offset# order, for each k-mer we ordered the transcripts according to the result of a k-mer:protein signature ordering. For each Offset# (k-mer) we recorded the order of the transcripts in a vector.

In Offset# (computation) order, we compared each vector with the next to discover any changes in the ordering of transcripts. After filtering k-mers for a length change, more than 90% of transcript ordering remained stable. Occasionally one or two transcripts changed position; very rarely, more than 75% of the transcripts in a vector changed position. When we discovered the few vectors with >75% change, we extracted them and subjected them to a selection algorithm that identified 8 short, 28-oligo sequences from the 41 transcripts processed.
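The comparison of consecutive vectors can be sketched as follows. This is a minimal illustration, not the Codondex selection algorithm: the transcript names and toy vectors are hypothetical, and "change" is assumed here to mean that a different transcript occupies a given position.

```python
def fraction_changed(prev_order, next_order):
    """Fraction of vector positions holding a different transcript than before."""
    if len(prev_order) != len(next_order):
        raise ValueError("vectors must hold the same number of transcripts")
    moved = sum(1 for a, b in zip(prev_order, next_order) if a != b)
    return moved / len(prev_order)


def disrupted_offsets(vectors, threshold=0.75):
    """Offsets whose vector differs from the previous one by more than threshold."""
    flagged = []
    for offset in range(1, len(vectors)):
        if fraction_changed(vectors[offset - 1], vectors[offset]) > threshold:
            flagged.append(offset)
    return flagged


# Toy data (assumption): four hypothetical transcripts at four consecutive offsets.
vectors = [
    ["T1", "T2", "T3", "T4"],
    ["T1", "T2", "T3", "T4"],  # stable
    ["T2", "T1", "T3", "T4"],  # two transcripts swap: 50% change
    ["T4", "T3", "T1", "T2"],  # every position changes: 100% change
]
print(disrupted_offsets(vectors))
```

Only the last vector exceeds the 75% threshold in this toy data, so it is the only one that would be passed on for selection.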

Codondex iScore™ ordering, comparison and selection algorithms rest on the idea that comparing transcripts at sequential k-mers is a compelling way to identify sequences that “stand out from the crowd” because they may be inherent upstream of transcription. Any k-mer has the potential to aggregate, or to contribute to the formation of coacervates, in a sequence- and length-dependent manner. In the image above, red text represents our computation of the first 14 of 135 potential k-mers of the identical 23-oligo sequence. For each k-mer, all k-mers of the 23-oligo sequence are queried (in both directions) and repetitions counted. For example, of the 14 k-mers, the k-mer at Offset#0 can also be found in Offsets #2, 5, 9 and 14.

In the compound computation of the 23-letter sequence, Offset#0 GTGGGAAT is repeated in 16 other k-mers and Offset#135 GTGGGAATCTTATCCATGACCCA has 136 k-mers repeated in it (including itself). When looking at the entire intron sequence (or any long sequence) there is never a linear progression of k-mers; inevitably the counts become disordered.
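The offsets quoted for the 23-letter sequence can be reproduced with a short sketch. The enumeration order is an assumption inferred from the numbers in the text: the sequence grows one letter at a time and, for each new end position, the shortest k-mer is emitted first. The function name is hypothetical.

```python
def kmers_in_offset_order(seq, min_len=8):
    """List (start, end, kmer) for every k-mer of at least min_len letters.

    Assumed order: for each end position (left to right), shortest k-mer first.
    """
    out = []
    for end in range(min_len, len(seq) + 1):        # exclusive end index
        for start in range(end - min_len, -1, -1):  # shortest to longest
            out.append((start, end, seq[start:end]))
    return out


seq = "GTGGGAATCTTATCCATGACCCA"
kmers = kmers_in_offset_order(seq)
first = kmers[0][2]   # Offset#0
last = kmers[-1][2]   # final offset, the full sequence

# Other offsets among the first 15 whose k-mer contains the Offset#0 k-mer.
hits = [i for i, (_, _, k) in enumerate(kmers[:15]) if i > 0 and first in k]
# k-mers that contain Offset#0 (itself included), and k-mers found in the last offset.
containing_first = sum(1 for _, _, k in kmers if first in k)
inside_last = sum(1 for _, _, k in kmers if k in last)

print(len(kmers), first, hits, containing_first, inside_last)
```

Under this ordering Offset#0 is GTGGGAAT, it reappears in Offsets 2, 5, 9 and 14, and the final offset (#135) is the full sequence containing all 136 k-mers, which matches the figures given above.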

In the following example of ordering transcript computations for a single Offset#, each result has been ordered in a vector of 15 men1 transcripts. Each protein signature is constant for every Offset# because the signature is computed from the entire string. Some protein signatures are identical, but not intron signatures. Transcripts with identical protein signatures are preferentially sorted to give final order to transcripts in the vector.

15 transcripts, for a single k-mer (Offset#) in a vector

In our detailed review of each transcript we discovered that the compound effect of k-mer repeats described an inherent structure of relationships between nucleotide lengths. We considered how varied transcription events would alter the representations of these non-coding oligo lengths in their ncRNA form. For example, Offset#135 included 136 repeats of k-mers, which statistically implies that it has a greater chance of survival and/or function in any of its constituent parts than a k-mer with fewer repeats.

As stated in the opening paragraph, we synthesized 8 short RNA selections made using our vectors to discover how they translated in biology. In future we intend to compute p53 (or other gene) transcripts from multiple samples of a patient biopsy. We will do this by separating cells into multiple wells, running RNA-seq on each well and computing the transcript position of each well in our p53 vectors. Once we identify the logarithmic proximity of each well's transcript to other transcripts, we will select a well. We will use the selected cells to educate natural killer cells extracted from the patient and return only the immune cells to the patient to reduce proliferation of diseased cells. We hope to bring this therapy to market in the next few years.

Sunday, September 23, 2018

∪k-mer - knockout!

We hold the view that DNA sequences inherently follow strict rules of diffusion, that spliced introns act on protein non-randomly, and that sequence order as well as length remain critical upstream of transcription. This formed the basis of our original data exploration, in which we first demonstrated the non-random relationship between introns and the ultimate protein product of their gene.

In subsequent experiments we used our bioinformatics to identify a specific profile of HeLa cell sequences that reduced proliferation. To discover these sequences, we computed more than 400,000 intron1 k-mers from 41 transcripts, then identified and transfected 6 short sequences that significantly reduced the growth of these cervical cancer cells. According to our RNA-seq analysis, growth was reduced by shifting replicative senescence to apoptosis, accompanied by mitochondrial hyper-function.

For transcript (T), we looked at any sequence (S) as having cell-wide potential to define or be re-defined during or after transcription. However, as DNA of a gene, S must be considered to possess a unique potential that will affect its interactions. Science knows very little about the catalog of possible effects that can be attributed to S. To model its potential qualities we first look to its structural arrangements for any length (n>7).

In our examples below, for S1=TGTGGGCCCACA and S2=GTGGGCCCAGAC we computed every possible k(n>7) and counted unions of k - ⋃k.

The Unions of k for S1 and S2

Why is this important? If you want to tie this computation to biological function, the base data has to represent the myriad underlying biological possibilities. To do this you need very low-level analytical resolution, some of which is represented in the images above.
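The post does not define ⋃k precisely, so the sketch below takes one possible reading: every k-mer longer than 7 letters is collected as a set for S1 and for S2, and the two sets are combined with ordinary set union and intersection. The function name is hypothetical and the interpretation is an assumption.

```python
def kmer_set(seq, min_len=8):
    """All distinct contiguous sub-sequences of at least min_len letters."""
    return {
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + min_len, len(seq) + 1)
    }


s1 = "TGTGGGCCCACA"
s2 = "GTGGGCCCAGAC"
k1, k2 = kmer_set(s1), kmer_set(s2)

union = k1 | k2    # every k-mer seen in either sequence
shared = k1 & k2   # k-mers present in both sequences

print(len(k1), len(k2), len(union))
print(sorted(shared, key=len))
```

Each 12-letter sequence yields 15 k-mers; three of them (GTGGGCCC, TGGGCCCA and GTGGGCCCA) occur in both, so the union holds 27 distinct k-mers.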

At Codondex we take a further step by associating every k with a hash signature (#s) of the transcript's protein or mRNA. For T, #s is constant and is associated with every k(n>7), giving the tuple k(n>7...S):#s. The combination of each intron1 k-mer with its transcript's protein signature becomes the input to a highly ordered vector. We use the vector to compare, across multiple transcripts of the same gene, how stable the order of k-mer:protein signatures remains at the next nucleotide.
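The pairing of each k-mer with a constant transcript signature can be illustrated as below. The actual Codondex signature is not described in this post, so SHA-256 is used purely as a stand-in; the function names and the short peptide string are hypothetical placeholders.

```python
import hashlib


def protein_signature(protein):
    # Stand-in only (assumption): SHA-256 is NOT the Codondex signature.
    return hashlib.sha256(protein.encode("ascii")).hexdigest()[:16]


def kmer_signature_tuples(intron, protein, min_len=8):
    """Pair every k-mer longer than 7 letters with the one protein signature."""
    sig = protein_signature(protein)  # computed once from the entire string
    return [
        (intron[start:end], sig)
        for end in range(min_len, len(intron) + 1)
        for start in range(end - min_len, -1, -1)
    ]


# Toy inputs (assumption): S1 from the example above and a placeholder peptide.
tuples = kmer_signature_tuples("TGTGGGCCCACA", "MEEPQSDPSV")
for kmer, sig in tuples[:3]:
    print(kmer, sig)
```

Because the signature is computed from the whole protein string, it is identical in every tuple of a transcript; only the k-mer varies.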

Our results show that, at the next nucleotide, order changes in the vector are rare and are strongly biased toward k-mers of shorter lengths. Further, the transcripts with the most changes in vector position are definitive for the sequence selections that confer an anti-proliferation effect in transfection experiments.

We anticipate cells approximating these k-mer:protein anti-proliferation signatures will be most useful in the fight against proliferating diseased cells. To test usefulness we will co-culture these cells to precondition Natural Killer cells with conforming and non-conforming receptor-ligand relationship sensitivity. These Natural Killer cells will be tested against HeLa to determine whether recognition and optimal immune response can be triggered.

Stay posted for more updates on our exciting discoveries and computations for minimally manipulated autologous therapy against patient disease.

Monday, August 13, 2018

Deep k-mer - new dimension analysis

Some time ago we published a paper that expanded on the TP53 intron1 k-mer relationships we have been investigating using our algorithms. We described how TP53 response elements bind p53 monomers, and how more complex response elements bind and form p53 tetramers that support known transcription events.

Since that time we have been conducting numerous laboratory tests in conjunction with Professor Noam Shomron at Tel Aviv University to confirm that the sequences we identified with our TP53 bioinformatics produced predictable results that were precisely directed in cells.

From our initial results, it appears we can elicit an important relationship between intron1 and the sequenced proteins of the same transcripts. Further, these relationships appear to be non-random and can be used to identify the highly specific DNA intron1 sequences that drive this non-randomness.

We previously published the chart below indicating that men1 k-mers ordered into a 15-variant transcript vector were producing length bias despite our algorithm being length agnostic. The scatter-graph is a plot of k-mers (15 variants) by intra transcript-repeats:length (horizontal axis) that gather into vector color bands by charting the k-mers repeats.

On the basis that repeats for (length)ATCG(count) would be expressed as (4)ATCG(3), (3)ATC(1), (3)TCG(1), (2)TC(3) and (2)CG(2), the count for (4)ATCG(3) equates with (2)TC(3).


15 variant men1 Intron1 Transcript - kmer repeats

Relying on the unique ordering of each variant's k-mers in the transcript vector, we selected TP53 k-mers where the variant order in a vector changed most significantly compared to the previous vector. We discovered that the most disrupted vectors were caused by k-mers of very short lengths. By comparison, almost all vector positions in most vectors remained stable.

12 TP53 Intron1 Transcripts

Ordering in our vectors is a way to represent transcript k-mers where computation is sequential from the first oligo of intron1. Each k-mer contributes its vector ordering based on its relationship to the constant protein or mRNA signature of each variant's transcript.

Sunday, January 15, 2017

Influential ncRNA Concentrations

Concentration dynamics influence DNA-protein and RNA-protein interactions in both the nucleus and cytoplasm and are a significant driver of activity. It is becoming increasingly evident that extremely small changes, often in a small fraction of the concentrations that result from dosage-sensitive genes, can be immediately significant.

Subcellular concentrations are affected by elements including DNA, RNA, protein, metabolites, molecules and water. This concentrated environment can determine the binding of particular elements under the pressure of condensed space. The effects on intrinsic protein disorder, including that of p53, are also concentration dependent and dosage sensitive. Double-stranded DNA association and dissociation, as influenced by epigenetic factors and induced by proteins, may play a role in organizing DNA's subsequence concentrations to be more or less favorable to protein binding, transcription efficiency and translation.


Within a chromosome, any subsequence of the DNA molecule can be thought of as a concentration. Each gene is then a particular pattern of inherent purine-pyrimidine concentrations, defined by proximity on the DNA of the chromosome. Our interest in the specifics of DNA concentration therefore led us to review every adjacent combination of the four letters A, G, C and T within a gene of interest. Our DNA pattern algorithm computes every possible subsequence of more than 7 letters in a DNA sequence. It is sensitive to the smallest single-nucleotide change in any subsequence >7, for all possible subsequences of a transcript.

From the transcribed DNA sequence GAGCTTCGAG(6) there can momentarily exist, as RNA derivatives, the following sub-sequences or sub-concentrations >7: GAGCTTCG(3), AGCTTCGA(4), GAGCTTCGA(4), GCTTCGAG(3) and AGCTTCGAG(4). The number in parentheses is the maximum frequency with which the subsequence can be discovered in any >7 derivation of GAGCTTCGAG. Since the order of letters in DNA is fixed, our model represents the complete, potential set of weighted RNA segments that can be transcribed into the nucleus and cytoplasm.
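The frequencies quoted for GAGCTTCGAG can be reproduced with a short sketch. The counting rule is an inference from the quoted numbers rather than a definition given in the post: for each sub-sequence longer than 7 letters, count the sub-sequences that contain it or are contained in it, with the sub-sequence itself counted once. The function name is hypothetical.

```python
def related_count(kmer, all_kmers):
    """Count k-mers that contain `kmer` or are contained in it (itself once)."""
    return sum(1 for other in all_kmers if kmer in other or other in kmer)


seq = "GAGCTTCGAG"
# Every contiguous sub-sequence longer than 7 letters.
all_kmers = [
    seq[i:j]
    for i in range(len(seq))
    for j in range(i + 8, len(seq) + 1)
]
counts = {k: related_count(k, all_kmers) for k in all_kmers}

for kmer, n in counts.items():
    print(kmer, n)
```

Under this assumed rule the six sub-sequences score 3, 4, 4, 3, 4 and 6, the same weights listed for them above.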

Because various concentrations of repeatedly transcribed DNA sequences will exist as RNA in the cell at various times, this is an exhaustive model against which a single gene-cell potential state can be computed. Measuring and sequencing gene-cell cytoplasmic RNA concentrations and comparing them to the gene-cell potential state provides a point of comparison. Used as a diagnostic, this may be a sensitive measure for associating non-coding DNA-RNA states with diseases.

Tuesday, December 6, 2016

k-mer wobble for TP53 signatures

We analyzed data from 12 p53 transcripts by following 2765 theoretical k-mers, selected because they started with the same two letters in every transcript. We sorted these by their Sequence Start (in the origin DNA), then by iScore (col C), then by k-mer recurrence in the transcript's longer k-mers (col J) [column references are to the linked file].

In the chart below, transcript k-mers whose position in iScore vector (PiV) stayed stable from one k-mer to the next produced uninterrupted lines. k-mer stability or instability is a function of iScore (discussed above) and of the plotted sequence of k-mer PiV values.

We plotted each k-mer PiV for each transcript. For each transcript, each k-mer can occupy one of twelve positions, and each PiV for all transcripts must be occupied 2765 times (see last chart below).
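The bookkeeping behind this constraint can be sketched with toy data. The iScore computation itself is not described here, so random numbers stand in for the scores (an assumption), and the function name is hypothetical; only the ranking and counting steps are illustrated.

```python
import random
from collections import Counter


def piv_distribution(scores):
    """scores[k][t] is a stand-in score for k-mer k in transcript t.

    Returns one Counter per transcript mapping position (1-based) to the
    number of k-mers at which the transcript held that position.
    """
    n_transcripts = len(scores[0])
    dist = [Counter() for _ in range(n_transcripts)]
    for row in scores:
        # Rank the transcripts for this k-mer; every position is used once.
        order = sorted(range(n_transcripts), key=lambda t: row[t])
        for position, t in enumerate(order, start=1):
            dist[t][position] += 1
    return dist


random.seed(0)
n_kmers, n_transcripts = 2765, 12
# Random placeholder scores (assumption), not real iScore values.
scores = [[random.random() for _ in range(n_transcripts)] for _ in range(n_kmers)]
dist = piv_distribution(scores)

# Each position is occupied exactly once per k-mer, so 2765 times in total.
totals = [sum(d[p] for d in dist) for p in range(1, n_transcripts + 1)]
print(totals)
```

With random scores the positions spread roughly evenly across transcripts; the lines and dominant positions described in these posts would only appear with real scores.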




At row 2459 above (top left), there is a single line perturbation (not visible in this image). We traced it to k-mer (#939134), where we discovered that the addition of "G" at nucleotide 1377 caused Transcript T793 to change order with T597 and T465. The result of adding "G" was an iScore change that caused the order of the transcripts to flip.




The next image shows the total PiV (row 1) distributions for all 2765 k-mers of each transcript (col N).



Thursday, December 1, 2016

Predicting Reverse Complements, Inverted Repeats and Mites

Reverse complements and inverted repeats are among the most compelling pieces of evidence supporting a Codondex variable k-mer computation. The table below, for men1 transcript 702, highlights some of our findings, which have been repeated.

Two main points to keep in mind: data on the left and right of the table were sorted on different column parameters. Columns B-K were sorted according to column L, the count of the Reverse Complement ("RCk-mer") discovered in the k-mers listed in column E. Columns N-X were sorted according to column V, the count of the k-mer found in all k-mers (column E). These are Reverse Complements and Inverted Repeats.






Things get interesting in row 22/28. These are the k-mers and their inverted repeats ordered by column V, the k-mer count. Row 31 identifies the base sequence start and end position of the k-mer:

@D31 sequence start 0...7 end
@F31 sequence start 1...8 end
@H31 sequence start 0...8 end
@J31 sequence start 2...9 end
@L31 sequence start 1...9 end
@N31 sequence start 0...9 end

From this you can see that the last k-mer includes all of the previous k-mers, making them redundant for this purpose.

The Position (in) iScore Vector (PiV) (Row 25 and 30) of each of these k-mers = 11. The PiV for each RCk-mer = 9 or 10. There is no sequence overlap in the k-mers of columns P-X.

We decided to test whether the individual RCk-mer count in each k-mer could be predicted. In this transcript there are 227,475 individual k-mers, which contain a total of 1,563,308 RCk-mers (inverted repeats); the total is larger because multiple shorter RCk-mers may be found in the same longer k-mer. We developed an algorithm to establish a training set that was more than 80% accurate and used Random Forest to predict the counts for the balance.
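How an RCk-mer count per k-mer might be tallied is sketched below. The counting rule is an assumption: for a given k-mer, count the k-mers of the same sequence whose reverse complement occurs inside it. The 12-letter toy sequence, which is its own reverse complement, and the function names are illustrative only; the prediction step with Random Forest is not shown.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_kmers_inside(kmer, all_kmers):
    """Number of k-mers whose reverse complement occurs within `kmer`."""
    return sum(1 for other in all_kmers if reverse_complement(other) in kmer)


# Toy sequence (assumption): it reads the same as its own reverse complement.
seq = "TGTGGGCCCACA"
all_kmers = [
    seq[i:j]
    for i in range(len(seq))
    for j in range(i + 8, len(seq) + 1)
]

print(reverse_complement(seq))
print(rc_kmers_inside(seq, all_kmers))         # the full sequence
print(rc_kmers_inside("TGTGGGCC", all_kmers))  # a single 8-letter k-mer
```

The longest k-mer collects the reverse complements of all shorter ones, which is why the total number of RCk-mers can exceed the number of k-mers.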

Without sequence text, using only iScore data, we accurately predict RCk-mer counts in 99.5-100% of k-mers for any given gene transcript.

Tuesday, November 29, 2016

Position (in) iScore Vector exposes Intron - Protein Relationships?

16,904,172 theoretical k-mers were computed for 12 transcripts of TP53 downloaded from Ensembl. Each transcript's k-mer set contains 1,408,681 entries computed in the same nucleotide order. k-mers were associated with a signature of the transcript's protein. The protein signatures were computed using the same process.
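These totals can be checked with simple arithmetic, assuming the theoretical k-mers are all contiguous sub-sequences longer than 7 letters. A sequence of n letters then has (n-7)(n-6)/2 of them. The sequence length derived below is an inference from that formula, not a figure stated in the post, and the function names are hypothetical.

```python
import math


def kmer_count(n, min_len=8):
    """Number of contiguous sub-sequences of at least min_len letters."""
    m = n - min_len + 1  # number of k-mers of the shortest allowed length
    return m * (m + 1) // 2 if m > 0 else 0


def length_for_count(total, min_len=8):
    """Invert kmer_count; returns None if total is not reachable exactly."""
    m = (math.isqrt(8 * total + 1) - 1) // 2
    n = m + min_len - 1
    return n if kmer_count(n, min_len) == total else None


per_transcript = 1_408_681
print(length_for_count(per_transcript))  # implied sequence length
print(12 * per_transcript)               # total across the 12 transcripts
```

Under this assumption the per-transcript figure corresponds to a sequence of 1,685 letters, and twelve such sets give the 16,904,172 k-mers quoted above.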

As previously described, a ranking based on recurrence was computed for each k-mer using iScore algorithms. We used a vector, which we call PiV, to assign each k-mer a position in the order of the 12 k-mers, one per transcript.

The chart represents, for each transcript the distribution of its k-mers to positions 1-12 based on the iScore and protein signatures. For the 1,408,681 k-mers, at least one position in the vector is dominant in each transcript.