Characterising an unknown protein sequence identified as a predicted protein [Physcomitrella patens] >gi|162695322|gb|EDQ81666.1| Seq ID: gi|168001481|ref|XP_001753443.1
A protein sequence was provided for analysis with a range of bioinformatics tools. It was found to be a predicted protein of Physcomitrella patens, an important bryophyte used in genomic studies such as reverse genetics and in molecular farming, where knockout mosses are produced for use in biopharmaceutical production. Because it is a predicted protein, its function has yet to be determined, so multiple bioinformatics tools were employed to understand this protein and its place in the genome of the moss Physcomitrella patens. BLAST, PSORT, Phyre2, ClustalW2, Jalview and several other tools were used to gather further information about this protein.
The unknown protein sequence provided was found to be a predicted protein of Physcomitrella patens. Physcomitrella patens is a bryophyte and is known to be one of the few multicellular organisms with a high efficiency of homologous recombination (Schaefer and Zryd, 1997). This allows a reverse genetics approach in which an exogenous DNA sequence is targeted to a chosen genomic position in order to create knockout mosses. The approach is an important tool for studying the function of genes and can further be used to understand the molecular evolution of plants. The moss shows alternation of generations but has a dominant gametophyte stage, which makes it well suited to genetic studies. The first knockout study using this plant was reported in 1998 (Strepp et al., 1998). P. patens has been increasingly used in biotechnology, where moss genes have been identified that have connections with the improvement of crops as well as human health. It has also been important in studies relating to the safe production of biopharmaceuticals in moss bioreactors (Reski and Frank, 2005). The P. patens genome has 500 megabase pairs organised into a total of 27 chromosomes; it was assembled into 2106 scaffolds and sequenced completely in 2006. Using multiple gene knockouts, Physcomitrella plants were engineered to lack a specific glycosylation step. These knockout mosses are used in molecular farming to produce biopharmaceuticals (Koprivova et al., 2004). The ability to use RNA interference and gene targeting has allowed the study of gene function and, subsequently, of protein function. Together with bioinformatics tools for functional and comparative genomics, including expressed sequence tags, this has advanced genetic analysis in the moss.
P. patens is easily cultured and, as mentioned above, exists predominantly in its haploid state, which allows experimentation with techniques that have already been used in yeasts and other microorganisms. It has a simple developmental pattern, with only a few tissues made up of a small number of cell types, and it lacks vascular tissues and true roots, leaves and stems. Despite this, signalling pathways that have been identified in angiosperms have also been noted in this moss. For example, auxin, cytokinin and abscisic acid are all found in the pathways associated with the development of phenotypes of P. patens. Photomorphogenic pigments such as phytochrome and cryptochrome are also linked to these pathways. Because a large number of genes in this genome encode proteins with no known function, it is very likely that the functions of those genes and proteins could be discovered with the various bioinformatics tools at our disposal today. In this study, the protein sequence provided was analysed using various bioinformatics methods in order to find its likely function, a possible structure and a model for that structure, and to establish whether a hypothetical structure has already been determined by a structural genomics consortium.
The sequence provided for this investigation is as follows
In order to gain information about the given protein sequence, multiple bioinformatics tools were employed. The first tool used was BLAST (specifically, protein BLAST), which was used to find the identity of the unknown protein sequence. Subsequently, UniProt was used to investigate the protein function, and the sequence was also analysed against the Prosite and Pfam databases. The protein identified, predicted protein [Physcomitrella patens] >gb|EDQ81666.1| NCBI ref: XP_001753443.1, was aligned with its closest sequence, and multiple alignments to this query protein were performed, using ClustalW2. PSORT was used to identify the likely subcellular location of the protein. A Tax BLAST report was generated to identify the taxonomy of the protein and its closely related organisms. The Phyre2 web server was used to obtain a possible 3D model of the protein, and the secondary structure of the protein was obtained from the same Phyre2 search. The results are explained and are supported by the literature.
Results and Discussion
Analysis of the given protein sequence using BLAST showed that it corresponds to a predicted protein of Physcomitrella patens. The full identification given in the results page is predicted protein [Physcomitrella patens] >gi|162695322|gb|EDQ81666.1| Seq ID: gi|168001481|ref|XP_001753443.1. The identity value of 72% suggests that the sequence given is not the full sequence of the protein. The e-value associated with this hit is 2e-71, which is low and indicates that the match is unlikely to have arisen by chance.
Figure 1: (A) Graphical summary of the distribution of BLAST hits on the unknown sequence entered. (B) Several sequences that produced similar alignments with the query sequence, ordered by score. (C,D) Sequence alignments of the query sequence performed via ClustalW and BLAST, showing a matching identity value of 72% and a low e-value of 2e-71.
Sequence Similarities and Multiple Alignments
PSI-BLAST, which uses a position-specific weight matrix, identified multiple sequences that are similar to the query protein sequence. The two top hits were the predicted protein of Physcomitrella patens, Seq ID: gi|168001481|ref|XP_001753443.1 (the identified one), and its closest similar sequence, predicted protein [Physcomitrella patens] Seq ID: gi|168035394|ref|XP_001770195.1, both originating from the moss Physcomitrella patens. Both had similar identity values, with the former scoring 72% identity with an e-value of 2e-71 and the latter scoring 74% identity with an e-value of 2e-53. It can be concluded that the query sequence corresponds to the first predicted protein, as 124 out of 172 aligned residues match (72%), whereas only 95 residues match in the second sequence. The second protein in the similarity match, Seq ID: gi|168035394|ref|XP_001770195.1, may therefore be more distantly related to the query protein. A multiple sequence alignment was performed in order to test this hypothesis.
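The identity figure quoted above is simple arithmetic: identical residues divided by the number of aligned positions. The following is a minimal sketch of that calculation, assuming a pairwise alignment is available as two equal-length strings with "-" for gaps. The function name and the toy alignment are hypothetical and are not the real query or hit sequences.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent identity over all aligned columns (gap columns count in the length)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A column is identical only if both residues match and neither is a gap.
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


# Hypothetical toy alignment, for illustration only.
toy_query = "MKT-AYIAK"
toy_subject = "MKTQAYLAK"
print(round(percent_identity(toy_query, toy_subject), 1))  # 7 of 9 columns -> 77.8

# The figure reported in the text: 124 identical residues over 172 positions.
print(round(100.0 * 124 / 172, 1))  # -> 72.1
```

Counting gap columns in the denominator is one common convention; other tools divide by the shorter sequence length, so identity values from different programs are not always directly comparable.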
Figure 2: (A) PSI-BLAST results showing the similarity of hits to the query sequence. (B) ClustalW2 results showing the alignment between predicted protein (Seq ID: gi|168001481|ref|XP_001753443.1) and predicted protein (Seq ID: gi|168035394|ref|XP_001770195.1). (C) Jalview display of the alignment between the two sequences.
The alignment between the two proteins was visualised with Jalview, as shown above. However, the percentage sequence identity tree could not be quantified in Jalview, so NCBI was used to produce the distance tree shown below. This shows that the query protein is the predicted protein with Seq ID: gi|168001481|ref|XP_001753443.1, as found previously. A multiple alignment tool was then used to look for proteins with similarities to the predicted protein. Apart from predicted proteins in Physcomitrella patens, similarities were found with hypothetical protein SELMODRAFT_442734 [Selaginella moellendorffii] >gb|EFJ23800.1|. Selaginella moellendorffii is a lycophyte that, alongside Physcomitrella patens, is an important model organism in comparative genomics. Lycophytes belong to an ancient vascular plant lineage dating back 400 million years (Shakirov and Shippen, 2012). They have microphylls instead of leaves and represent an important stage in the evolution of plants. The similarity between the proteins also suggests that the two plants share a common ancestor.
Figure 3: (A) Distance tree showing the position of the query protein and highlighting other predicted proteins in Physcomitrella patens. (B) Constraint-based alignment results showing several other proteins that have similarities to the query protein.
Lineage, Related Organisms and Structural homologs
In order to further explore the possibility of common ancestors and shared sequence, a Tax BLAST report was obtained from the BLAST page of the query sequence. The lineage report shows the relationship between the organisms that appear in the BLAST search according to their taxonomic classification. When a report was generated for the predicted protein of Physcomitrella patens Seq ID: gi|168001481|ref|XP_001753443.1, Physcomitrella patens had 4 hits, followed by Selaginella moellendorffii with 2 hits. The next hits were a histone protein in Desulfococcus oleovorans and a pyridoxal-phosphate dependent enzyme in Oceanibulbus indolifex HEL-45. As seen, this report focuses on the organism that yields the strongest BLAST hit. A search was also run on Propsearch, which looks for the putative protein family when previous alignment methods have not yielded adequate results. This approach calculates the molecular weight, bulky residue content, average hydrophobicity, average charge and the content of a few dipeptide groups. The input sequence is transformed into a vector of these properties in order to quantify the Euclidean distance between the query protein and the sequences in the database.
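The idea of turning a sequence into a vector and comparing vectors by Euclidean distance can be sketched as follows. This is an illustration only: it uses plain amino-acid composition as the feature vector, which is an assumption made for simplicity and not the property set used by Propsearch. The function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Fraction of each of the 20 standard amino acids (assumed feature set)."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, not taken from the study.
query = composition_vector("MKTAYIAKQRQISFVK")
candidate = composition_vector("MKSAYLAKQRELSFIK")
print(euclidean_distance(query, candidate))
```

A smaller distance means the two sequences have more similar overall properties, which is why such a search can still suggest a protein family when alignment-based methods find no clear homologue.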
Figure 4: (A) Tax BLAST report showing the lineage and organism information of similar proteins. (B)
Location of predicted protein of Physcomitrella patens
PSORT was used to identify the subcellular location in which this protein primarily resides. The result showed that this predicted protein is most likely located in the nucleus, for which it scored 9.0. It may also be found in the cytoplasm, with a score of 3.0, and is least likely to be found in the mitochondria, with a score of 2.0. This result is supported by Nishiyama et al. (2003), who state that a possible predicted protein is involved in the nuclear manipulation of the cells. Lehtonen et al. (2013) explain the importance of the predicted protein in the nucleus and cytoplasm and report that it is present in several parts of the moss Physcomitrella patens, namely the roots, stems and leaves. This protein is also involved in the development of the plant itself. Any stress affecting this protein can result in phenotypic change, leading to overall protein dysfunction.
Mass of the protein
The masses of the peptides of the predicted protein in Physcomitrella patens were calculated with PeptideMass. PeptideMass cleaves a protein sequence, taken either from the UniProt (Swiss-Prot and TrEMBL) knowledgebase or entered manually, and reports the masses of the generated peptides. It can also compute the theoretical isoelectric point. The results are shown below; the enzyme trypsin was used to perform the cleavage and no missed cleavages were allowed. The average mass is reported as 13811.32 and the monoisotopic mass as 13802.62. With the calculation parameters employed, 62.9% of the sequence was covered.
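The kind of calculation described above can be sketched in a few lines. This is a minimal illustration, not the PeptideMass implementation: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and reports neutral monoisotopic masses from standard residue masses. The function names and the toy sequence are hypothetical, and values reported by the real tool may refer to a different ion form, so the numbers are not expected to match its output.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus water).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added back per free peptide


def tryptic_digest(sequence):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    seq = sequence.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of a peptide made of standard residues."""
    return sum(MONOISOTOPIC[residue] for residue in peptide) + WATER


# Hypothetical toy sequence, not the protein analysed in this study.
for peptide in tryptic_digest("AAKGGRPLLKC"):
    print(peptide, round(monoisotopic_mass(peptide), 4))
```

Sequence coverage below 100%, as reported above, typically arises because very short or very long peptides fall outside the mass range that is displayed.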
Figure 5: (A) The sequence entered and its general information, including information on cysteine and methionine. (B) The molecular weight in average and monoisotopic form.
Structure of predicted protein [Physcomitrella patens]
The sequence for the query protein was analysed using Phyre2, a web-based service for protein structure prediction, because a reliable 3D protein model could not be obtained with PSI-BLAST. The figure below illustrates only the DNA/RNA-binding 3-helical bundle, with a percentage identity of 25%, where 31 residues have been modelled with a confidence level of 9.7%. A study carried out by Yang et al. (2011) supports the predicted 3D structure.
I was able to obtain a secondary structure for this protein from Phyre2 as well. The prediction assigns each residue to one of three states: α-helix, β-strand or coil. Green helices represent α-helices, blue arrows indicate β-strands and faint lines indicate coil. The 'SS confidence' line indicates the confidence in the prediction, with red being high confidence and blue low confidence. In this protein only green helices are seen in the structure, which means that only α-helices are present. This corresponds to the reported value of 81% of the secondary structure being composed of α-helices. In the middle of the sequence the SS confidence line is a combination of green, yellow and orange, indicating that the helices in the middle have a weaker prediction. About 40% of the sequence is predicted to be disordered. The region in which helix prediction is at its weakest corresponds to the strongest disorder prediction, as can be observed in the figure below. Disordered regions are often functionally important, but because they are disordered it is unwise to predict their structure. Secondary structure prediction usually has an accuracy of about 80%, and this accuracy can be achieved only if a considerable number of diverse sequence homologues are detectable in the database used.
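Percentages such as the helix content quoted above can be derived from a per-residue secondary structure assignment. The following is a minimal sketch, assuming the prediction has been exported as a string with one letter per residue (H for α-helix, E for β-strand, C for coil). That string format, the function name and the toy prediction are assumptions for illustration and are not taken from the Phyre2 output of this study.

```python
from collections import Counter


def secondary_structure_composition(ss_string):
    """Percentage of residues in each state: H (helix), E (strand), C (coil)."""
    states = ss_string.upper()
    if not states:
        raise ValueError("secondary structure string is empty")
    unknown = set(states) - set("HEC")
    if unknown:
        raise ValueError(f"unexpected state letters: {sorted(unknown)}")
    counts = Counter(states)
    return {state: 100.0 * counts[state] / len(states) for state in "HEC"}


# Hypothetical toy prediction of 10 residues.
print(secondary_structure_composition("CHHHHHHHHC"))
# -> {'H': 80.0, 'E': 0.0, 'C': 20.0}
```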
Figure 6: (A) Phyre2 predicted protein model. (B) Secondary structure of the modelled protein.
Several protein-protein interactions of the predicted protein [Physcomitrella patens] were identified using the STRING tool. Despite the many interactions, I was unable to identify the function of the other proteins that interact with the query protein. This is because all the interactions are with hypothetical proteins whose functions have not yet been discovered or understood. The diagram below shows the protein-protein interactions.
Figure 7: STRING-based protein-protein interaction diagram showing interactions between the query protein (predicted protein [Physcomitrella patens]) and several other hypothetical proteins belonging to Physcomitrella patens.
Based on the results compiled above, it is hard to predict the function of the protein, as it is only a predicted protein. However, studies have been carried out on the genes responsible for the production of this protein. A paper that studied the use of plants as factories for the production of therapeutic proteins noted that in Physcomitrella patens the N-glycosylation pattern required modification because of the immunogenicity of the sugar residues specific to this plant. Certain genes were therefore disrupted using homologous recombination. In the double knockout, the corresponding genes lacked transcripts, and the resulting lack of certain enzymatic residues indicated that those proteins were what caused allergies in biopharmaceuticals. The modifications made to the genes changed the functionality of the protein produced, and this is key to elucidating proteins of plant origin in the future. It is possible that this approach could be taken into consideration in further elucidating the predicted protein of Physcomitrella patens that was investigated in this study (Rensing et al., 2008). In another study, Physcomitrella patens plants were transformed so as to introduce mutations into all of the expressed genes. A moss cDNA library was mutagenised in E. coli using the transposon Tn1000. The resulting gene disruption library was then used on Physcomitrella patens, where homologous recombination of the mutagenised cDNA should target the insertions to the expressed genes (Lehtonen et al., 2013). This approach could also be used to gather further information about the predicted protein of Physcomitrella patens that was identified in this paper.
The query sequence was identified as predicted protein [Physcomitrella patens] >gi|162695322|gb|EDQ81666.1| Seq ID: gi|168001481|ref|XP_001753443.1. Based on the bioinformatics methods employed, I have found that this protein is closely related to another predicted protein [Physcomitrella patens] Seq ID: gi|168035394|ref|XP_001770195.1, as well as to several predicted proteins in Selaginella moellendorffii. Jalview confirmed that analysis. The taxonomic results also showed that this protein is related to a histone protein in Desulfococcus oleovorans. Using PSORT, it was found that this protein is predominantly located in the nucleus of the cell, a result that is supported by the literature. The 3D structure of the protein was modelled with Phyre2, and the secondary structure was shown to be made up of α-helices. Despite the wealth of information gained, it is still not enough to understand the function of the protein, because the protein is only a predicted one. Further hands-on research needs to take place before the function can be known and understood.