Saturday, February 27, 2010

Bioinformatics: Different methods used to build phylogenetic trees

In bioinformatics there are three major methods for building phylogenetic trees. Each of them has its own strengths and weaknesses, as is the case with every bioinformatics program or method.

These methods are:

1- Distance methods: The algorithm takes the data (sequences) and constructs a matrix of the distances between every pair of sequences. The sequences are then grouped according to their relative distances, and the last step is to construct a tree that matches these data.

2- Parsimony methods: These methods search among the possible phylogenetic trees for the one that requires the minimum number of nucleotide or amino acid substitutions (mutations), so the best tree is the one with the fewest mutations.

3- Likelihood methods: These methods rest on the principle that the best estimate of a parameter is the one that gives the highest probability of obtaining the observed set of measurements. Here the tree plays the role of the parameter and the aligned sequences are the observations.

Bioinformaticians generally consider likelihood methods to be the most accurate, and most researchers use them, but the problem is that they run very slowly because their algorithms are computationally heavy.

Parsimony methods give very good results, but they probably share the same drawback as likelihood methods: they are slow.
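To make the parsimony idea concrete, here is a minimal sketch that counts substitutions with Fitch's algorithm and keeps the candidate tree with the fewest. The four toy sequences, the taxon names A to D, and the three hard-coded candidate topologies are all made up for illustration; real programs search tree space far more cleverly than this.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, substitutions) for one alignment column.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra substitution needed.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch substitution counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical toy alignment (already aligned, no gaps).
toy_alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTAT"}

# The three possible groupings of four taxa, written as nested pairs.
CANDIDATE_TREES = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {label: parsimony_length(tree, toy_alignment)
          for label, tree in CANDIDATE_TREES.items()}
best_tree = min(scores, key=scores.get)
print(scores, "best:", best_tree)
```

With only four sequences there are just three groupings to compare; the number of possible trees grows very quickly with more sequences, which is one reason parsimony searches become slow.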

Distance methods, or distance-based trees, are easy to set up and can be applied in most situations, but they aren't necessarily the most accurate.
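As a rough illustration of the first step of a distance method, the sketch below builds a pairwise distance matrix from sequences that are already aligned. It uses the simple p-distance (the fraction of positions that differ) as an assumption; real programs usually apply a corrected distance, and the sequence names and data here are invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of compared positions that differ; columns with a gap are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(aligned):
    """Build a symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = list(aligned)
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(aligned[x], aligned[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


def closest_pair(matrix):
    """Return the two sequences with the smallest distance (the first to be grouped)."""
    return min(combinations(list(matrix), 2),
               key=lambda pair: matrix[pair[0]][pair[1]])


# Hypothetical toy alignment.
toy_sequences = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTTC",
    "seqC": "ACGAACCTTC",
    "seqD": "TCGAACCTTG",
}
toy_matrix = distance_matrix(toy_sequences)
first_pair = closest_pair(toy_matrix)
print(first_pair, toy_matrix[first_pair[0]][first_pair[1]])
```

A clustering method such as UPGMA or neighbor joining would then repeat the grouping step on this matrix until a full tree is built; that part is left out here.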



How to prepare your sequences for a phylogenetic tree


What Phylogenetic Trees can do for you?



If you have any questions, leave a comment.

Wednesday, February 24, 2010

Bioinformatics: Video tutorial: Using Genomescan to parse genomes (find exons)

In this video tutorial we are going to see how to use Genomescan to parse large DNA sequences and find coding regions, or exons.

As you know, the genes of higher organisms such as vertebrates are more complex than those of other organisms, because they contain coding regions called exons, and between these exons we find non-coding regions called introns.

To predict genes that contain several exons, you have to use very sophisticated algorithms that can locate exons and introns and, by doing so, locate genes.

You can read this post (Open Reading Frame (ORF)) to understand what ORFs are.

You can read this post (Using ORF Finder to locate open reading frames) for basic software that can find ORFs.

You can read this post (Sophisticated ORF prediction with GenMark) for more sophisticated ORF prediction software.

Sunday, February 21, 2010

Bioinformatics: What Is PSI-BLAST?

PSI-BLAST (Position-Specific Iterated BLAST) is a program designed for proteins. It is a BLAST search that uses a PSSM (position-specific scoring matrix).

What is a PSSM?

A PSSM (position-specific scoring matrix) is a matrix of scores built from biological sequence data, and its main role in a PSI-BLAST search is to increase the sensitivity of the results.

A PSI-BLAST search uses a PSSM as the query instead of an individual sequence. The matrix is constructed from a multiple sequence alignment, so each position of the alignment has its own position-specific scores.
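To show what "each position has its own scores" means, here is a small sketch that builds a PSSM from a toy alignment as log-odds scores. It assumes a uniform background frequency and a simple pseudocount, which is much cruder than what PSI-BLAST actually does; the four aligned peptides and all function names are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column.

    Assumption: every residue has the same background frequency.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length against the matrix."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[res] for column, res in zip(pssm, seq))


# Hypothetical aligned peptides.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVI"]
toy_pssm = build_pssm(toy_alignment)

print(round(score_sequence(toy_pssm, "MKVL"), 2))  # resembles the alignment
print(round(score_sequence(toy_pssm, "WWWW"), 2))  # does not resemble it
```

A residue that is common at a position gets a positive score there and a rare one gets a negative score, so the same amino acid can be rewarded at one position and penalized at another.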

How does PSI-BLAST work?

It begins with a normal BLAST search (the better the match, the higher the score). A regular BLAST search will probably miss more distant, and possibly interesting, homologs, so PSI-BLAST next constructs a PSSM (position-specific scoring matrix) from the hits and repeats the search with it until no new matches are found. This finds more distant sequences that you may be interested in.

You can read this post (Different Blast Programs) to understand all types of BLAST programs, including PSI-BLAST, and what each one does.

You can access PSI-BLAST from the EBI website HERE.

Friday, February 19, 2010

Bioinformatics: Perl and BioPerl

As you all know, there are 2 types of bioinformaticians:

1- Those who use ready-made software to analyse biological data.

2- Those who design new software for themselves or for other bioinformaticians.

As we discussed in an earlier post about the best programming language for bioinformatics (HERE), Perl (Practical Extraction and Report Language) is the most powerful because:

1- It is installed or included in almost every Linux distribution.

2- Scripts written in Perl don't require compilation (they are portable from one system to another).

3- It supports regular expressions (a very powerful way to control and manipulate strings).

4- It has built-in support for hashes, or hash tables (associations of values with keys), which makes it especially convenient.

5- A very large number of ready-made modules are available on the internet for anybody to use.

6- It is also available for Windows.

You can read this post about the best book to begin programming with Perl for bioinformatics called Beginning Perl for Bioinformatics.

What is BioPerl?

BioPerl is a project developed by the Open Bioinformatics Foundation. It is a collection of modules that you can use to easily construct Perl scripts that automate bioinformatics tasks.

With BioPerl you don't have to do everything from scratch; you use ready-made modules that suit your needs (what more could you want?).

In my opinion, Perl is the best programming language for bioinformatics. If you have a different point of view, you can suggest it in the comments.

Tuesday, February 16, 2010

Bioinformatics: Sophisticated ORF prediction with GenMark

ORF prediction programs are key to locating ORFs (Open Reading Frames), and once we locate ORFs we have an approximate idea of the location of a gene that codes for a protein.

To read about ORFs or Open Reading Frames click HERE.

In the ORF Finder video tutorial, I showed you how to use the ORF Finder program developed by NCBI to locate ORFs, but I said that this software is very basic, so we can use it only with simple genomes (viral, bacterial, etc.), because this kind of program can identify only about 80 percent of the protein-coding regions that you may be interested in.

You can see ORF Finder video tutorial HERE.

In this video tutorial, I'm going to show you a more sophisticated approach that can predict ORFs in bacteria, viruses, eukaryotes, etc. This software is a family of different programs that use a very sophisticated method.

Sunday, February 14, 2010

Bioinformatics: Linux Vs Windows (What's Better For Bioinformatics)?

People, and bioinformaticians especially, have 2 big choices when it comes to operating systems, Linux and Windows, and there is a huge difference between the two.

Windows:

Windows is known for its simplicity (anyone with basic knowledge can work with Windows): it is user friendly, with a great interface and great media support. But it is less adapted to bioinformaticians' needs, and:

- It's not free.
- Its source is not open to the public.
- Most of its software is not free.
- Automating instructions is less convenient, etc.

I'm not saying that Windows isn't good for you, because I work with it most of the time. But if you are a bioinformatician and you want to write new software or automate some instructions, then Linux is definitely for you; if you want to use ready-made software to analyse your data, you can use Windows.

UNIX (Linux):

Linux is a very powerful operating system, especially for programmers, because it gives you full control over your machine:

- It has a lot of programming tools (languages and interfaces).
- Other free software (web servers, database management systems, visualisation software, text editors, etc.).
- Statistical analysis tools (like R).
- Unix is more stable and runs fast.
- Vast documentation for software (how to use things).

So if you are a bioinformatician who is more attracted to biology (you use bioinformatics software only to analyse your biological data), you can use Windows. If you are a cyber geek who wants to develop new software for bioinformaticians, you can use Linux. I personally prefer Linux, but in the end it's up to you to decide.


If you want to use BioLinux 5.0, you can read a post about it HERE.

If you want to know how to get BioLinux 5.0 working on your computer, you can read this post HERE.

If you have any questions, put them in a comment.

Friday, February 12, 2010

Bioinformatics: OMIM (Online Mendelian Inheritance in Man) Overview

Bioinformatics is key to analyzing genes, proteins, genomes, mutations, etc., and researchers use this information to understand genetic diseases, especially in humans. This is where bioinformatics plays a major role in finding, analyzing, and treating genetic disorders, and for that purpose NCBI has developed OMIM and made it available to the public.

OMIM (Online Mendelian Inheritance in Man) is a database that contains a catalog of human genes and genetic disorders. The database was developed by NCBI (National Center for Biotechnology Information) and is hosted on their server.

OMIM contains information about all known genetic disorders. It links to other resources such as MEDLINE (citations and abstracts) and even to entries in other NCBI databases for the genes responsible for certain diseases.

OMIM has three ways to search for genetic disorders or related information:

1- Through a normal search: by typing a keyword like in the case of most databases.

2- By using the Gene map: where you can browse a table of genes organized by cytogenetic map location.

3- By using the Morbid map: a table of all the genetic disorders featured in OMIM, listed alphabetically.

To access OMIM click HERE.

Any questions are welcome.

Wednesday, February 10, 2010

Bioinformatics: Using ORF Finder to locate open reading frames

In this video tutorial, I'm going to show you how to use the ORF Finder software to locate open reading frames (possible protein-coding genes).

ORF Finder is a program available on the NCBI website. It is designed to locate open reading frames in a given DNA sequence in all six reading frames.
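To give an idea of what "all six reading frames" means, here is a minimal sketch of a six-frame ORF scan. It is not how ORF Finder itself is implemented; it assumes that an ORF starts at ATG and ends at one of the three standard stop codons, and the example sequence and function names are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, offset, min_codons):
    """Yield (start, end) for each ATG...stop stretch in a single reading frame."""
    start = None
    for pos in range(offset, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos
        elif start is not None and codon in STOP_CODONS:
            end = pos + 3
            # Length is counted in codons, stop codon included.
            if (end - start) // 3 >= min_codons:
                yield start, end
            start = None


def find_orfs(dna, min_codons=3):
    """Scan three frames on each strand.

    Coordinates are 0-based and, for the "-" strand, relative to the
    reverse-complemented sequence.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                results.append((strand, offset + 1, start, end, seq[start:end]))
    return results


# Hypothetical sequence with one short ORF on the forward strand.
toy_dna = "CCATGAAATTTGGGTAACC"
for orf in find_orfs(toy_dna):
    print(orf)
```

Each result gives the strand, the frame (1 to 3), the start and end positions, and the ORF sequence. ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.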

To know more about open reading frames, you can read this post HERE.

Note: ORF Finder is basic software, so you can use it for non-complex genes (microbial genomes).

There is more sophisticated software, such as GenMark, that can handle the complexity of the genomes of higher organisms.

Monday, February 8, 2010

Bioinformatics: What is MEDLINE and PubMed?

Citations are very important in research in any field. In bioinformatics they are indispensable, and for this reason the United States National Library of Medicine (NLM) provides biomedical literature online to researchers and students.

Since 1879, the NLM has published the Index Medicus, an index or guide to articles. With the evolution of information technology, Index Medicus has become a database now known as MEDLINE.

What is MEDLINE?

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a huge bibliographic database that contains citations for articles from academic journals covering biology, all branches of medicine, health, molecular biology, biochemistry, microbiology, etc. These data are freely accessible over the internet via PubMed.

What is PubMed?

PubMed is part of the Entrez retrieval system. It is a search engine for accessing MEDLINE citations and abstracts, with links to full-text articles. In addition to MEDLINE citations, which make up most of what PubMed finds, PubMed provides access to other records, including in-process citations and articles from some life science journals that submit full text to PubMed Central and may not have been recommended for inclusion in MEDLINE.

As we said before, PubMed is part of the Entrez retrieval system, which is part of the NCBI website, and you can access it from the NCBI website HERE.
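Besides the web page, PubMed can also be queried from a script. The sketch below only builds a search URL for NCBI's E-utilities ESearch service, which is not covered in this post, so treat it as an assumption-laden example: the helper name and the search term are made up, and you should check NCBI's usage guidelines before sending requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build (but do not send) a PubMed search request."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the query, same syntax as the web search box
        "retmax": max_results,  # how many record IDs to return
        "retmode": "json",      # ask for a JSON answer
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query: articles with "phylogenetic tree" in the title.
example_url = pubmed_search_url("phylogenetic tree[Title]", max_results=5)
print(example_url)
```

Opening the printed URL in a browser (or fetching it from a script) should return the list of matching PubMed IDs.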


You can find more information about MEDLINE and PubMed, tutorials, and quick tours HERE.

If you have any questions, leave a comment.

Saturday, February 6, 2010

Bioinformatics: How to prepare your sequences for a phylogenetic tree

In order to make a phylogenetic tree, we have to do a multiple sequence alignment first, because you can't make a good, accurate tree without an accurate multiple sequence alignment.

To learn how to build a multiple sequence alignment, you can see this video tutorial HERE.

To build a multiple sequence alignment and then a phylogenetic tree, you have to prepare your sequences with some factors in mind:

1- Avoid using sequence fragments: you have to align complete sequences, not just fragments. If you do want to align fragments, use the matching fragment for every sequence in the alignment.

2- Avoid using a lot of sequences: a large number of sequences can make your phylogenetic tree less accurate, because most algorithms can't handle large datasets well, especially software that is used online. It will take a lot of time and hurt the accuracy of your tree.

3- Avoid aligning xenologs: because they are produced by lateral transfer by a virus or bacterium, they don't reflect the original history of your gene. If you want more information about xenologs, you can read this post HERE.

4- Avoid recombinant sequences: a recombinant sequence is the result of two sources (possibly very distinct species), and phylogenetic tree builders can't handle the history of two distinct species at the same time.

5- Add a distant sequence to your alignment: it has to be similar but have diverged a long time ago, because it will serve as the outgroup that locates the first common ancestor (the root) of your phylogenetic tree.

6- Don't depend on guide trees: on the EBI server, for example, when you make a multiple sequence alignment with ClustalW, a guide tree is included in the results. Don't use this tree, because it's not a phylogenetic tree; it's a guide tree that ClustalW uses to assemble the multiple sequence alignment. If you use it in place of a phylogenetic tree, it will give you false results.
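The first point in the list above (avoid fragments) is easy to check with a few lines of code before you align anything. The sketch below flags sequences that are much shorter than the typical length in your dataset; the 80 percent cutoff and the sequence names are arbitrary assumptions, so adjust them to your data.

```python
from statistics import median


def flag_fragments(sequences, min_fraction=0.8):
    """Return the names of sequences shorter than min_fraction of the median length.

    Gap characters ("-") are ignored, so aligned or unaligned input both work.
    """
    lengths = {name: len(seq.replace("-", "")) for name, seq in sequences.items()}
    typical = median(lengths.values())
    return sorted(name for name, n in lengths.items() if n < min_fraction * typical)


# Hypothetical dataset: three full-length sequences and one partial one.
toy_dataset = {
    "human": "M" * 100,
    "mouse": "M" * 98,
    "chicken": "M" * 102,
    "partial_clone": "M" * 40,
}
suspects = flag_fragments(toy_dataset)
print(suspects)
```

Sequences reported here are candidates to replace with their full-length versions, or to remove, before building the alignment.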

Any questions are welcome.

Thursday, February 4, 2010

Bioinformatics: Different types of homologous genes


The main purpose of phylogeny is to pick what we call homologous genes and compare them in order to construct a phylogenetic tree of their history according to their similarities.

Homologous genes are genes that derive from a common ancestor. To understand the types of homologous genes and how exactly they arise, we have to know a couple of things:

* Speciation: the phenomenon in which a common ancestor gives rise to two subgroups that slowly drift away from their common genetic makeup to become distinct species.

* Duplication: within the genome of a single species, a gene is duplicated. In this case, one of the copies may keep the same function while the other may change.


There are three types of homologous genes:

1- Orthologs: orthologs are 2 genes separated by speciation. Generally, this means the 2 genes exist in 2 different species but descend from a single gene in their common ancestor.

2- Paralogs: paralogs are 2 genes separated by duplication. This means that a single gene in one genome was duplicated into 2 or more genes.

3- Xenologs: xenologs result from lateral transfer between 2 species or organisms, that is, a transfer of DNA from one species to another, such as the transfer of a DNA sequence from a virus or bacterium to another species.


In bioinformatics, collecting these genes from BLAST searches and aligning them in a multiple sequence alignment is the main way to construct a phylogenetic tree.

Any questions are welcome.

Tuesday, February 2, 2010

How studying rRNA can help us study evolution in bioinformatics

Many of you are asking how scientists have made an approximate tree of life that includes almost all discovered species. This is the answer:

In evolutionary bioinformatics, scientists have tried to find a gene that exists in all living organisms, and a very appropriate gene in this case is the rRNA-coding gene.

rRNA, or ribosomal RNA, is the central component of the ribosome, which is where proteins are manufactured in all living organisms. It interacts with tRNA (transfer RNA) to produce a protein from amino acids and mRNA (messenger RNA).

So the main criterion for studying evolution is finding a conserved gene that exists in all living organisms. When scientists discover a new bacterium, for example, one of the main things they do is sequence its rRNA to identify its taxonomic group and estimate rates of species divergence.

As rRNAs have played and are playing a major role in evolutionary bioinformatics, scientists and researchers have built specialized databases, such as RDP and the European database, that store thousands of rRNA sequences.

If you have any questions, leave a comment.