Bioinformatics made easy: December 2009

Thursday, December 31, 2009

Bioinformatics: Phylogeny: What Can Phylogenetic Trees Do for You?

As you know, the purpose of phylogeny is to reconstruct the history of life in order to understand it better. Its main aim is to group organisms according to their similarities.

Genes mutate and change over time. This is what we mean by EVOLUTION, and it is why there are so many species (diversity) on earth. This is where phylogenetics becomes indispensable to Bioinformatics.

Phylogenetics is the part of phylogeny that relies on comparing the genes of many species to find out which species are most closely related to one another, and to construct a tree of these species.
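
As a rough illustration of the comparison step, the sketch below computes the proportion of differing sites (the p-distance) between already-aligned gene sequences and reports the closest relative of a query species. The species names, the sequences, and the function names are invented for this example, and real tree-building programs such as PHYLIP use more sophisticated methods than this.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned, gap-free positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def closest_relative(query, aligned):
    """Name of the sequence with the smallest p-distance to the query."""
    others = [name for name in aligned if name != query]
    return min(others, key=lambda name: p_distance(aligned[query], aligned[name]))


# Hypothetical aligned gene fragments (invented for this example).
aligned_genes = {
    "human": "ATGGCCAAGTTC",
    "chimp": "ATGGCCAAGTTT",
    "mouse": "ATGGCTAAATTT",
    "yeast": "ATGTCTCAATAT",
}

print(closest_relative("human", aligned_genes))
```

With these made-up fragments the smallest distance from "human" is to "chimp", which is the kind of answer a distance-based tree would also reflect.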

To better understand phylogenetic trees, you can read this post HERE.
To learn how to use Bioinformatics tools for constructing phylogenetic trees (such as PHYLIP), you can read this post HERE.

Phylogenetic trees can help with:

1- Determining which organisms are most closely related to the one you're studying.

2- Inferring the function of a gene by looking at its relatives (orthologous genes).

3- Determining which gene family a gene belongs to.

4- Finding out about the origin of the gene you're studying.

Monday, December 28, 2009

Bioinformatics: Main Applications of Multiple Sequence Alignment

To understand what a Multiple Sequence Alignment is, you can read an introductory post HERE.

Multiple Sequence Alignment is one of the most useful tools in Bioinformatics. It helps in almost every application of the field (predicting protein structure, predicting protein function, phylogenetic analysis, etc.).

The main applications of Multiple Sequence Alignment are:

1- Structure Prediction: a Multiple Sequence Alignment can greatly improve the prediction of protein or RNA secondary structure, and sometimes it even helps with the 3D structure.

2- Protein Family: a Multiple Sequence Alignment can help you decide whether or not your protein is a member of a known protein family.

3- Pattern Identification: by looking at conserved regions or sites, you can identify which region is likely to be responsible for a functional site.

4- Domain Identification: from a Multiple Sequence Alignment you can extract profiles and use them to search databases.

5- DNA Regulatory Elements: you can use Multiple Sequence Alignments to locate DNA regulatory elements such as binding sites.

6- Phylogenetic Analysis: by carefully picking related sequences, you can reconstruct a tree from the sequences that you used in the Multiple Sequence Alignment (you can use the PHYLIP package, and you can find a post about it here).
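
To make the idea of a profile (point 4) more concrete, here is a minimal sketch that turns an alignment into per-column residue frequencies and scores a candidate sequence against them. The aligned fragments and the function names are invented for illustration. Real profile-search tools use log-odds scores, pseudocounts, and gap handling, none of which is shown here.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def profile_score(profile, candidate):
    """Average frequency of the candidate's residues in the profile columns."""
    if len(candidate) != len(profile):
        raise ValueError("candidate must match the profile length")
    total = sum(col.get(res, 0.0) for col, res in zip(profile, candidate))
    return total / len(profile)


# Hypothetical aligned protein fragments (invented for this example).
alignment = ["HEAGAWGHEE", "HEAGAWGHEA", "HDAGAWGKEE"]
profile = build_profile(alignment)

print(profile_score(profile, "HEAGAWGHEE"))  # high score: fits the family
print(profile_score(profile, "PPPPPPPPPP"))  # 0.0: no residue matches any column
```

A sequence that scores close to 1 agrees with the alignment at most positions, which is the intuition behind searching a database with a profile instead of a single sequence.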

Because Multiple Sequence Alignments play a major role in Bioinformatics, you can use them almost anywhere. But nothing is perfect or 100% accurate, so you have to choose your sequences very carefully to avoid meaningless results.

You can access the EBI ClustalW program from HERE, to do a Multiple Sequence Alignment.

Comments are welcome.

Friday, December 25, 2009

Bioinformatics: Multiple Sequence Alignment: ClustalW

What is multiple sequence alignment:

A multiple sequence alignment is an alignment of three or more sequences (protein or nucleic acid, i.e. DNA or RNA).

What's ClustalW:

ClustalW is a large and complex program for multiple sequence alignments.

Why use ClustalW:

As we said before, ClustalW produces multiple sequence alignments, which are very important in bioinformatics and especially in the study of sequences. By aligning several protein sequences, for example, we can extract this very useful information:

1- Conserved sequence regions.

2- Which sites are likely to be active sites and which are not.

3- Predicted protein function.

4- Help in predicting protein structure.

5- The protein family, or new members of a family.

6- Trees that show how the proteins are related (distances), etc.
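
As a small illustration of the first point, the sketch below scans a finished alignment for columns where every sequence carries the same residue, then groups them into conserved regions. It assumes the alignment is already available as equal-length strings with "-" for gaps. The fragments, the function names, and the `min_length` parameter are invented for this example and are not part of ClustalW.

```python
def conserved_columns(alignment):
    """Indices of columns where every sequence has the same non-gap residue."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] != "-"
    ]


def conserved_regions(alignment, min_length=3):
    """Runs of consecutive conserved columns as (start, end) pairs, end exclusive."""
    regions, run_start, previous = [], None, None
    for i in conserved_columns(alignment):
        if run_start is None:
            run_start = i
        elif i != previous + 1:
            # The current run has ended; keep it only if it is long enough.
            if previous + 1 - run_start >= min_length:
                regions.append((run_start, previous + 1))
            run_start = i
        previous = i
    if run_start is not None and previous + 1 - run_start >= min_length:
        regions.append((run_start, previous + 1))
    return regions


# Hypothetical aligned protein fragments (invented for this example).
alignment = ["MKTAYIAKQR", "MKSAYIAKQR", "MRTAYIA-QR"]

print(conserved_columns(alignment))  # [0, 3, 4, 5, 6, 8, 9]
print(conserved_regions(alignment))  # [(3, 7)]
```

Here columns 3 to 6 ("AYIA") form the only conserved run of at least three residues, which is the kind of region you would inspect for a possible functional site.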

You can find ClustalW at EBI and you can access it from HERE.

Monday, December 21, 2009

Bioinformatics Tutorials & Lessons: Using TMHMM method to locate Transmembrane helices in Protein sequences

TMHMM is an abbreviation of Transmembrane Hidden Markov Model. A hidden Markov model is a statistical model, and you can read about it in this Wikipedia article HERE.

TMHMM is a method for predicting transmembrane helices in a protein sequence. You can access the TMHMM server from HERE.

This video shows how to use the TMHMM server to predict transmembrane helices in a protein sequence.
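
To give a feel for what a hidden Markov model does, here is a toy two-state example decoded with the Viterbi algorithm. Everything in it is an assumption made for illustration: the two states, the hydrophobic residue set, and all the probabilities were chosen by hand and are NOT the TMHMM model, which has a much richer state architecture and trained parameters.

```python
import math

STATES = ("O", "M")  # O = non-membrane, M = membrane helix
HYDROPHOBIC = set("AILMFVWC")

# Toy parameters chosen by hand for illustration; they are NOT TMHMM's.
START = {"O": 0.5, "M": 0.5}
TRANSITION = {"O": {"O": 0.9, "M": 0.1}, "M": {"O": 0.1, "M": 0.9}}
EMISSION = {  # probability of emitting a hydrophobic or polar residue
    "O": {"hydrophobic": 0.3, "polar": 0.7},
    "M": {"hydrophobic": 0.8, "polar": 0.2},
}


def residue_class(residue):
    return "hydrophobic" if residue in HYDROPHOBIC else "polar"


def viterbi(sequence):
    """Most probable state path for the sequence under the toy model."""
    if not sequence:
        return ""
    first = residue_class(sequence[0])
    scores = {s: math.log(START[s]) + math.log(EMISSION[s][first]) for s in STATES}
    backpointers = []
    for residue in sequence[1:]:
        cls = residue_class(residue)
        new_scores, pointers = {}, {}
        for state in STATES:
            # Best previous state to arrive from, in log space.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][state])
            )
            new_scores[state] = (
                scores[best_prev]
                + math.log(TRANSITION[best_prev][state])
                + math.log(EMISSION[state][cls])
            )
            pointers[state] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("DDDDLLLLLLLLDDDD"))  # OOOOMMMMMMMMOOOO
```

The model never sees the labels directly. It infers the hidden "membrane" stretch from the run of hydrophobic residues, which is the basic idea behind predicting helices with a hidden Markov model.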

Friday, December 18, 2009

Bioinformatics: PHYLogeny Inference Package (PHYLIP)

PHYLIP, the PHYLogeny Inference Package, is a package that contains many programs for inferring phylogenies or, in simple words, constructing phylogenetic (evolutionary) trees.

The package contains a lot of useful programs and, above all, it's free. You can get it from its website HERE.

The programs in the PHYLIP package can estimate phylogenies from protein or nucleic acid sequences with different methods (parsimony, maximum likelihood, etc.).
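
Before running these programs you need your aligned sequences in PHYLIP's input format. The sketch below writes the strict sequential layout: a header line with the number of sequences and the alignment length, followed by one line per sequence with the name padded to ten characters. The function name and the example sequences are invented here, and you should check the PHYLIP documentation for the exact options each program expects.

```python
def to_phylip(sequences):
    """Format aligned sequences (name -> sequence) as strict sequential PHYLIP."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    lines = [f" {len(sequences)} {lengths.pop()}"]
    for name, seq in sequences.items():
        if len(name) > 10:
            raise ValueError(f"name too long for strict PHYLIP: {name}")
        # Names occupy exactly ten characters, padded with spaces.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical aligned sequences (invented for this example).
example = {"human": "ATGGCCAAGT", "chimp": "ATGGCCAAGT", "mouse": "ATGGCTAAAT"}
print(to_phylip(example), end="")
```

The resulting text can be saved to a file and given to a PHYLIP program as its input file.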

It was, and still is, very helpful for bioinformaticians, phylogeny scientists, and students, as it provides a complete environment for phylogeny.

You can read the documentation file from HERE.

Questions are welcome.

Wednesday, December 16, 2009

Bioinformatics Tutorials & Lessons: using BLAST to search for similarities

BLAST (Basic Local Alignment Search Tool) is an algorithm, and a program, that identifies nucleic acid or amino acid sequences similar to a query sequence.

Let's say that you have recently sequenced a gene from the mouse genome and you know nothing about this gene except its sequence. This is where BLAST comes in: it searches databases for sequences similar to yours. From the hits you can find information such as the gene or protein family, the organism, related sequences, and function, which will help you identify your sequence.
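
BLAST starts by looking for short words shared between the query and the database sequences. The toy sketch below illustrates only that seeding idea, counting exact shared words of length `k`. It is not the BLAST algorithm itself, which also uses a scoring matrix, extends the seeds into local alignments, and reports statistics such as E-values. The sequences and function names are invented for this example.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every k-letter word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def seed_hits(query, database, k=3):
    """Count exact k-letter words shared between the query and each database entry."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = {}
    for name, subject in database.items():
        index = index_words(subject, k)
        hits[name] = sum(len(index[word]) for word in query_words if word in index)
    return hits


# Hypothetical query and database (invented for this example).
query = "MKTAYIAKQR"
database = {
    "seqA": "GGMKTAYIAKQRLL",
    "seqB": "MKSAYLAKQR",
    "seqC": "PPPPGGGGSS",
}

hits = seed_hits(query, database)
for name in sorted(hits, key=hits.get, reverse=True):
    print(name, hits[name])
```

With these sequences, seqA shares every word with the query, seqB shares only a couple, and seqC shares none, so seqA would be the candidate worth aligning in full.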

To learn more about BLAST, you can read the Wikipedia article HERE.

You can read the BLAST help page from HERE or the Documentation from HERE.

This is a video tutorial that demonstrates how to use BLAST to search for protein sequences similar to my sequence.
(I used BLAST against the SwissProt database.)

Sunday, December 13, 2009

Bioinformatics: Open Reading Frame (ORF)

An Open Reading Frame (ORF) is a stretch of DNA located between a start codon (initiation codon) and a stop codon (termination codon).

ORF finder programs and algorithms are used to locate a gene in a given sequence by locating the initiation codon and the termination codon.

Initiation and termination codons can also occur by chance, which can give false results, but in general the sequence between such chance codons is short. So, to be confident that a candidate is a real ORF, we have to make sure that the sequence between the initiation and termination codons is long enough to represent a GENE.

A DNA sequence can be read in SIX different reading frames, 3 on each strand (because every codon has 3 bases).
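
The scan described above can be sketched as follows: read each of the six frames codon by codon, open a candidate at the first ATG, close it at the next stop codon, and keep it only if it is long enough. This is a minimal illustration and not how GeneMark or any particular ORF finder works. The function names, the `min_codons` parameter, and its default of 100 are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=100):
    """ORFs in all six frames as (strand, frame, start, end) tuples.

    Positions are 0-based, end is exclusive and includes the stop codon.
    On the "-" strand they refer to the reverse-complemented sequence.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Length in codons, not counting the stop codon.
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs


# Tiny invented sequence, so the length threshold is lowered for the demo.
print(find_orfs("CCATGAAACCCGGGTAGTT", min_codons=3))
# [('+', 3, 2, 17)]
```

Raising `min_codons` filters out the short ORFs that appear by chance, which is exactly the length check described above.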

In eukaryotic DNA, a gene may be interrupted by intervening sequences. These intervening sequences are called INTRONS, and they do not code for protein.

Example: suppose we have an mRNA sequence and read it in its three frames:

1st reading frame: AGUAAGAUGGCGAAUCUU
2nd reading frame: - GUAAGAUGGCGAAUCUU
3rd reading frame: - - UAAGAUGGCGAAUCUU

We can see that the first reading frame contains an initiation codon (AUG), the 2nd contains neither an initiation nor a stop codon, and the 3rd starts with a stop codon (UAA).

So if we had to choose the correct reading frame, we would choose the first one.
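
The same example can be checked with a few lines of code. The sketch below splits the mRNA into codons for each of the three frames and reports whether a start or stop codon appears in frame. The function names are invented for this illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
START_CODON = "AUG"


def codons_in_frame(rna, frame):
    """Complete codons of the RNA when reading starts at offset `frame` (0, 1 or 2)."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


def summarize_frames(rna):
    """For frames 1 to 3: the codons, and whether a start or stop codon is present."""
    summary = {}
    for frame in range(3):
        codons = codons_in_frame(rna, frame)
        summary[frame + 1] = {
            "codons": codons,
            "has_start": START_CODON in codons,
            "has_stop": any(codon in STOP_CODONS for codon in codons),
        }
    return summary


for frame, info in summarize_frames("AGUAAGAUGGCGAAUCUU").items():
    print(frame, " ".join(info["codons"]), info["has_start"], info["has_stop"])
```

Frame 1 reads AGU AAG AUG GCG AAU CUU and is the only frame with AUG in frame, while frame 3 begins with UAA, which matches the reasoning above.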

There are many programs dedicated to ORF detection. GeneMark is one of the best: it is a family of gene prediction programs developed at Georgia Institute of Technology, Atlanta, Georgia, USA. You can access it from HERE.

Questions are welcome.

Friday, December 11, 2009

Bioinformatics: Bio Linux 5.0

Bio Linux 5.0 is a project released in January 2009 for students and researchers in the field of Bioinformatics. It's a Linux environment (Ubuntu) plus more than 500 Bioinformatics programs, with full documentation for each program.

In other words, Bio Linux 5.0 is an easy-to-use Bioinformatics workstation that is powerful and easy to configure.

Bio Linux 5.0 is developed and maintained by the NERC Environmental Bioinformatics Centre. It contains a complete analysis and development environment that is easy for bioinformaticians to use.

Bio Linux 5.0 can run from a live DVD, which means that you can run it without installing it (without affecting your system). It can also run from a memory stick. You can install it in dual boot with Windows, or in a virtual machine if you want to run it alongside Windows at the same time.

Above all, Bio Linux 5.0 is FREE, and you can download it from HERE.

To access the NERC Environmental Bioinformatics Centre homepage, click HERE.

If you already have an Ubuntu system installed on your machine, you can download the Bio Linux 5.0 packages from HERE and install them on your Ubuntu. I don't recommend that, though, because it takes more time and effort and gives you fewer packages: some (BioPerl, Biopython, etc.) are not included in the package repository. The easy way is to download the full Bio Linux 5.0 and install it directly.

Any questions, please comment.

Monday, December 7, 2009

Bioinformatics: PDB Database

The PDB, or Protein Data Bank, is a database that contains three-dimensional structures of large biological molecules such as proteins and nucleic acids.

The data in this database is experimental (obtained by X-ray crystallography or NMR spectroscopy). Biologists and biochemists from all over the world submit structures.

The PDB database plays a major role in Bioinformatics, especially in Structural Biology.

The PDB database is updated weekly (on Tuesday).

To access PDB Database click HERE

For more information about Protein Sequence Databases, read THIS POST.

For more information about the SwissProt Database, read THIS POST.

For more information about the PIR Database, read THIS POST.

Comments are welcome.

Wednesday, December 2, 2009

Bioinformatics: PIR Database

The Protein Information Resource (PIR) is a major player in the field of Bioinformatics (proteomics). It is a joint effort between Georgetown University Medical Centre and the National Biomedical Research Foundation in Washington, D.C.

It was established in 1984 and resulted from the work of Dr. Margaret Dayhoff. Her Atlas of Protein Sequence and Structure, published from 1965–1978, was the first comprehensive collection of protein sequences.

In 1974, Dr. Dayhoff devised the concept of the protein family and superfamily, defined by sequence similarity, as a means of organising and classifying proteins.

In recent years, this concept has been exploited by the PIR Protein Sequence Database (PIR-PSD) to enable them to computer-annotate their entries with functional and structural data. This has facilitated an increase in the number of sequences in the database.

PIR provides several databases:

1- PIR-PSD: it has been the most comprehensive and expertly-curated protein sequence database in the public domain for over 20 years. In 2002, PIR joined EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to form the UniProt consortium. PIR-PSD sequences and annotations have been integrated into UniProt Knowledgebase.

2- iProClass: an integrated resource of family relationships and structural and functional features of proteins. The iProClass database provides value-added information reports for UniProtKB and unique NCBI Entrez protein sequences in UniParc, with links to over 90 biological databases, including databases for protein families, functions and pathways, interactions, structures and structural classifications, genes and genomes, ontologies, literature, and taxonomy.

3- PIR-NREF: a comprehensive database of protein sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and PDB. PIR-NREF has been discontinued inasmuch as the UniProt databases now include all of its functionalities (Final Release 1.83, 16-Jan-2006). This consolidation provides one centralized comprehensive database and minimizes duplication of work between UniProt and PIR.

4- PIRSF: The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships. The PIRSF classification system is based on whole proteins rather than on the component domains; therefore, it allows annotation of generic biochemical and specific biological functions, as well as classification of proteins without well-defined domains.

To access the PIR Database click HERE

For more information about Protein Databases, read THIS POST.

For more information about the SwissProt Database, read THIS POST.

Read THIS POST to learn how to use the SwissProt Database.

Questions are welcome.
