Bioinformatics made easy: 2010

Saturday, April 24, 2010

Bioinformatics: Multiple sequence alignment formats

People often find the different multiple sequence alignment formats confusing (which format goes with which program?). The confusion comes from the variety of programs that handle multiple sequence alignments: one program uses FASTA format, another uses MSF (Multiple Sequence Format), and so on.

There are so many formats because each one was introduced by specialists in a particular field; for example, specialists in phylogeny use the Phylip format.

So before you choose a format, ask yourself: is this format supported by the program I'm running, is it easy for me to edit, and is it widely accepted?

Some of the most popular multiple sequence alignment formats:

1- FASTA: a text format that is widely accepted and easy to read and modify.
2- MSF (Multiple Sequence Format): very popular and supported by most programs; easy to read but difficult to modify.
3- ALN: produced by ClustalW, easy to read and widely supported.
4- Phylip: a text format supported by most phylogenetic packages.
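
To make the difference between two of these formats concrete, here is a minimal Python sketch that reads an aligned FASTA text and writes it in sequential Phylip layout. The function names (parse_fasta, to_phylip) and the example sequences are made up for illustration, and the sketch assumes the strict Phylip convention of 10-character names; real programs differ in how strictly they follow it, so check the documentation of the program you are running.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned records as sequential Phylip text (10-character names)."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("an alignment needs sequences of one common length")
    header = f" {len(records)} {lengths.pop()}"
    # Names are truncated or padded to exactly 10 characters.
    rows = [f"{name[:10]:<10}{seq}" for name, seq in records]
    return "\n".join([header] + rows)


# Hypothetical two-sequence alignment, gaps written as '-'.
example = """>seq1 first example
ATG-CCA
>seq2
ATGACCA
"""
print(to_phylip(parse_fasta(example)))
```

The same records could be written back out as FASTA just as easily, which is one reason FASTA is a convenient format to convert from.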

If you have any questions, you're welcome to ask.

Tuesday, March 23, 2010

Bioinformatics careers: Bioinformatics Systems Analyst Job Offer

The DOE Joint Genome Institute (JGI) in Walnut Creek, CA has an opening for an experienced Bioinformatics Analyst to support the Plant Genome Assembly Analytics Group.

The main responsibilities include obtaining genome data, analyzing it with several software packages, and storing and organizing it.

For more information about this offer (job requirements and how to apply), you can visit this link HERE.

Bioinformatics: NCBI releases a Database of Genomic Structural Variation (dbVar)

NCBI has released a new database, the Database of Genomic Structural Variation (dbVar). It contains data from the analysis of genomic variations and their relationship with phenotype information.

The dbVar homepage contains links that help you understand what genomic structural variations are, along with FAQs and instructions for submitting information.

The database also includes an RSS feed to let you know about any updates.

Monday, March 22, 2010

Bioinformatics: Job Offer at University of Copenhagen

The Center for non-coding RNA in Technology and Health at Bioinformatics Faculty of Life Sciences, University of Copenhagen in Denmark, is a newly established center that specializes in studying non-coding RNAs. The center has an open position for a PhD fellow in bioinformatics, starting May 1st or soon thereafter. The duration is three years.

The project concentrates on studying non-coding RNAs (ncRNAs), including their role and structure.

The project will be carried out in collaboration with Prof. Henrik Nielsen, University of Copenhagen, and others.

For more on the job description and qualification requirements, you can read the full article HERE.

Friday, March 19, 2010

Bioinformatics: Protein sequence databases names for use with BLAST

In the "Nucleotide sequence databases names for use with BLAST" post, we saw the nucleotide sequence database names that we can use in a command line BLAST search. The same applies to protein sequence databases when it comes to command line BLAST.

Here are some of the protein database names that we can use with BLAST:

1- nr: Non-redundant merge of SWISS-PROT, PIR, PRF, and proteins derived from GenBank coding sequences and PDB atomic coordinates

2- swissprot: The SWISS-PROT database

3- pdb: Amino acid sequences parsed from atomic coordinates of three-dimensional structures

4- ecoli: All proteins encoded by the E. coli genome

5- yeast: All proteins encoded by the S. cerevisiae genome

6- drosoph: All proteins encoded by the D. melanogaster genome

These are some of the abbreviations used in a command line BLAST search. If you want more, you can read the command line BLAST documentation on the internet.

As a bioinformatician, learning to use command line BLAST on Linux is very important, because it makes parsing files and looking for specific information very easy: what takes 1 minute as an automated task can take half an hour by hand, and the gap grows with the amount of data you want to retrieve.

If you have any questions, you're welcome to ask :-).

Sunday, March 14, 2010

Bioinformatics: video tutorial: Using PHYLIP to build phylogenetic trees

As you know, PHYLIP (PHYlogeny Inference Package) is a set of programs that can construct phylogenetic trees.

To understand what phylogenetic trees can do for you, you can read this post HERE.

In order to build phylogenetic trees, you have to prepare a set of sequences in a multiple sequence alignment.

In this video tutorial I'm going to use one of the three methods for building phylogenetic trees, the distance method, with a program included in PHYLIP called protdist.

This video tutorial has 2 parts:

Part 1:

Part 2:

If you have any questions, leave a comment.

Sunday, March 7, 2010

Bioinformatics: Nucleotide sequence databases names for use with BLAST

In most cases people use a BLAST that is hosted on a server such as NCBI's, but sometimes you would like to use a command line BLAST installed on your own computer, on a Windows or Linux operating system.

In order to do that, you have to know the database names that you can use with command line BLAST.

Here are some of the nucleotide database names that you can use with BLAST:

1- nr : Nonredundant GenBank, a database that provides comprehensive collections of both amino acid and nucleotide sequence data, with redundancy reduced by merging sequences that are completely identical.

2- est : expressed sequence tags.

3- sts : sequence tagged sites.

4- htgs : high-throughput genomic sequences.

5- ecoli : Complete genomic sequence of E. coli.

6- yeast : Complete genomic sequence of S. cerevisiae.

7- drosoph : Complete genomic sequence of D. melanogaster.

8- mito : Complete genomic sequences of vertebrate mitochondria.

9- vector : Collection of popular cloning vectors.

These are some of the most used nucleotide database names in a BLAST search.

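This is an example of a BLAST command:

blastall -p blastn -i blast.in -d nr -o blast.out

blastall : program name.

-p : the BLAST program to run (blastn for a nucleotide query against a nucleotide database).

-i : input file.

-o : output file.

blast.in & blast.out : the input file containing the query sequence and the output file that receives the results.

-d : database.

nr : Nonredundant.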
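
If you want to launch the same kind of search from a script, a minimal Python sketch could build the argument list like this. The helper name build_blastall_command is hypothetical, and the sketch assumes a legacy blastall installation, which also expects a -p flag naming the BLAST program (blastn is assumed here). The code only prints the command; the line that would actually run it is left as a comment because it needs blastall and a formatted database on your machine.

```python
import shlex


def build_blastall_command(query, database, output, program="blastn"):
    """Return the argument list for a legacy blastall run (not executed here)."""
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]


cmd = build_blastall_command("blast.in", "nr", "blast.out")
print(shlex.join(cmd))

# To run the search for real (requires blastall and a formatted database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than one long string avoids quoting problems when file names contain spaces.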

You can read the (using BLAST to search for similarities) post to learn how to run a BLAST search against a database.

You can read (Different Blast Programs) to learn about the different BLAST programs.

Saturday, February 27, 2010

Bioinformatics: Different methods used to build phylogenetic trees

In bioinformatics there are three major methods for building phylogenetic trees. Each of these methods has its own strengths and weaknesses, as is the case with every bioinformatics program or method.

These methods are:

1- Distance methods: The algorithm takes the data (sequences) and constructs a matrix of distances between every pair of sequences. The sequences are then grouped according to their relative distances, and the last step is to construct a tree that matches this data.

2- Parsimony methods: This method searches the possible phylogenetic trees for the one that needs the minimum number of nucleotide or amino acid substitutions (mutations), so the best tree is the one with the fewest mutations.

3- Likelihood methods: This method is based on the idea that the best estimate of a parameter is the one that gives the highest probability of obtaining the observed set of measurements. Here the best tree is the one under which the observed sequences are most probable.

Likelihood methods are often regarded as the most accurate and are widely used by researchers, but they run very slowly because they are computationally demanding.

Parsimony methods give very good results, but they share much the same drawback as likelihood methods.

Distance methods, or distance-based trees, are easy to set up and can be applied in most situations, but they aren't necessarily the most accurate.
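
To give an idea of the first step of a distance method, here is a minimal Python sketch that builds a distance matrix from aligned sequences and picks the closest pair. It uses the simple p-distance (the fraction of compared positions that differ), which is an assumption made for illustration; real programs usually apply corrected distance models. The species names and sequences are made up.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing positions, skipping columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(sequences):
    """Return a nested dict: matrix[name_a][name_b] -> p-distance."""
    names = list(sequences)
    return {a: {b: p_distance(sequences[a], sequences[b]) for b in names}
            for a in names}


# Hypothetical aligned sequences, for illustration only.
alignment = {"human": "ATGCCGTA", "mouse": "ATGCTGTA", "yeast": "ATGATG-A"}
matrix = distance_matrix(alignment)

# The pair with the smallest distance would be grouped first by a clustering method.
closest = min(combinations(alignment, 2), key=lambda pair: matrix[pair[0]][pair[1]])
print(closest, matrix[closest[0]][closest[1]])
```

A tree-building method would then repeat this grouping step, merging the closest pair and recomputing distances, until all sequences are joined.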

How to prepare your sequences for a phylogenetic tree

What Phylogenetic Trees can do for you?

If you have any questions, leave a comment.

Wednesday, February 24, 2010

Bioinformatics: video tutorial: using GenomeScan to parse genomes (find exons)

In this video tutorial we are going to see how to use GenomeScan to parse large DNA sequences and find coding regions, or exons.

As you know, the genes of higher organisms such as vertebrates are more complex than others, because they contain coding regions called exons, separated by non-coding regions called introns.

To predict genes that contain several exons, you have to use very sophisticated algorithms that can locate exons and introns, and thereby locate genes.

You can read this post (Open Reading Frame (ORF)) to understand what ORFs are.

You can read this post (Using ORF Finder to locate open reading frames) for a basic program that can find ORFs.

You can read this post (Sophisticated ORF prediction with GeneMark) for a more sophisticated ORF prediction program.

Sunday, February 21, 2010

Bioinformatics: What Is PSI-BLAST?

PSI-BLAST (Position-Specific Iterative BLAST) is a program designed for protein searches. It is a BLAST search that uses a PSSM (position-specific scoring matrix).

What is a PSSM?

A PSSM (position-specific scoring matrix) is a scoring matrix built from biological sequence data, and its main role in a PSI-BLAST search is to increase the sensitivity of the results.

A PSI-BLAST search uses the PSSM as the query instead of an individual sequence. The matrix is constructed from a multiple sequence alignment, so each position of the alignment has its own position-specific scores.

How does PSI-BLAST work?

It begins with a normal BLAST search (the better the match, the higher the score). A regular BLAST search will probably miss more distant, and possibly interesting, homologs, so PSI-BLAST next constructs a PSSM from the significant hits and repeats the search with it, updating the matrix after each round until no new matches are found. This can reveal distant sequences that you may be interested in.
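
To show what a position-specific scoring matrix looks like, here is a minimal Python sketch that builds one from a tiny alignment and uses it to score sequences. It is only an illustration of the idea, not how PSI-BLAST itself computes its matrix: the uniform background frequency, the pseudocount of 1, and the three made-up sequences are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all residues equally likely
    pssm = []
    for column in zip(*aligned_seqs):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in AMINO_ACIDS
        })
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of a sequence (unknown residues score 0)."""
    return sum(column.get(res, 0.0) for column, res in zip(pssm, sequence))


# Hypothetical three-residue alignment, for illustration only.
alignment = ["MKV", "MRV", "MKI"]
pssm = build_pssm(alignment)
print(round(score(pssm, "MKV"), 2), round(score(pssm, "WWW"), 2))
```

A sequence that resembles the alignment gets a higher score than an unrelated one, and the score of a residue depends on the column it falls in, which is what makes the matrix position-specific.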

You can read this post (Different Blast Programs) to understand all the types of BLAST programs, including PSI-BLAST, and what each one does.

You can access PSI-BLAST from the EBI website HERE.

Friday, February 19, 2010

Bioinformatics: Perl and BioPerl

As you all know, there are 2 types of bioinformaticians:

1- Those who use ready-made software to analyse biological data.

2- Those who design new software for themselves or for other bioinformaticians.

As we discussed in an earlier post about the best programming language for bioinformatics HERE, Perl (Practical Extraction and Report Language) is the most powerful because:

1- It is installed or included in almost every Linux distribution.

2- Scripts written in Perl don't require compilation, so they are portable from one system to another.

3- It supports regular expressions (a very powerful way to control and manipulate strings).

4- It has built-in support for hashes, or hash tables (associations of values with keys), which makes working with keyed data very convenient.

5- A very large number of ready-made modules are available on the internet for anybody to use.

6- It is also available for Windows.

You can read this post about the best book to begin programming with Perl for bioinformatics, called Beginning Perl for Bioinformatics.

What is BioPerl?

BioPerl is a project of the Open Bioinformatics Foundation. It is a collection of modules that you can use to easily construct Perl scripts that automate bioinformatics tasks.

With BioPerl you don't have to write everything from scratch; you use ready-made modules that suit your needs (what more could you want???).

In my opinion, Perl is the best programming language for bioinformatics. If you have a different point of view, you can share it in the comments.

Tuesday, February 16, 2010

Bioinformatics: Sophisticated ORF prediction with GeneMark

ORF prediction programs are key to locating ORFs (Open Reading Frames), and once we locate ORFs we have an approximate idea of the location of a gene that codes for a protein.

To read about ORFs or Open Reading Frames click HERE.

In the video tutorial on how to use ORF Finder to predict ORFs, I showed you how to use the ORF Finder program developed by NCBI to locate ORFs. As I said there, this software is very basic, so we can use it only with simple genomes (viral, bacterial, etc.), because this kind of program can identify only about 80 percent of the protein coding regions that you may be interested in.

You can see the ORF Finder video tutorial HERE.

In this video tutorial I'm going to show you a more sophisticated approach that can predict ORFs in bacteria, viruses, eukaryotes and more. GeneMark is a family of different programs that use a very sophisticated method.

Sunday, February 14, 2010

Bioinformatics: Linux Vs Windows (What's Better For Bioinformatics)?

Bioinformaticians have 2 main choices of operating system, Linux and Windows, and there is a huge difference between the two.

Windows:

Windows is known for its simplicity (anyone with basic knowledge can work with Windows): it is user friendly, with a great interface and great media support. However, it is less adapted to bioinformaticians' needs, and:

- It's not free.
- Its source is not open to the public.
- Most of its software is not free.
- Automating instructions is harder than on Linux.

I'm not saying that Windows isn't good for you, because I work with it most of the time. But if you are a bioinformatician and you want to program new software or automate some instructions, then Linux is definitely for you; if you want to use ready-made software to analyse your data, you can use Windows.

UNIX (Linux):

Linux is a very powerful operating system, especially for programmers, because it gives you full control over your machine:

- It has a lot of programming tools (languages and interfaces).
- Other free software (web servers, database management systems, visualisation software, text editors, etc.).
- Statistical analysis tools (like R).
- Unix is more stable and runs fast.
- Vast documentation for software (how to use stuff!!!).

So if you are a bioinformatician who is more attracted to biology (you use bioinformatics software only to analyse your biological data), then you can use Windows. But if you are a cyber geek who wants to develop new software for bioinformaticians, then you can use Linux. I personally prefer Linux, but in the end it's up to you to decide.

If you want to use BioLinux 5.0 you can read a post about it HERE.

If you want to know how to get BioLinux 5.0 working on your computer, you can read this post HERE.

If you have any questions, put them in the comments.

Friday, February 12, 2010

Bioinformatics: OMIM (Online Mendelian Inheritance in Man) Overview

Bioinformatics is key to analyzing genes, proteins, genomes, mutations and so on, and researchers use this information to understand genetic diseases, especially in humans. This is where bioinformatics plays a major role in finding, analyzing, and treating genetic disorders, and for that purpose NCBI has developed OMIM and made it available to the public.

OMIM (Online Mendelian Inheritance in Man) is a database that contains a catalog of human genes and genetic disorders. The database was developed by NCBI (National Center for Biotechnology Information) and is hosted on its servers.

OMIM contains information about known genetic disorders and links to other resources such as MEDLINE (citations and abstracts), and even to entries in other NCBI databases for the genes responsible for certain diseases.

OMIM offers three ways to search for genetic disorders or related information:

1- Through a normal search: by typing a keyword, as with most databases.

2- By using the Gene map: where you can browse a table of genes organized by cytogenetic map location.

3- By using the Morbid map: which is an alphabetical table of all the genetic disorders featured in OMIM.

To access OMIM click HERE.

If you have any questions, you're welcome to ask.

Wednesday, February 10, 2010

Bioinformatics: Using ORF Finder to locate open reading frames

In this video tutorial, I'm going to show you how to use the ORF Finder software to locate open reading frames (possible protein coding genes).

ORF Finder is a program available on the NCBI website. It is designed to locate open reading frames in a given DNA sequence in all six reading frames.
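
To see what searching all six reading frames means, here is a minimal Python sketch that scans both strands in three frames each and reports every stretch that starts with ATG and ends at a stop codon. It is a simplified illustration and not how ORF Finder is implemented: it assumes the standard stop codons, requires an ATG start, and the function names, the min_codons parameter, and the example sequence are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, orf_sequence) for ATG...stop ORFs in all six frames.

    Frames are numbered 1 to 3 on each strand; the ORF length counts the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical short sequence, for illustration only.
print(find_orfs("CCATGAAATAGGG"))
```

Real genomes need a much larger min_codons value, otherwise short ORFs that occur by chance will flood the results.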

To learn more about Open Reading Frames, you can read this post HERE.

Note: ORF Finder is a basic program, so you can use it for non-complex genes (microbial genomes).

There are more sophisticated programs, like GeneMark, that can handle the complexity of the genomes of higher organisms.

Monday, February 8, 2010

Bioinformatics: What is MEDLINE and PubMed?

Citations are very important in research in any field, and in bioinformatics they are indispensable. For this reason the United States National Library of Medicine (NLM) provides biomedical literature to researchers and students online.

Since 1879, the NLM has published the Index Medicus, an index or guide to articles. With the evolution of information technology, Index Medicus became a database now known as MEDLINE.

What is MEDLINE?

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a huge bibliographic database that contains citations to articles from academic journals covering biology, all branches of medicine, health, molecular biology, biochemistry, microbiology and more. This data is freely accessible over the internet via PubMed.

What is PubMed?

PubMed is part of the Entrez retrieval system. It is a search engine for accessing MEDLINE citations and abstracts, with links to full text articles. In addition to MEDLINE citations, which make up most of what PubMed finds, PubMed provides access to other records, including in-process citations and some life science journals that submit full text to PubMed Central and may not have been recommended for inclusion in MEDLINE.

As we said before, PubMed is part of the Entrez retrieval system, which is part of the NCBI website, and you can access it from the NCBI website HERE.

You can find more information about MEDLINE and PubMed, including tutorials and quick tours, HERE.

If you have any questions, leave a comment.

Saturday, February 6, 2010

Bioinformatics: How to prepare your sequences for a phylogenetic tree

In order to make a phylogenetic tree, we have to do a multiple sequence alignment first, because you can't make a good and accurate tree without an accurate multiple sequence alignment.

To learn how to build a multiple sequence alignment, you can see this video tutorial HERE.

To build a multiple sequence alignment and then a phylogenetic tree, you have to prepare your sequences with some factors in mind:

1- Avoid using sequence fragments: align complete sequences, not fragments. If you do want to align fragments, use the corresponding fragment for every sequence in the alignment.

2- Avoid using too many sequences: a large dataset can make your phylogenetic tree less accurate, because many algorithms, especially those in online tools, can't handle large datasets well; the run takes a long time and the accuracy of the tree suffers.

3- Avoid aligning xenologs: they result from lateral transfer, for example by a virus or a bacterium, so they don't reflect the original history of your gene. If you want more information about xenologs you can read this post HERE.

4- Avoid recombinant sequences: a recombinant sequence combines material from two sources (possibly very distinct species), and phylogenetic tree builders can't handle the history of two distinct species at the same time.

5- Add a distant sequence to your alignment: it has to be similar but to have diverged a long time ago, because it acts as an outgroup that shows where the first common ancestor sits in your phylogenetic tree.

6- Don't depend on guide trees: on the EBI server, for example, when you make a multiple sequence alignment with ClustalW, a guide tree is included in the results. Don't use this tree as a phylogenetic tree. It is only the guide tree that ClustalW uses to assemble the multiple sequence alignment, and using it in place of a phylogenetic tree will give you false results.

If you have any questions, you're welcome to ask.

Thursday, February 4, 2010

Bioinformatics: different types of homologous genes

The main purpose of phylogeny is to pick what we call homologous genes and compare them, in order to construct a phylogenetic tree of their history according to their similarities.

Homologous genes are genes that derive from a common ancestor. To understand the types of homologous genes and how exactly they arise, we have to know a couple of things:

* Speciation: the phenomenon in which a common ancestor gives rise to two subgroups that slowly drift away from their common genetic makeup to become distinct species.

* Duplication: within the genome of a single species, a gene is duplicated. In this case one of the copies may keep the same function while the other may change.

There are three types of homologous genes:

1- Orthologs: orthologs are 2 genes separated by speciation. This generally means that the 2 genes exist in 2 different species but descend from the same gene in their common ancestor.

2- Paralogs: paralogs are 2 genes separated by duplication. This means that a gene in one genome was duplicated into 2 or more genes.

3- Xenologs: xenologs result from lateral transfer between 2 species or organisms, that is, a transfer of DNA from one species to another, such as the transfer of a DNA sequence from a virus or a bacterium to another species.

In bioinformatics, collecting these genes from BLAST searches and aligning them in a multiple sequence alignment is the main route to constructing a phylogenetic tree.

If you have any questions, you're welcome to ask.

Tuesday, February 2, 2010

How studying rRNA can help us study evolution in Bioinformatics

Many of you are asking how scientists have made an approximate tree of life that includes almost all discovered species. Here is the answer:

In evolutionary bioinformatics, scientists have tried to find a gene that exists in all living organisms, and a very appropriate gene in this case is the gene coding for rRNA.

Ribosomal RNA (rRNA) is the central component of the ribosome, where proteins are manufactured in all living organisms. The ribosome interacts with tRNA (transfer RNA) and mRNA (messenger RNA) to build a protein from amino acids.

So the main criterion for studying evolution is finding a conserved gene that exists in all living organisms. When scientists discover a new bacterium, for example, one of the first things they do is sequence its rRNA to identify its taxonomic group and estimate rates of species divergence.

Because rRNAs have played, and still play, a major role in evolutionary bioinformatics, scientists and researchers have built specialized databases such as RDP and the European database, which store thousands of rRNA sequences.

If you have any questions, leave a comment.

Sunday, January 31, 2010

5 Things any Bioinformatician should know

1- How to work with a computer: by that I mean how to work with at least one operating system, Windows for example. Most bioinformatics students and researchers like Linux because it's open source and most of its software is free, but Windows is not bad at all for bioinformatics, because much of the software designed for Linux is available for Windows too.

2- How to use internet browsers: this is indispensable because the internet is what made bioinformatics move so fast. If you want to be a bioinformatician, you have to know how to work with internet browsers (Internet Explorer, Netscape, Chrome, Firefox). I personally prefer Firefox; I find it very easy and powerful.

3- How to install new software: you should have this basic knowledge. Installing Windows-based software is a piece of cake compared to installing Linux-based software.

4- A little knowledge of molecular biology: you can't be a bioinformatician without a little knowledge of biology, especially molecular biology and genetics. It would be like wanting to play the guitar without knowing what a guitar is...!

5- How to surf the internet: this is very important, as most bioinformatics operations are done online, so you have to know how to open a website, browse it, download from it and so on.

The most important part of knowing how to surf the internet is knowing how to use search engines, because they will lead you to almost anything you need.

These are the basic skills that any bioinformatics student should have.

If you have more suggestions, please comment.

Friday, January 29, 2010

List of the most popular and useful Databases in Bioinformatics

As biological data grows every day, maintaining this huge amount of data has become hard, so I'll give you what I consider the best organized and maintained bioinformatics databases.

GenBank on NCBI: this database is the most powerful in bioinformatics because it covers everything: proteins, genes, genomes, structures, etc.
To visit NCBI click HERE.

Swiss-Prot: if your query is a protein sequence, I advise you to use Swiss-Prot, which is located on the ExPASy proteomics server. There you'll also find dozens of useful programs that you can use to analyze your sequence.
To visit Swiss-Prot or the ExPASy proteomics server click HERE.

Integrated Microbial Genomes: this database is for complete genomes. I like it because it's very organized and anyone can get used to it in a few minutes.
To visit the Integrated Microbial Genomes click HERE.

TIGR: The Institute for Genomic Research, founded by Craig Venter, runs a project for complete bacterial genomes. If you are a microbiologist, this database is exactly for you. In addition to the database, bioinformaticians working on the TIGR project have developed a set of very useful tools for analysing the genomes in the database, such as GLIMMER and MUMmer.
To visit TIGR project click HERE.

Ensembl: for me it's the best database for complete genomes, because it contains a lot of graphical tools for interpreting and analyzing data. That means you don't get bored while exploring it; everything is visual!
To visit Ensembl click HERE.

There are more databases and projects on the internet, but I found these databases very helpful in my research.

If you know more useful databases or projects, you can post them in the comment section.

Wednesday, January 27, 2010

Bioinformatics: Transcriptomics

In human DNA, less than 5% of the genome codes for proteins. Much of the rest is thought to play a role in watching over, controlling and regulating that fraction, which is part of why cellular processes are so precise.

Now, after the extensive sequencing projects on different genomes, the new challenge is to identify the expression patterns of the genes we have sequenced. That's where transcriptomics becomes very useful.

So what is Transcriptomics?

Transcriptomics is the study of the complete set of RNA transcripts produced by the genome (the transcriptome) at a given time.

Transcriptomics, also called gene expression profiling or genome-wide expression profiling, helps us understand the genes and pathways involved in biological processes; simply put, it examines the expression levels of mRNAs.

So what can transcriptomics do for us?

As mentioned before, transcriptomics gives us answers about which gene is activated, when it is activated, what activates it, and so on.

In transcriptomics, similar expression patterns are a clue that genes may be functionally related and share the same genetic control mechanism.
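
One common way to compare expression patterns is to compute the correlation between the expression profiles of two genes. Here is a minimal Python sketch using the Pearson correlation; the choice of Pearson correlation is an assumption made for illustration (other similarity measures are also used), and the gene names and expression values are invented, not real measurements.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2 points)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Invented expression levels at four time points, for illustration only.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [2.0, 4.0, 8.0, 16.0],
    "geneC": [8.0, 4.0, 2.0, 1.0],
}

print(pearson(profiles["geneA"], profiles["geneB"]))  # rises together with geneA
print(pearson(profiles["geneA"], profiles["geneC"]))  # falls while geneA rises
```

A correlation close to 1 suggests that two genes are switched on and off together, which is the kind of clue to a shared control mechanism described above; it is a hint, not proof.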

The most common technology used to study expression levels is the DNA microarray.

To understand what microarrays are used for and their main applications, please read THIS POST.

If you have any questions, feel free to comment.

Friday, January 22, 2010

The best Bioinformatics programming language

As you know, bioinformatics is the use of computer hardware and software to analyze or interpret biological data. Most bioinformaticians use ready-made software, and most of this software can give you exactly what you want.

But let's say that you want to extract some specific data from database files, for example. What will you do then?

Bioinformatics software is written by programming specialists using programming languages (C, C++, Perl, Python, Java, etc.). I'm not saying that you have to learn them all, but Perl (Practical Extraction and Report Language) is the most powerful and best suited to bioinformatics.

Why exactly Perl?

You may say that we have a lot of programming languages to choose from, so why Perl? We have already seen bioinformatics programs written in other languages (C, Java, Python, FORTRAN, etc.), but Perl is the best in the field because it is very good at detecting patterns in data, especially in what we call strings of text.

By strings we mean the characters of DNA/RNA or protein sequences (ATGATCCAGT for example).
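
To make the idea of detecting patterns in strings concrete, here is a minimal sketch of a motif search with regular expressions. It is written in Python purely for illustration; Perl's regular expressions express the same patterns. The function name find_motif and the example sequence (the string above extended with a few made-up bases) are assumptions.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for every non-overlapping match (0-based positions)."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Made-up DNA string, for illustration only.
dna = "ATGATCCAGTGAATTCAT"

# An exact motif: GAATTC is the recognition site of the restriction enzyme EcoRI.
print(find_motif(dna, "GAATTC"))

# A pattern with a choice: AT followed by either C or G.
print(find_motif(dna, "AT[CG]"))
```

The second pattern shows why regular expressions are so useful for sequences: one short expression describes a whole family of related motifs.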

I found the O'Reilly book 'Beginning Perl for Bioinformatics' very helpful, and I advise you to read it to better understand how to design your own programs, suited to your needs, instead of using other people's programs.

If you have any questions, leave a comment.

Books: Beginning Perl for Bioinformatics

Beginning Perl for Bioinformatics

By: James Tisdall

Publisher: O'Reilly Media, Inc.

I found this book very helpful to understand the basics of using PERL to design programs that you need, to extract or manipulate data.

If you read this book you'll be able to use your own designed programs to parse database files and extract only what you need and even analyze DNA/RNA or protein data.

Table of Contents

Copyright

Preface

What Is Bioinformatics?

About This Book

Who This Book Is For

Why Should I Learn to Program?

Structure of This Book

Conventions Used in This Book

Comments and Questions

Acknowledgments

Chapter 1. Biology and Computer Science

Section 1.1. The Organization of DNA

Section 1.2. The Organization of Proteins

Section 1.3. In Silico

Section 1.4. Limits to Computation

Chapter 2. Getting Started with Perl

Section 2.1. A Low and Long Learning Curve

Section 2.2. Perl's Benefits

Section 2.3. Installing Perl on Your Computer

Section 2.4. How to Run Perl Programs

Section 2.5. Text Editors

Section 2.6. Finding Help

Chapter 3. The Art of Programming

Section 3.1. Individual Approaches to Programming

Section 3.2. Edit—Run—Revise (and Save)

Section 3.3. An Environment of Programs

Section 3.4. Programming Strategies

Section 3.5. The Programming Process

Chapter 4. Sequences and Strings

Section 4.1. Representing Sequence Data

Section 4.2. A Program to Store a DNA Sequence

Section 4.3. Concatenating DNA Fragments

Section 4.4. Transcription: DNA to RNA

Section 4.5. Using the Perl Documentation

Section 4.6. Calculating the Reverse Complement in Perl

Section 4.7. Proteins, Files, and Arrays

Section 4.8. Reading Proteins in Files

Section 4.9. Arrays

Section 4.10. Scalar and List Context

Section 4.11. Exercises

Chapter 5. Motifs and Loops

Section 5.1. Flow Control

Section 5.2. Code Layout

Section 5.3. Finding Motifs

Section 5.4. Counting Nucleotides

Section 5.5. Exploding Strings into Arrays

Section 5.6. Operating on Strings

Section 5.7. Writing to Files

Section 5.8. Exercises

Chapter 6. Subroutines and Bugs

Section 6.1. Subroutines

Section 6.2. Scoping and Subroutines

Section 6.3. Command-Line Arguments and Arrays

Section 6.4. Passing Data to Subroutines

Section 6.5. Modules and Libraries of Subroutines

Section 6.6. Fixing Bugs in Your Code

Section 6.7. Exercises

Chapter 7. Mutations and Randomization

Section 7.1. Random Number Generators

Section 7.2. A Program Using Randomization

Section 7.3. A Program to Simulate DNA Mutation

Section 7.4. Generating Random DNA

Section 7.5. Analyzing DNA

Section 7.6. Exercises

Chapter 8. The Genetic Code

Section 8.1. Hashes

Section 8.2. Data Structures and Algorithms for Biology

Section 8.3. The Genetic Code

Section 8.4. Translating DNA into Proteins

Section 8.5. Reading DNA from Files in FASTA Format

Section 8.6. Reading Frames

Section 8.7. Exercises

Chapter 9. Restriction Maps and Regular Expressions

Section 9.1. Regular Expressions

Section 9.2. Restriction Maps and Restriction Enzymes

Section 9.3. Perl Operations

Section 9.4. Exercises

Chapter 10. GenBank

Section 10.1. GenBank Files

Section 10.2. GenBank Libraries

Section 10.3. Separating Sequence and Annotation

Section 10.4. Parsing Annotations

Section 10.5. Indexing GenBank with DBM

Section 10.6. Exercises

Chapter 11. Protein Data Bank

Section 11.1. Overview of PDB

Section 11.2. Files and Folders

Section 11.3. PDB Files

Section 11.4. Parsing PDB Files

Section 11.5. Controlling Other Programs

Section 11.6. Exercises

Chapter 12. BLAST

Section 12.1. Obtaining BLAST

Section 12.2. String Matching and Homology

Section 12.3. BLAST Output Files

Section 12.4. Parsing BLAST Output

Section 12.5. Presenting Data

Section 12.6. Bioperl

Section 12.7. Exercises

Chapter 13. Further Topics

Section 13.1. The Art of Program Design

Section 13.2. Web Programming

Section 13.3. Algorithms and Sequence Alignment

Section 13.4. Object-Oriented Programming

Section 13.5. Perl Modules

Section 13.6. Complex Data Structures

Section 13.7. Relational Databases

Section 13.8. Microarrays and XML

Section 13.9. Graphics Programming

Section 13.10. Modeling Networks

Section 13.11. DNA Computers

Appendix A. Resources

Section A.1. Perl

Section A.2. Computer Science

Section A.3. Linux

Section A.4. Bioinformatics

Section A.5. Molecular Biology

Appendix B. Perl Summary

Section B.1. Command Interpretation

Section B.2. Comments

Section B.3. Scalar Values and Scalar Variables

Section B.4. Assignment

Section B.5. Statements and Blocks

Section B.6. Arrays

Section B.7. Hashes

Section B.8. Operators

Section B.9. Operator Precedence

Section B.10. Basic Operators

Section B.11. Conditionals and Logical Operators

Section B.12. Binding Operators

Section B.13. Loops

Section B.14. Input/Output

Section B.15. Regular Expressions

Section B.16. Scalar and List Context

Section B.17. Subroutines and Modules

Section B.18. Built-in Functions

Index

Bioinformatics: Different Blast Programs

BLAST (Basic Local Alignment Search Tool) is a set of programs that search a database for sequences similar to your query sequence, so you can find hundreds of similar sequences in about 20 seconds.

BLAST includes a set of programs, each with a specific role:

BLASTN: Nucleotide query sequence against a nucleotide sequence database.

BLASTP: Amino acid query sequence against a protein sequence database. You can find it HERE.

BLASTX: Nucleotide query sequence translated in all six reading frames against a protein sequence database.

TBLASTX: Six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

TBLASTN: Protein query sequence against a nucleotide sequence database translated in all six reading frames. You can find it HERE.

Or you can find them all at ch.EMBnet.org.
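
BLASTX, TBLASTX and TBLASTN all rely on the idea of six reading frames: three on the given strand (starting at base 1, 2 or 3) and three on the reverse complement. The sketch below, in Python and purely for illustration, lists those six frames for a short DNA string, trimmed to whole codons. The function names are invented for this example, and this is not how BLAST itself is implemented.

```python
# Translation table used to complement each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return the six reading frames of dna, each trimmed to whole codons."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            sub = seq[offset:]
            # Drop the 1 or 2 trailing bases that do not form a full codon.
            frames[f"{strand}{offset + 1}"] = sub[: len(sub) // 3 * 3]
    return frames

for name, frame in six_frames("ATGATCCAGT").items():
    print(name, frame)
```

Each of the six strings would then be translated codon by codon into amino acids before being compared with protein sequences.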

You can also find other programs, such as:

1- PSI-BLAST: Position-Specific Iterative BLAST detects weak homologs by building a profile from a multiple alignment of the highest-scoring hits in an initial BLAST search.
Available at NCBI.

2- PHI-BLAST: Pattern-Hit Initiated BLAST combines matching of regular expressions with local alignments surrounding the match.
Available at NCBI.
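
To give an idea of what "matching of regular expressions" looks like for PHI-BLAST: its patterns are written in PROSITE-style syntax, which is close to an ordinary regular expression. The Python sketch below converts a simple PROSITE-style pattern into a regular expression and searches a protein string with it. The helper name prosite_to_regex and the test sequence are made up for this example, and the conversion only covers the basic syntax elements (x, ranges, [..], {..}, < and >).

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    regex = pattern.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")      # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".").replace("-", "")       # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")      # sequence ends
    return regex

# A P-loop style pattern: [AG], any 4 residues, G, K, then S or T.
regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(regex)

protein = "MGAGESGKSTLL"  # invented sequence for the demo
match = re.search(regex, protein)
if match:
    print(match.start(), match.group())
```

PHI-BLAST then builds local alignments around the places where the pattern matches, which is the part this sketch does not attempt.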

To learn how to use BLAST to search for similarities, you can watch this video tutorial HERE.

Any questions are welcome.

Sunday, January 17, 2010

Bioinformatics: How to install BioLinux 5.0

Bioinformatics: Tutorials and Lessons: How to install BioLinux 5.0

As I described before in this BioLinux 5.0 post, BioLinux 5.0 is a Linux (Ubuntu) environment that has more than 500 bioinformatics programs installed, and it's free for all bioinformatics students and researchers.

There are 4 ways to have BioLinux 5.0 working on your computer:

1- Install it directly on a computer with an empty hard drive.

2- Install it in dual boot alongside another operating system (Windows, for example).

3- Install it on a virtual machine such as VMware Workstation.

4- Download the virtual appliance directly and run it with a virtual machine player.

- The first method doesn't work for everyone, because most people already have Windows installed on their computers.

- The second method is good, but people are afraid of damaging their first operating system.

- The third method is great, but it takes some time to install and configure the system on the virtual machine.

Most newcomers to bioinformatics think that installing Linux is a little more complicated than installing Windows, so in this post I recommend the 4th method, which is the easiest and simplest, even for someone who has never installed a Linux operating system.

The first thing to do is download the BioLinux 5.0 appliance from HERE.

The second thing is to extract the archive onto your hard drive.

The third thing is to download the free VMware Player from HERE.

The last thing is to open the file that has the extension " *.vmx " with VMware Player.

Here is an overview of the operating system BioLinux 5.0 appliance :

Appliance Type:

Community

Description:

Guest OS config:
Distro: Bio-Linux 5.0 (Ubuntu 8.04.1 - Hardy Heron)
Kernel: 2.6.24-23-generic
Desktop WM: GNOME 2.22.3
Filesystem: ext3
Releasedate: January 12 2009

Virtual Machine config:
Virtual Disk: 40GB
Used Space: 6 GB
Networking: NAT
VMwaretools: 7.8.4-126130 installed
Resolution: Dynamic (default=1152x864)

The following were tested and work:

- USB Mouse, USB Pendrive, USB Printer
- Sound (vmware ensonic driver)
- Video/Video (Firefox on CNN.com and Youtube.com)
- Internet (network: eth/dhcp)
- Cut and paste and drag and drop between Host/Guest are installed and work perfectly.

root ID: sudo
Password: bagside

Download: http (US server)
Compression: 7z

Features & Benefits

Standard install.

Pricing

Free

If you have any questions please comment.

Saturday, January 16, 2010

Bioinformatics: Tutorials & Lessons: Predict Protein Secondary Structure using SABLE Program

Protein structure plays a major role in bioinformatics, especially structural bioinformatics, so predicting protein structure can give us a lot of indispensable information.

Protein structure is described at 3 levels:

1- Primary structure: You can read this post about it HERE.

2- Secondary structure.

3- Tertiary structure or 3D: You can read this post about it HERE.

In this video tutorial I'm going to show you what I consider the best program for predicting protein secondary structure: SABLE.

Wednesday, January 13, 2010

Bioinformatics: Proteomics: Protein Primary structure

As you know, in structural bioinformatics, analysing a protein's structure begins with its primary structure, then its secondary structure, then its tertiary structure.

Primary structure doesn't tell us how proteins interact with each other the way secondary and tertiary structure do, but it does tell us about segments of the protein that display a special composition. With this information we can infer protein properties such as:

1- Hydrophobic regions: generally found anchored in the membrane.

2- Hydrophilic regions: found on the outside, so they form the protein surface.

3- Coiled-coil regions: these indicate protein-protein interaction potential.
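
One common way to look for hydrophobic and hydrophilic regions in a primary structure is a sliding-window hydropathy profile. The Python sketch below uses the Kyte-Doolittle hydropathy scale (positive values are hydrophobic, negative values hydrophilic). The window size of 9, the threshold of 1.6 and the example sequence are arbitrary values chosen for this illustration, not recommendations.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic, negative = hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Average hydropathy of every stretch of `window` consecutive residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def hydrophobic_windows(protein, window=9, threshold=1.6):
    """0-based start positions of windows whose average exceeds threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score > threshold]

example = "MKKDEELLIVAVLLIVGSRRKE"  # made-up sequence for the demo
print(hydrophobic_windows(example, window=9))
```

Windows with a high average are candidates for membrane-anchored segments, while windows with a strongly negative average are more likely to sit on the protein surface.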

Any comments are welcome.

Sunday, January 10, 2010

Bioinformatics: Tutorials & Lessons: Using ClustalW to do a multiple sequence alignment

In this video tutorial I'll show how to use the ClustalW program to do a multiple sequence alignment.

You can read about multiple sequence alignment and the ClustalW program in this post HERE.

If you want more information about the main applications of multiple sequence alignment, you can read this post HERE.

Thursday, January 7, 2010

Bioinformatics: Genomics: Different Types of RNAs

RNAs are macromolecules that play a major and necessary role in biology; they act as intermediaries between DNA and proteins.

RNAs can fold into secondary and even tertiary structures.

The main purpose of studying RNAs in bioinformatics is to try to predict their structures, in order to better understand their interactions and their stability.

You can read about RNA structures in this post HERE.

There are 2 main types of RNA:

1- Coding RNAs: these correspond to mRNA (messenger RNA), which acts as a transmitter, carrying information from DNA to be translated into protein.

2- Non-coding RNAs: such as rRNA (ribosomal RNA), tRNA (transfer RNA), snRNA, etc.

mRNA: messenger RNA.
rRNA: ribosomal RNA.
tRNA: transfer RNA.
snRNA: small nuclear RNA.
snoRNA: small nucleolar RNA.
scRNA: small cytoplasmic RNA.
tmRNA: transfer-messenger RNA.
siRNA: small interfering RNA.

Any comments are welcome.

Tuesday, January 5, 2010

Bioinformatics: Genomics: RNA secondary structure

Just as proteins can have complex structures, so can RNAs: a major advance in biology in the 1970s showed that RNAs can have complex 2D and even 3D structures.

The good news is that RNAs obey folding patterns or laws that are much simpler than the complex laws of protein folding.

In order for an RNA molecule to work, it has to be protected from solvents. To do that, RNA bases pair with other bases, and this pairing forms the RNA secondary structure.

When two stretches of the same RNA molecule are perfectly compatible, or complementary to each other, they form what's called a STEM.

Note: STEMs don't have to be 100% compatible, so we can also find unpaired residues in them.

When the stretches aren't compatible, they form what's called a LOOP.
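
Here is a small Python sketch of what "compatible" means for two stretches of the same RNA molecule. Both stretches are written 5' to 3', so the first base of the left stretch faces the last base of the right stretch. The function name paired_fraction and the hairpin used in the example are invented for this illustration, and only the standard A-U and G-C pairs are counted (G-U wobble pairs are ignored to keep it simple).

```python
# Standard Watson-Crick base pairs in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def paired_fraction(left, right):
    """Fraction of positions where left pairs with right read backwards."""
    if not left or len(left) != len(right):
        raise ValueError("stretches must be non-empty and of equal length")
    paired = sum((a, b) in PAIRS for a, b in zip(left, reversed(right)))
    return paired / len(left)

# Hairpin 5'-GGCAC UUCG GUGCC-3': two stretches around a 4-base loop.
print(paired_fraction("GGCAC", "GUGCC"))  # perfectly complementary stem
print(paired_fraction("GGCAC", "GUGAC"))  # stem with one unpaired position
```

A fraction of 1.0 corresponds to a perfect stem, a slightly lower value to a stem with a few unpaired residues, and a low value to stretches that are better described as a loop.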

Tertiary interactions may also occur in an RNA molecule, but it's very difficult to predict these tertiary interactions.

Any comments are welcome.

Saturday, January 2, 2010

Bioinformatics: Proteomics: Protein 3D structure

As we all know, the succession of amino acids in a protein sequence is what defines the protein's structure, so the 3D structure of a protein is a result of its amino acid sequence. For example, hydrophobic amino acids have no desire to interact with water, so they tend not to be on the surface; hydrophilic amino acids (or residues), on the other hand, tend to appear on the surface, where they can interact with water.

The protein's 3D structure is defined not only by these properties but also by the electric charge of the amino acids, their interactions with their neighbors, etc.

The main rule of thumb in the structural bioinformatics field is "similar sequences = similar shapes or 3D structures". The converse often holds too, although proteins with quite different sequences can sometimes share a similar shape.

So the relationship will be like this:

Sequence ---> Structure ---> Function

The sequence determines the structure, which determines the function.
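
Since the whole chain starts from sequence similarity, it helps to see how "similar sequences" can be measured. A simple measure is percent identity between two sequences that have already been aligned. The Python sketch below ignores columns that contain a gap, which is only one of several conventions in use; the function name and the example sequences are made up for this illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Two invented, already aligned protein fragments differing at one position.
print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
```

The higher the percent identity between two proteins, the more reasonable it is to expect that they share a similar 3D structure.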

The field that studies all of this is called Structural Bioinformatics.

We can identify the protein 3D structure using 2 distinct methods:

1- Experimental: in the lab, using X-ray crystallography for example.
2- Theoretical: by predicting the structure from the sequence using specialized bioinformatics tools.

Predicting protein 2D structure is now relatively easy, but 3D structure remains an obstacle for bioinformaticians because of its complexity.

To learn about protein databases, you can read this article HERE.
To learn more about 3D structural databases, you can read about the PDB database HERE.

Any comments are welcome.
