Difference between revisions of "Cd-hit"


Latest revision as of 16:23, 1 May 2016

Introduction

CD-HIT is primarily a clustering program. Its input is a FASTA sequence file, typically one intended to serve as a database against which query sequences will be searched.

A major concern with such FASTA files is how much redundancy they contain. Depending on the experiment or analysis being run, the database may hold more detail than is useful, in which case it helps to cluster similar sequences together. CD-HIT is used for this.

Common use-cases

Clustering the Antibiotic Resistance Gene database

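cdhit-est -i argannot-nt_doc.fasta -o argannot_cdhit90 -d 0 > argannot_cdhit90.stdout

Here -i names the input FASTA file and -o the output file of representative sequences. With -d 0, sequence names in the cluster report are taken from the FASTA header up to the first space instead of being truncated to a fixed length. No identity threshold (-c) is given, so the program's default applies; the "90" in the output name reflects a 0.9 threshold. The program's log is redirected to argannot_cdhit90.stdout.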
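After a run like the one above, CD-HIT also writes a cluster report next to the output file, with the same name plus a .clstr extension. The following is a minimal sketch for inspecting that report. The helper names count_clusters and list_representatives are hypothetical (they are not part of CD-HIT), and the sketch assumes the usual .clstr layout: each cluster starts with a line beginning ">Cluster", and the representative member's line ends with "*".

```shell
# Hypothetical helper: count the clusters in a CD-HIT .clstr file.
# Assumes every cluster starts with a header line beginning ">Cluster".
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Hypothetical helper: print the name of each cluster's representative.
# Assumes the representative member line ends with "*" and that the
# sequence name sits between ">" and the trailing "... *".
list_representatives() {
    grep '\*$' "$1" | sed -e 's/^.*>//' -e 's/\.\.\. \*$//'
}
```

For the run above, these would be called as count_clusters argannot_cdhit90.clstr and list_representatives argannot_cdhit90.clstr.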