Cd-hit

Introduction

CD-HIT is primarily a clustering program. It takes FASTA sequence files as input, typically files intended to serve as databases against which query sequence files will be searched.

A major concern with such FASTA files is how much redundancy they contain. Depending on the experiment or analysis being run, the database file may hold more detail than is useful, in which case it helps to cluster similar sequences together. CD-HIT is used for this.

Common use-cases

Clustering the Antibiotic Resistance Gene database

cdhit-est -i argannot-nt_doc.fasta -o argannot_cdhit90 -d 0 > argannot_cdhit90.stdout
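
Here -i names the input FASTA file, -o names the output, and -d 0 keeps each sequence name in the cluster listing up to its first space. No -c identity threshold is given, so the "90" in the output name presumably relies on the program's default of 0.9. The run should leave the representative sequences in argannot_cdhit90 and the cluster listing in argannot_cdhit90.clstr. The following is a minimal sketch for inspecting that listing. It assumes the usual .clstr layout, where each cluster starts with a ">Cluster N" line and the representative sequence's line ends in "*". The function names count_clusters and list_representatives are made up for this example and are not part of CD-HIT.

```shell
# Hypothetical helpers for a CD-HIT .clstr file (not part of CD-HIT itself).
# Assumed layout: each cluster starts with a ">Cluster N" line, and the
# representative member's line ends with "... *".

# Print the number of clusters in the given .clstr file.
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Print the name of each cluster's representative sequence, one per line.
list_representatives() {
    # Keep only representative lines, then strip everything up to the
    # last ">" and the trailing "... *" marker.
    grep '\*$' "$1" | sed -e 's/^.*>//' -e 's/\.\.\. \*$//'
}
```

After the clustering run, calling count_clusters argannot_cdhit90.clstr gives the number of clusters, and list_representatives argannot_cdhit90.clstr lists one representative name per cluster.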