Repeatmasker

Latest revision as of 18:03, 2 November 2017

Introduction

Mainstay repeat analysis program. It is probably getting a bit old now and depends too much on RepBase, but it is also building alignment with Dfam (which is a more modern approach to repeat finding).

Documentation on usage doesn't seem to be great, but there is at least this:

  • webhelp: http://www.repeatmasker.org/webrepeatmaskerhelp.html

Typical launch command lines

This command runs a slow search (-s) on 8 processors (-pa 8) against the yeast library (-species yeast), writes to an output directory called rp0 (-dir rp0), requests GFF output (-gff) and uses the ncbi engine (-e ncbi).

Beware: the output directory must be created beforehand.

RepeatMasker -species yeast -pa 8 -dir rp0 -gff -e ncbi -s W303LYZE.fasta
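
Since the output directory has to exist before the run, the two steps can be chained. This is only a sketch: it reuses the directory and file names from the command above, and the check for RepeatMasker in the PATH is an added convenience, not something the program requires.

```shell
# Create the output directory first; mkdir -p is harmless if it already exists.
outdir=rp0
mkdir -p "$outdir"

# Only launch the run if RepeatMasker can be found in the PATH.
if command -v RepeatMasker >/dev/null 2>&1; then
    RepeatMasker -species yeast -pa 8 -dir "$outdir" -gff -e ncbi -s W303LYZE.fasta \
        || echo "RepeatMasker exited with an error" >&2
else
    echo "RepeatMasker not found in PATH" >&2
fi
```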

Output

Here is what normal output looks like:

==================================================
file name: S288_maniid.fsa    
sequences:            17
total length:   12157105 bp  (12157105 bp excl N/X-runs)
GC level:         38.15 %
bases masked:     618839 bp ( 5.09 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements          500       401795 bp    3.31 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:       500       401795 bp    3.31 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia         444       377343 bp    3.10 %
     Gypsy/DIRS1        56        24452 bp    0.20 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:           19        50995 bp    0.42 %

Total interspersed repeats:      452790 bp    3.72 %


Small RNA:               6        12034 bp    0.10 %

Satellites:              0            0 bp    0.00 %
Simple repeats:       2981       128642 bp    1.06 %
Low complexity:        536        25466 bp    0.21 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
    
The query species was assumed to be saccharomyces cerevisiae
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
    
run with rmblastn version 2.6.0+

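Note that the running output (the above is only a summary) mentions E. coli a lot, presumably because of the bacterial insertion element check listed under the contamination options in the helpfile, so it's comforting to see the summary confirm that the right species was analysed.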
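
When comparing several runs it can be handy to pull the masked percentage out of a saved copy of this summary (RepeatMasker writes one to the .tbl file). The function below is a minimal sketch: the name masked_percent is made up for illustration, and it assumes the "bases masked" line keeps the layout shown above.

```shell
# Print the percentage found on the "bases masked" line of a summary file.
# Splits the line on parentheses and strips spaces and the percent sign.
masked_percent() {
    awk -F'[()]' '/^bases masked:/ { gsub(/[ %]/, "", $2); print $2 }' "$1"
}

# Example using the header lines of the summary above, saved to a scratch file.
summary=$(mktemp)
cat > "$summary" <<'EOF'
file name: S288_maniid.fsa
sequences:            17
total length:   12157105 bp  (12157105 bp excl N/X-runs)
GC level:         38.15 %
bases masked:     618839 bp ( 5.09 %)
EOF
masked_percent "$summary"
rm -f "$summary"
```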

Result files

You should have the following files as output (names reflect the above experiment; yours will differ):

  • S288_maniid.fsa.cat.gz
  • S288_maniid.fsa.masked
  • S288_maniid.fsa.out
  • S288_maniid.fsa.out.gff (this is gff2 format)
  • S288_maniid.fsa.tbl, a summary table of the run (this is the summary shown above)
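
A quick way to confirm that a run left all of these files behind is sketched below. The function name check_rm_outputs and the rp0/ prefix are assumptions for illustration, and the suffix list simply mirrors the bullets above (the .out.gff file is only expected when -gff was given).

```shell
# Report any of the expected RepeatMasker result files that are missing.
# $1 is the path of the query file as it appears in the output directory.
check_rm_outputs() {
    prefix=$1
    missing=0
    for suffix in cat.gz masked out out.gff tbl; do
        if [ ! -e "$prefix.$suffix" ]; then
            echo "missing: $prefix.$suffix" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

if check_rm_outputs rp0/S288_maniid.fsa; then
    echo "all expected result files are present"
fi
```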

RepeatMasker's helpfile

RepeatMasker version open-4.0.7
No query sequence file indicated

NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
      RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate
    sequence.

    -e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
        Use an alternate search engine to the default.

    -pa(rallel) [number]
        The number of processors to use in parallel (only works for batch
        files or sequences over 50 kb)

    -s  Slow search; 0-5% more sensitive, 2-3 times slower than default

    -q  Quick search; 5-10% less sensitive, 2-5 times faster than default

    -qq Rush job; about 10% less sensitive, 4->10 times faster than default
        (quick searches are fine under most circumstances) repeat options

    -nolow /-low
        Does not mask low_complexity DNA or simple repeats

    -noint /-int
        Only masks low complex/simple repeats (no interspersed repeats)

    -norna
        Does not mask small RNA (pseudo) genes

    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

    -div [number]
        Masks only those repeats < x percent diverged from consensus seq

    -lib [filename]
        Allows use of a custom library (e.g. from another species)

    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)

    -species <query species>
        Specify the species or clade of the input sequence. The species name
        must be a valid NCBI Taxonomy Database species name and be contained
        in the RepeatMasker repeat database. Some examples are:

          -species human
          -species mouse
          -species rattus
          -species "ciona savignyi"
          -species arabidopsis

        Other commonly used species:

        mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
        danio, "ciona intestinalis" drosophila, anopheles, elegans,
        diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

    Contamination options

    -is_only
        Only clips E coli insertion elements out of fasta and .qual files

    -is_clip
        Clips IS elements before analysis (default: IS only reported)

    -no_is
        Skips bacterial insertion element check

    Running options

    -gc [number]
        Use matrices calculated for 'number' percentage background GC level

    -gccalc
        RepeatMasker calculates the GC content even for batch files/small
        seqs

    -frag [number]
        Maximum sequence length masked without fragmenting (default 60000,
        300000 for DeCypher)

    -nocut
        Skips the steps in which repeats are excised

    -noisy
        Prints search engine progress report to screen (defaults to .stderr
        file)

    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats
        ). NOTE: This options should only be used when ProcessRepeats will
        be run manually on the results.

    output options

    -dir [directory name]
        Writes output to this directory (default is query file directory,
        "-dir ." will write to current directory).

    -a(lignments)
        Writes alignments in .align output file

    -inv
        Alignments are presented in the orientation of the repeat (with
        option -a)

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name.
        All other repeats are listed in upper case. Ambiguous fragments
        match multiple repeat elements and can only be called based on
        flanking repeat information.

    -small
        Returns complete .masked sequence in lower case

    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than
        masked

    -x  Returns repetitive regions masked with Xs rather than Ns

    -poly
        Reports simple repeats that may be polymorphic (in file.poly)

    -source
        Includes for each annotation the HSP "evidence". Currently this
        option is only available with the "-html" output format listed
        below.

    -html
        Creates an additional output file in xhtml format.

    -ace
        Creates an additional output file in ACeDB format

    -gff
        Creates an additional Gene Feature Finding format output

    -u  Creates an additional annotation file not processed by
        ProcessRepeats

    -xm Creates an additional output file in cross_match format (for
        parsing)

    -no_id
        Leaves out final column with unique ID for each element (was
        default)

    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
        the query

SEE ALSO
        Crossmatch, ProcessRepeats

COPYRIGHT
    Copyright 2007-2014 Arian Smit, Institute for Systems Biology

AUTHORS
    Arian Smit <asmit@systemsbiology.org>

    Robert Hubley <rhubley@systemsbiology.org>

Installation requirements

  • various Perl modules
  • trf: Tandem Repeats Finder, which only seems necessary for the subprogram RepeatProteinMask
  • One of the following: cross_match, wublast or rmblast. However, cross_match is not recommended (slow C code from 1998). You are best off using the 64-bit rmblast binary.
  • The Repeat library/database from GIRI (http://www.girinst.com).
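
Before starting the installer it may help to see which of these prerequisites are already reachable. The sketch below only looks for executables in the PATH. The names trf and rmblastn are assumptions (local binaries are often named differently, for example with a version suffix), and have_tool is a helper invented for this example.

```shell
# Succeed if the named executable can be found in the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Executable names are assumptions; adjust them to the local installation.
for tool in perl trf rmblastn; do
    if have_tool "$tool"; then
        echo "found:   $tool ($(command -v "$tool"))"
    else
        echo "missing: $tool"
    fi
done
```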

Installation method

  • Unfortunately the installer is interactive, which means it is geared towards a single-computer installation; that is fine on a laptop, but not on a cluster. Tab completion also doesn't work at its prompts, which is annoying.

Typical messages that are output:

  • Building monolithic RM database ...
  • Building RMBlast frozen libraries ...

The recommended choice is RMBlast, but there is an option to include nhmmer and DFAM as well, which may be useful. The installer does not mention that nhmmscan goes along with nhmmer; both are part of HMMER.

Final message:
Add a Search Engine:
  1. CrossMatch: [ Un-configured ]
  2. RMBlast - NCBI Blast with RepeatMasker extensions: [ Configured, Default ]
  3. WUBlast/ABBlast (required by DupMasker): [ Un-configured ]
  4. HMMER3.1 & DFAM: [ Configured ]
  5. Done
 
Enter Selection: 5
 -- Setting perl interpreter...
 
Congratulations!  RepeatMasker is now ready to use.
The program is installed with a full version of the repeat library:
 DFAM Library Version = Dfam_1.2
 RMLibrary Version = 20140131
 Repbase Version = 20140131
Further documentation on the program may be found here:
 /share/apps/src/RepeatMasker-open-4-0-5/repeatmasker.help

Repeat Library (RepBase) Updating

Access (the first time) must be requested from GIRINST


  • EMBL format (59.08 MB) 11-10-2012: "Local: RepBase17.11.embl.tar.gz"
  • FASTA format (28.76 MB) 11-10-2012: "Local: RepBase17.11.fasta.tar.gz"
  • Repeatmasker editions: "Local:repeatmaskerlibraries-20090604.tar.gz (11.27 MB)" and "Local:repeatmaskerlibraries-20120418.tar.gz (26.76 MB)". I think only one of these two needs to be chosen.
  • REPET edition: "Local:RepBase17.11_REPET.embl.tar.gz (28.77 MB)"

Indeed, only one of these files is required: "repeatmaskerlibraries-20120418.tar.gz". It contains the file RepeatMaskerLib.embl, which has the same name and sits in the same directory as the one shipped with RepeatMasker itself, but is much larger.

In any case, RepeatMasker will complain if the RepeatMaskerLib.embl it finds is not the one from GIRI.

And where does this file go? Into the "Libraries" subdirectory of the RepeatMasker source.

To update these databases, go to the GIRI website and log in with the userid ramonf and password u9xyvu.
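
A minimal sketch of dropping the GIRI library into place is given below. It assumes the archive unpacks as Libraries/RepeatMaskerLib.embl (inferred from the note that the file sits in the same directory as the shipped one). The function name install_rm_library, the .orig backup and the example paths are illustrative choices, not part of RepeatMasker.

```shell
# Unpack a GIRI RepeatMasker library archive over a RepeatMasker source tree.
# $1 = library tarball, $2 = top-level RepeatMasker directory.
install_rm_library() {
    tarball=$1
    rm_home=$2
    lib="$rm_home/Libraries/RepeatMaskerLib.embl"
    # Keep a copy of the smaller library that shipped with RepeatMasker.
    if [ -f "$lib" ]; then
        cp -p "$lib" "$lib.orig" || return 1
    fi
    # Assumes the archive members are laid out as Libraries/RepeatMaskerLib.embl.
    tar -xzf "$tarball" -C "$rm_home"
}

# Example call, run only when the tarball is present in the current directory.
if [ -f repeatmaskerlibraries-20120418.tar.gz ]; then
    install_rm_library repeatmaskerlibraries-20120418.tar.gz /opt/src/RepeatMasker-open-3-3-0-p1
fi
```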

Modifications to the main scripts

Permission must first be requested to register on the GIRI website and download the files. The main scripts are the following:

  • RepeatMasker
  • ProcessRepeats
  • DateRepeats

The first line of each must be changed to point to the Perl interpreter.

In addition, RepeatMasker has to be told the exact location of the main executables but, contrary to what the documentation says, these must be set in the file RepeatMaskerConfig.tmpl rather than in the file called RepeatMasker.
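
Changing the interpreter line by hand in three scripts is easy to get wrong, so here is a hedged sketch of doing it in a loop. The helper set_perl_shebang is made up for this example. It assumes each script already starts with a "#!" line and is run from the RepeatMasker source directory.

```shell
# Replace the first line of a script with "#!<interpreter>".
# $1 = script to edit, $2 = full path of the Perl interpreter.
set_perl_shebang() {
    script=$1
    perl_path=$2
    # Refuse to touch files that do not start with an interpreter line.
    head -n 1 "$script" | grep -q '^#!' || return 1
    tmp=$(mktemp) || return 1
    # Write the new first line, then everything from line 2 onwards.
    { printf '#!%s\n' "$perl_path"; tail -n +2 "$script"; } > "$tmp" \
        && cat "$tmp" > "$script"   # cat keeps the script's permissions
    status=$?
    rm -f "$tmp"
    return "$status"
}

perl_path=$(command -v perl)
for script in RepeatMasker ProcessRepeats DateRepeats; do
    if [ -f "$script" ] && [ -n "$perl_path" ]; then
        set_perl_shebang "$script" "$perl_path"
    fi
done
```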

Small test

We just want to make sure that RepeatMasker can be run without throwing errors, as follows. The RepeatMasker executable itself is a Perl script. First, it is a good idea to make sure that RepeatMasker is in the user's PATH. On the fatnode, the path to RepeatMasker is

/opt/src/RepeatMasker-open-3-3-0-p1/
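
One way to put that directory on the PATH for the current session is sketched below. The directory is the fatnode location quoted above; the duplicate check is just a nicety added for this example.

```shell
# Fatnode install location quoted above; adjust for other machines.
rm_dir=/opt/src/RepeatMasker-open-3-3-0-p1

# Prepend the directory only if it is not already in the PATH.
case ":$PATH:" in
    *":$rm_dir:"*) ;;
    *) PATH="$rm_dir:$PATH"; export PATH ;;
esac

# Show where the shell now finds RepeatMasker, if anywhere.
command -v RepeatMasker || echo "RepeatMasker still not found in PATH" >&2
```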

RepeatMasker has several options, but for a quick analysis the -gccalc option can be used. So, if we have the following input file

>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCACTGGCAGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTATTCCGATTAGCACCACATACATCGCTCA

named in.seq, we can run the following:

RepeatMasker -gccalc in.seq

This file will not produce any substantial result, but the aim was only to catch RepeatMasker installation errors and nothing more.
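
The whole smoke test can be sketched as one script. It writes the same in.seq as above and runs the same command. Treating a non-zero exit status as a sign of an installation problem is an assumption, so the messages on screen are still worth reading.

```shell
# Write the small two-sequence test file used above.
cat > in.seq <<'EOF'
>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCACTGGCAGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTATTCCGATTAGCACCACATACATCGCTCA
EOF

# Run the quick check only if RepeatMasker is reachable.
if command -v RepeatMasker >/dev/null 2>&1; then
    if RepeatMasker -gccalc in.seq; then
        echo "RepeatMasker ran without reporting an error"
    else
        echo "RepeatMasker exited with an error; check the installation" >&2
    fi
else
    echo "RepeatMasker not found in PATH" >&2
fi
```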