Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintaining the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases. We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase.
RepbaseSubmitter is a Java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including PubMed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats.
Defragmented output includes a map of repeats present in the query sequence, with options to report the masked query sequence(s), the repeat sequences found in the query, and alignments.
Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. The current version of Repbase is based on a flexible and extensible relational database schema implemented in MySQL. Ongoing large-scale sequencing of eukaryotic genomes has resulted in a rapid increase in the rate at which new transposable elements are discovered. Rather than relying on error-prone automated processing, the philosophy behind Repbase has been to incorporate a significant amount of manual curation into the database.
However, the increasing number of sequences to be annotated and entered led us to develop a standardized submission interface that external users can use to provide information on their sequences, with a minimum of subsequent reformatting being necessary. Repbase is primarily used for screening and annotation of genomic DNA. Censor was the first program for Repbase-based repeat detection and masking, originally released and later published [2].
Its major drawback was an inefficient implementation of the Smith-Waterman algorithm; as a result, the publicly accessible Censor server ran exclusively on specialized Paracel hardware. In the meantime, other programs, notably RepeatMasker [3, 4] and blaster [5], became available. RepeatMasker uses a customized version of the Repbase library that can sometimes differ significantly from the original Repbase submission. Furthermore, Censor can be used to search DNA sequences against a library of proteins, or translated nucleotide sequences.
Manual curation of databases has both advantages and drawbacks compared to automated processing. Automatic annotation has the significant advantages of much higher potential throughput, freedom from user error, and elimination of unintended bias in the processing. On the other hand, it is hard to anticipate every contingency in, for example, correct reconstruction of consensus sequences. A particular problem with automated reconstruction of transposable elements is over-fragmentation, where algorithms do not correctly assemble related parts of an element into a complete consensus.
The principal source of mistakes in manual curation is user error in entering data. All complex data, such as taxonomy, literature references, and transposable element classifications, are potentially problematic, since simple misspellings can render a database entry unretrievable by exact string-based searches. For these reasons, we have chosen to adopt a hybrid approach: keeping the positive aspects of manual curation while attempting to eliminate the most common sources of user-supplied errors by automating the import and annotation of complex but well-defined information, including taxonomic information, referencing, etc.
The purpose of RepbaseSubmitter is to provide an easy-to-use interface that permits flexibility in annotation, while at the same time reducing the scope for mistakes in the manual curation process. The interface is structured around six data entry forms, together with an initialization form for creating a new entry and a final submission form for performing checks and submitting to Repbase. New entries are not directly entered into Repbase, but are submitted to a review database for editorial approval and additional curation.
Censor can also utilize symmetric multiprocessor machines.
Censor offers three sensitivity modes: normal, rough and sensitive, each offering a different balance of sensitivity and speed. Censor automatically determines the type (nucleotide or protein) of input sequences by calculating base composition, and calls the appropriate BLAST program, although this behavior can be overridden as described below. Censor relies on some standard UNIX system commands. If the BLAST installation directory is on the user's path, the configuration script will automatically detect it and assign the corresponding variables. Otherwise, users must manually edit the header of Censor's main script to provide this information.
At all stages of data entry using the submission interface, required fields are indicated by boxes highlighted in red. Although the data entry forms can be accessed in any order, if required information is omitted, the program will not allow the user to proceed until it has been entered. The entry forms of RepbaseSubmitter, and the main information that can be entered through them, are summarized in Table 1.
The Initialization Select form allows the user to begin creation of a new Repbase sequence by loading data from a pre-existing file, or by starting with a completely blank template. After this initial selection, the Summary data entry form is displayed. The primary fields required for creation of a new entry include a Repbase accession number. The format of accession numbers is not fixed, and is user-defined; however, each one must be unique.
This Repbase identifier can be considered analogous to a HUGO gene name, rather than an abstract database entity such as a GenBank accession. There is currently no accepted standard for assigning names to transposable elements.
The Summary form also requires a description of the sequence being submitted. A comments section is also available for a more detailed description of the sequence, and is not limited in scope. Examples of such information might include the number of copies of the sequence in a genome, or the age distribution of transposable elements. Finally, it is possible to specify free-form keywords which provide pertinent information specific to this sequence. Repbase entries can be searched by keyword, so a user may wish to specify information such as characterization of protein coding domains present in the sequence.
The keyword field is also used internally by Repbase to indicate links to corresponding RepeatMasker library entries. The Summary form also notes the IP address of the computer submitting the data to Repbase; this is not user-editable. The Organism entry form ensures that correct taxonomy of entries is maintained, both at the level of species and for classes of repeat element. As the species name is typed, RepbaseSubmitter dynamically searches the NCBI taxonomy tree [6] and lists matching entries.
The species can be selected from the list as soon as the correct one appears, or can be typed fully — the more of the species name that is typed, the narrower the list presented. Once a specific species has been selected, the interface pulls the correct taxonomic classification from the NCBI Taxonomy database, and enters this information in the relevant field.
In addition, this section of the interface facilitates correct classification of transposable entries. The current classification scheme implemented in Repbase is given in Table 2; however, the scheme is transparently extensible as new superfamilies of transposable element are identified. The status of the sequence as an autonomous or non-autonomous element can also be specified at this point. If non-autonomous, the corresponding mobilizing element may be indicated. The Sequence entry form is the simplest, and requires only the sequence data to be input. Sequence data can be cut and pasted into the window.
The base count and composition of the sequence are automatically updated and entered. Sequences can also be complemented if it is determined that the other strand is more appropriate (for example, if it encodes proteins for autonomous elements). Autonomous transposable elements encode proteins such as transposase, reverse transcriptase, endonucleases, etc.
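The automatic bookkeeping described here can be illustrated with a minimal Python sketch. It is not taken from RepbaseSubmitter (which is written in Java); the function names are hypothetical, and "complementing" is assumed to mean taking the reverse complement so that the other strand is read 5' to 3'.

```python
from collections import Counter

# Translation table for complementing bases (N stays N). Illustrative only.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def base_composition(seq):
    """Return per-base counts, a count of other symbols, and the total length."""
    seq = seq.upper()
    counts = Counter(seq)
    summary = {base: counts.get(base, 0) for base in "ACGT"}
    summary["other"] = len(seq) - sum(summary.values())
    summary["length"] = len(seq)
    return summary


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3' (assumed meaning of 'complemented')."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(base_composition("AACGTN"))
    print(reverse_complement("AACGTN"))
```

Recomputing these values from the sequence each time it changes, rather than asking the submitter to type them, removes one class of inconsistency between the annotation and the sequence itself.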
This information is often of interest to researchers using Repbase, and the Proteins interface (shown in the figure) is used to annotate it. Multiple proteins can be specified for the same Repbase entry, and therefore it is necessary to supply a unique Repbase protein identifier. One is generated automatically for each ORF added; users may choose to specify their own identifier, but it must be unique in Repbase, and will be checked at the final stage before submission to the review database. A comment field is associated with each protein entry on a sequence.
Coordinates of coding regions can be entered manually, and the corresponding region will be translated and entered as the coding sequence. However, a useful feature of the Protein annotation form is the ability to predict ORFs.
Upon selecting the "Predict" option on this form, the user is prompted to specify how many ORFs, N, are anticipated. The program will graphically display the N longest ORFs on all strands, along with their corresponding coordinates in the sequence. The user can select an ORF to add to the Repbase entry as a putative protein coding region; in addition, several fragments of ORFs can be merged together as one coding region if it is anticipated that they are part of the same protein.
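A rough sketch of this kind of ORF prediction is given below in Python. It is an illustration under stated assumptions, not RepbaseSubmitter's actual code: an ORF is assumed to be any stop-free run of codons in one of the six reading frames, the standard stop codons are used, and all function names are hypothetical. A helper for truncating a coding region to its first methionine codon is included.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) of stop-free codon runs in one frame (0-based, end exclusive)."""
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos > run_start:
                yield run_start, pos
            run_start = pos + 3
        pos += 3
    if pos > run_start:
        yield run_start, pos


def longest_orfs(seq, n):
    """Return the n longest ORFs as (length, strand, frame, start, end).

    Coordinates are always given on the forward strand.
    """
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if strand == "-":
                    # Map reverse-strand coordinates back to the forward strand.
                    start, end = len(seq) - end, len(seq) - start
                found.append((end - start, strand, frame, start, end))
    found.sort(key=lambda orf: -orf[0])
    return found[:n]


def trim_to_first_met(coding_seq):
    """Truncate a coding sequence so that it starts at its first in-frame ATG."""
    for pos in range(0, len(coding_seq) - 2, 3):
        if coding_seq[pos:pos + 3] == "ATG":
            return coding_seq[pos:]
    return ""


if __name__ == "__main__":
    for orf in longest_orfs("ATGAAACCCGGGTAA", 3):
        print(orf)
```

Reporting coordinates on the forward strand for both strands keeps the output directly usable for annotation on the submitted nucleotide sequence.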
This is generally only recommended if resulting gaps are small. Finally, an option is provided to truncate a specified coding region to the first occurring methionine.
Figure: Protein annotation entry form of RepbaseSubmitter. The protein prediction sub-window is also shown, showing how ORFs can be predicted and merged into a predicted protein for annotation on the nucleotide sequence. The bottom of the main window shows access buttons for each entry form of the program.
RepbaseSubmitter is written in Java, and can run on any system with an installed Java Virtual Machine of version 1. An important feature of Repbase is the ability to supply references to appropriate scientific literature, or to other Repbase entries and other databases. The submission interface facilitates both types of referencing. References to scientific literature can be added manually.
As an alternative, RepbaseSubmitter provides an "Import" option on the References entry form. This allows users to specify partial information such as author names, article title, or journal name, and then search the NCBI PubMed database [7]. A list of matching references is returned, and multiple selections can be made from this list and included in the Repbase entry. In this way, references to literature will correspond exactly to how they appear in PubMed, which can substantially reduce errors due to mistyping of reference information.
In some cases, a particular reference may only apply to part of a sequence. This is often true if the sequence currently being entered is an extension of a previously existing partial Repbase entry, or if the element being annotated combines information that has been reported fragmentarily in multiple locations.
In this case, the user needs to supply the author information manually. If the creation of this Repbase entry represents new work, the user will generally want to supply a title, and submit it to Repbase Reports. Entries already described in another publication should be directed to Repbase Update. Repbase Reports provides a medium for publication of novel transposable elements in an online journal form, so that the work may be referred to in other publications.
Finally, the References form provides an option for "Free Text" references, for those cases which do not correspond to traditional journal references, or links to those databases specifically recognized by Repbase. The Release and Accessions form summarizes the information supplied on the References form, primarily to allow selection of a primary reference for sequences which are consensus sequences. Additionally, it is possible to specify a "creation date" for this Repbase entry (generally the current date), and a "last update", which will be the same as the creation date for a new sequence, but may be different if this is a refinement of a pre-existing Repbase element.
This section is also the appropriate place to specify accession numbers linking to other databases (GenBank, etc.). The last screen of the submission interface is for actual submission to the Repbase review database. The database entry as it will appear in native Repbase EMBL format is displayed, and may be saved to a file. Upon selecting "submit", the entry is checked for correct formatting and basic consistency, such as unique Repbase accession and sequence information, and is then entered into the MySQL database for approval.
Before performing each search, input data is checked and formatted. Censor automatically chops long sequences into smaller fragments to reduce BLAST memory requirements and to facilitate splitting of jobs on multiple processor machines. Base composition is calculated for query and database sequences, and based on the total percentage of ATCGN bases, Censor decides whether each sequence is nucleotide or protein. By default, simple tandem repeats are masked using filter modules prior to similarity searching, to prevent false hits.
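The two pre-processing steps described above, guessing the sequence type from base composition and chopping long sequences into fragments, might look roughly like the following Python sketch. The 0.9 threshold, the fragment size, the optional overlap, and the function names are all assumptions made for illustration; the text does not state the values Censor uses.

```python
def guess_sequence_type(seq, threshold=0.9):
    """Classify a sequence as 'nucleotide' or 'protein' from its composition.

    The threshold is a hypothetical cut-off on the fraction of A, T, C, G, N.
    """
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in "ATCGN" for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


def chop(seq, size, overlap=0):
    """Split seq into (offset, fragment) pairs of at most `size` characters.

    The offset lets hits found in a fragment be mapped back to the full query.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    last_start = max(len(seq) - overlap, 1)
    return [(start, seq[start:start + size]) for start in range(0, last_start, step)]


if __name__ == "__main__":
    print(guess_sequence_type("ACGTNNACGT"))
    print(guess_sequence_type("MKVLAAGIVGLCAKQ"))
    print(chop("ABCDEFGHIJ", 4))
```

Keeping the offset with each fragment is what makes it possible to run fragments as separate jobs and still report hits in the coordinates of the original query.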
Two approaches are available for dealing with simple repeats. One is to mask them with filters before the similarity search; however, this prevents identification of simple repeats in the Censor output. The other approach is to mask them by first BLASTing the query sequence against a library of simple repeats, which is included with the Censor distribution. In this case simple repeats will be reported in the program's output.
Both filtering functions can be disabled if required, but this is not recommended, since it can lead to a significant proportion of false hits between the query sequence and simple repeats that are internal parts of repetitive elements curated in Repbase. However, disabling annotation of simple repeats can lead to a significant decrease in overall processing time.
In the main search phase, Censor uses BLAST to compare the input sequence to annotated repetitive elements in Repbase, or a custom user-supplied library. Both versions have their advantages and disadvantages. The query sequence is scanned against each library of repeats specified using Censor's "-lib" option, in the order in which they are listed. After processing each library, detected repeats are masked out from the query sequence before comparison to the next library.
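The library-by-library procedure can be sketched as a loop in which hits from one library are masked before the next library is searched. In the Python sketch below, the search step is a pluggable function; `exact_search` is a toy stand-in that finds exact substring matches and is not how Censor or BLAST detect similarity. All names here are hypothetical.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace every position covered by an interval with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def exact_search(query, repeats):
    """Toy stand-in for a similarity search: exact substring matches only."""
    found = []
    for name, repeat_seq in repeats.items():
        pos = query.find(repeat_seq)
        while pos != -1:
            found.append((pos, pos + len(repeat_seq), name))
            pos = query.find(repeat_seq, pos + 1)
    return found


def screen_libraries(query, libraries, search=exact_search):
    """Search libraries in the given order, masking hits before the next library."""
    hits = []
    masked = query
    for lib_name, repeats in libraries:
        found = search(masked, repeats)
        hits.extend((lib_name, start, end, name) for start, end, name in found)
        masked = mask_intervals(masked, [(start, end) for start, end, _ in found])
    return hits, masked


if __name__ == "__main__":
    libs = [("lib1", {"R1": "CGCG"}), ("lib2", {"R2": "GCGT", "R3": "ACGT"})]
    print(screen_libraries("AAACGCGTTTACGT", libs))
```

In the example, R2 would match the unmasked query, but it is not reported because the overlapping R1 hit from the first library has already been masked. This is why the order in which libraries are listed affects the result.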
Censor performs post-processing of BLAST output by removing overlaps and defragmenting detected repeats. The program reports the positions of repetitive elements in a repeat map; Figure 2 shows an example. Many methods for evaluating the similarity between two or more homologous sequences exist [10-12]. In the case of transposable elements, even a large indel (insertion or deletion), which corresponds to any uninterrupted alignment gap, can reflect a single evolutionary event (a transpositional insertion or excision) and should therefore affect the similarity value in the same way regardless of its length.
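One way to realize a length-independent treatment of gaps is to count each uninterrupted gap as a single event when computing similarity. The Python sketch below does this for a pairwise alignment given as two gapped strings. The formula, matches divided by aligned columns plus gap events, is an assumption chosen to be consistent with the reasoning above; it is not confirmed to be Censor's exact definition, and the function name is hypothetical.

```python
def gap_event_similarity(aligned_a, aligned_b, gap="-"):
    """Similarity in which each uninterrupted gap counts as one event.

    similarity = matches / (matches + mismatches + gap_events)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_events = 0
    previous = None
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information; ignore it
        if a == gap:
            state = "gap_in_a"
        elif b == gap:
            state = "gap_in_b"
        else:
            state = "aligned"
            if a == b:
                matches += 1
            else:
                mismatches += 1
        # A new gap event starts whenever we enter a gap state from a different state.
        if state != "aligned" and state != previous:
            gap_events += 1
        previous = state
    total = matches + mismatches + gap_events
    return matches / total if total else 0.0


if __name__ == "__main__":
    print(gap_event_similarity("ACGTACGT", "AC----GT"))
    print(gap_event_similarity("ACGTACGT", "ACG-ACGT"))
```

Under this definition a four-base deletion and a one-base deletion each add one to the denominator, so a single large insertion does not dominate the similarity value.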
In addition to this measure, the Censor output incorporates an alternative similarity measure, Pos, that is calculated on the basis of positive scores between aligned base pairs.