Tools

One of our main research interests is developing new web tools and databases for bioinformatics. We believe that this work goes beyond the technical effort: it improves our own data analysis methods and shares with the community some of what is generated in our lab. Below are our available tools:

miRIAD

MicroRNAs (miRNAs) are a class of small (~22 nucleotides) non-coding RNAs that post-transcriptionally regulate gene expression by interacting with target mRNAs. A majority of miRNAs are located within intronic or exonic regions of protein-coding genes (host genes), and increasing evidence suggests a functional relationship between these miRNAs and their host genes. Here, we introduce miRIAD, a web service to facilitate the analysis of genomic and structural features of intragenic miRNAs and their host genes for five species (human, rhesus monkey, mouse, chicken and opossum). miRIAD contains the genomic classification of all miRNAs (inter- and intragenic), as well as the classification of all protein-coding genes into host or non-host genes (depending on whether they contain an intragenic miRNA or not). We collected and processed public data from several sources to provide a clear visualization of relevant knowledge related to intragenic miRNAs, such as host gene function, genomic context, names of and references to intragenic miRNAs, miRNA binding sites, clusters of intragenic miRNAs, miRNA and host gene expression across different tissues, and expression correlation for intragenic miRNAs and their host genes. Protein–protein interaction data are also presented for functional network analysis of host genes. In summary, miRIAD was designed to help the research community explore, in a user-friendly environment, intragenic miRNAs, their host genes and functional annotations with minimal effort, facilitating hypothesis generation and in silico validation.
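The inter-/intragenic and host/non-host classification described above can be illustrated with a minimal sketch. This is not miRIAD's actual implementation: the `Feature` record, the function names, the rule that a miRNA must lie fully inside a gene, the choice to ignore strand, and the example names and coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic interval (hypothetical record layout, 1-based inclusive)."""
    name: str
    chrom: str
    start: int
    end: int


def find_hosts(mirnas, genes):
    """Map each miRNA name to the genes that fully contain it.

    Assumption: a miRNA is intragenic when its interval lies entirely
    within a gene on the same chromosome; strand is ignored here.
    """
    hosts_by_mirna = {}
    for mir in mirnas:
        hosts_by_mirna[mir.name] = [
            gene.name
            for gene in genes
            if gene.chrom == mir.chrom
            and gene.start <= mir.start
            and mir.end <= gene.end
        ]
    return hosts_by_mirna


def classify(mirnas, genes):
    """Return (miRNA classes, gene classes) as two dictionaries."""
    hosts_by_mirna = find_hosts(mirnas, genes)
    mirna_class = {
        name: "intragenic" if hosts else "intergenic"
        for name, hosts in hosts_by_mirna.items()
    }
    host_names = {g for hosts in hosts_by_mirna.values() for g in hosts}
    gene_class = {
        gene.name: "host" if gene.name in host_names else "non-host"
        for gene in genes
    }
    return mirna_class, gene_class


# Made-up example features (not real annotations).
example_genes = [
    Feature("GENE1", "chr1", 100, 500),
    Feature("GENE2", "chr1", 2000, 3000),
]
example_mirnas = [
    Feature("mir-A", "chr1", 150, 170),  # inside GENE1
    Feature("mir-B", "chr1", 900, 920),  # between genes
]
example_mirna_class, example_gene_class = classify(example_mirnas, example_genes)
```

A linear scan over all genes is enough for a small example; a genome-wide analysis would more likely use an interval index so that each miRNA is compared only against nearby genes.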

piRNAdb

Found in several metazoan species, piRNAs play a role not only in the repression of transposable elements, but also in regulating cellular development and differentiation. Despite their importance, piRNAs are not well known and remain underexplored. A comprehensive and easy-to-use piRNA database is still needed and should be an important contribution to making piRNAs widely studied. Here, we present piRNAdb, an integrative and user-friendly database designed for the study of piRNAs. piRNAdb integrates 27,329 piRNAs from four human small RNA-seq datasets, as well as their genomic locations, clustering information and related transposable elements. We also display information about 23,380 genes that are putative targets of piRNAs, such as CNTNAP2, with functions in nervous system development, and SMC5, a DNA repair gene that acts on double-strand breaks and is differentially expressed in cancer. Related to those genes, we found 47,060 ontology terms, data only available on our website. Moreover, our database provides a feedback system to improve the exchange of information and knowledge among those studying piRNAs. The development of new features to facilitate piRNA analysis, data visualization and integration is the major pillar of piRNAdb. We believe that our web tool, by providing a streamlined and well-organized data repository for piRNAs, will be extremely important not only to those already studying piRNAs, but also for making piRNAs accessible to a broad community.

Reboot

Gene expression data have become increasingly common and accessible in recent years with the advent of modern RNA-Seq platforms. Furthermore, several tools are available to process such data. However, most of them require a level of expertise restricted to computational biologists. Consequently, clinicians often make decisions regarding a patient's prognosis based solely on mutation data generated by specialized companies, without considering either their expression levels or associated alternative splicing patterns. This is especially relevant because not all alterations will, in fact, be expressed in a given tissue of a particular patient under a specific space/time condition. This implies that the relationship between current molecular diagnostic approaches and clinical applications is not straightforward. To improve current translational medicine strategies, we present Reboot (REgression and survival tool with a multivariate BOOTstrap approach): a user-friendly web application to perform survival analysis from high-dimensional gene/transcript expression datasets. Reboot innovates by using a multivariate strategy with penalized Cox regression (LASSO method) combined with a bootstrap approach, in addition to statistical tests supporting the findings, which are automatically pre-filtered and plotted. Reboot also has a more powerful command-line version with detailed documentation, allowing even less-experienced users to perform not only modular but also integrative analyses and validations.
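To give a feel for the bootstrap part of this strategy, the sketch below resamples patients with replacement and records how often each gene is selected. It is only an illustration and not Reboot's code: the record layout, the function names and the example values are hypothetical, and the selector is a deliberately simple stand-in (a difference in mean expression between patients with and without an event) where Reboot uses penalized Cox regression.

```python
import random
from collections import Counter


def bootstrap_selection_frequency(records, select_features, n_iterations=100, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected.

    `records` is a list of patient records and `select_features` is any
    callable that returns the feature names it selects from a resample.
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(records)
    for _ in range(n_iterations):
        # Draw n patients with replacement (one bootstrap resample).
        resample = [records[rng.randrange(n)] for _ in range(n)]
        counts.update(set(select_features(resample)))
    return {feature: count / n_iterations for feature, count in counts.items()}


def mean_difference_selector(threshold):
    """Stand-in selector (NOT penalized Cox regression): keep genes whose
    mean expression differs between event and non-event patients by more
    than `threshold`."""
    def select(records):
        with_event = [r for r in records if r["event"]]
        without_event = [r for r in records if not r["event"]]
        if not with_event or not without_event:
            return []  # the comparison is undefined for this resample
        selected = []
        for gene in records[0]["expression"]:
            mean_with = sum(r["expression"][gene] for r in with_event) / len(with_event)
            mean_without = sum(r["expression"][gene] for r in without_event) / len(without_event)
            if abs(mean_with - mean_without) > threshold:
                selected.append(gene)
        return selected
    return select


# Made-up records: geneA separates the two groups, geneB does not.
example_records = (
    [{"event": True, "expression": {"geneA": 10.0, "geneB": 5.0}} for _ in range(10)]
    + [{"event": False, "expression": {"geneA": 1.0, "geneB": 5.0}} for _ in range(10)]
)
example_frequencies = bootstrap_selection_frequency(
    example_records, mean_difference_selector(2.0), n_iterations=100, seed=0
)
```

Genes that are selected in most resamples are more stable candidates than genes selected only occasionally; swapping the stand-in selector for a survival model would leave the resampling loop unchanged.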

RCPedia

A retrocopy is the result of a process in which mRNAs are reverse-transcribed into cDNA and inserted back into a new position in the genome, usually by retroelement machinery. Since retrocopies are based on mature mRNA, they lack many of their parental genes' genetic features, such as introns and regulatory elements. In mammals, most retrocopies have turned into pseudogenes (also known as processed pseudogenes), although some of them may recruit upstream regulatory elements and become functional. RCPedia is a dedicated database providing information about retrocopies (or processed pseudogenes). Currently, RCPedia has cataloged 219,948 retrocopies across 44 eukaryotic species. The database also presents information about retrocopy genomic coordinates, context, expression, species conservation, parental genes, and more.

FREDDIE

Transposable elements (TEs) constitute a significant portion of mammalian genomes, accounting for about 50% of the total DNA. Intragenic TEs are of particular interest as they are co-transcribed with their host genes in pre-mRNA, potentially leading to the formation of novel chimeric transcripts and the exonization of TEs. The abundance of RNA sequencing data currently available offers a unique opportunity to explore transcriptomic variations. However, the capability of existing computational tools remains a significant limitation. Here, we introduce FREDDIE, an innovative algorithm designed to detect the exonization of retrotransposable elements using RNA-seq data. FREDDIE can process short and long RNA sequencing data, assemble and quantify transcripts, evaluate coding potential, and identify protein domains in chimeric transcripts involving exonized TEs and retrocopies.

sideRETRO

Retrocopies, also known as processed pseudogenes, are transposable derived sequences generated by the duplication of protein coding genes through the transposition of their mature messenger RNA by using (in trans) the LINE-1 enzymatic machinery. Retrocopies may be fixed, present in all genomes of a given species (i.e., in all individuals, including the assembly of the species' reference genome) or unfixed (polymorphic, germinal or somatic) in the genomes, here defined as retroCNVs. While fixed retrocopies have received attention from the scientific community, knowledge about retroCNVs remains limited, due to the lack of bioinformatics tools specialized in their identification and annotation in DNA sequencing data. To address this gap, we present sideRETRO, a dedicated computational algorithm, which detects retrocopies absent from the reference genome but present in whole genome and exome sequencing data from other individuals. In addition to identifying retroCNVs, sideRETRO annotates various characteristics associated with these events. It provides information on the genomic coordinates of the insertion, including chromosome, insertion point, and DNA strand. It also determines the genomic context of the event (exonic, intronic, or intergenic), performs genotyping (presence or absence), and provides haplotyping information (homozygous or heterozygous).

Sandy

Next-generation sequencing (NGS) is the leading method for large-scale genome and transcriptome studies. However, the data analysis phase poses challenges in method selection and parameter tuning. Simulated NGS datasets offer a cost-effective solution, providing known true values for standardizing analysis methodologies. Because existing simulation tools have usability limitations, we introduce Sandy, an open-source simulator for generating synthetic reads resembling DNA or RNA NGS data across various platforms. Sandy is user-friendly, computationally efficient, and can mimic real NGS features like sequencing quality and genomic variations. We demonstrate Sandy's utility by addressing key NGS assay design questions, such as the optimal read count for unbiased gene expression analysis and the minimum genome coverage for variant identification. Sandy serves as an invaluable tool for pipeline validation and cost assessment, and is compatible with Linux, macOS, and Windows systems, even on personal computers.
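The idea of simulated reads with known true values can be sketched in a few lines. The example below is a toy illustration and not Sandy's algorithm or interface: the function names, the uniform substitution-error model, the constant quality string and the read naming scheme are all assumptions. It relies on the standard relation coverage = (number of reads × read length) / genome length and on the Phred scale, Q = -10 log10(p).

```python
import math
import random


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed so that reads * read_length / genome_length
    reaches the requested mean coverage."""
    return math.ceil(coverage * genome_length / read_length)


def phred_char(error_rate, max_q=40):
    """Phred+33 quality character for a per-base error probability."""
    if error_rate <= 0:
        q = max_q
    else:
        q = max(0, min(max_q, round(-10 * math.log10(error_rate))))
    return chr(q + 33)


def simulate_reads(reference, read_length, n_reads, error_rate=0.01, seed=0):
    """Return (name, sequence, quality) tuples sampled from `reference`.

    Assumptions: uniform start positions, substitution errors only, and
    the same quality value for every base. The true start position is
    stored in the read name so results can be checked against it.
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)
    quality = phred_char(error_rate) * read_length
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with a different one (substitution error).
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((f"read_{i}_pos_{start}", "".join(bases), quality))
    return reads


# Made-up reference: 30x coverage of 1,000 bp with 100 bp reads.
example_reference = "ACGT" * 250
example_n_reads = reads_for_coverage(len(example_reference), 100, 30)
example_reads = simulate_reads(example_reference, 100, example_n_reads)
```

Because each read records where it came from, the output of an aligner or variant caller run on such data can be compared directly against the truth, which is the property that makes simulated datasets useful for pipeline validation.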