Documentation


Introduction

ExSurv is a web resource for studying the survival contributions of exons across human cancers using RNA-seq data. It is the first web server that provides exon-level survival significance, using the RNA-seq expression datasets and the clinical metadata for four cancer types from The Cancer Genome Atlas (TCGA) project. We pre-calculated the prognostic significance of more than 600,000 annotated Ensembl exons using the survival package in R. The TCGA clinical data, the exon survival p-values and the expression of the significant exons (needed for drawing the survival curves) are stored in a MySQL database.

The backend is written in PHP and R, and the frontend in JavaScript. The backend is responsible for querying the MySQL database upon user input, calling R (with the survival package) to plot the corresponding results, and returning these results to the frontend. The frontend shows the results as a table in which each row corresponds to an exon of the queried gene symbol or Ensembl gene ID. The survival plots can be exported in SVG (scalable vector graphics) format, and the raw data used to generate each plot can be exported in TSV (tab-separated values) format.

Usage

Using the form in the navigation bar, which is available across this website, it is possible to query our database for the survival contributions of exons given a gene symbol or Ensembl gene ID, a p-value threshold (default: 0.05), a cancer type (default: GBM, glioblastoma multiforme) and a classification method (default: Median). When the search completes, the results (if any are available under these conditions) are listed as a table in which each row corresponds to an exon. The columns are exon ID, transcript ID, gene ID, gene symbol, hazard ratio, hazard ratio p-value, hazard ratio q-value, plot and export options. The identifiers (IDs) and the gene symbol are linked to their sources so that users can investigate the results further. The plots can be enlarged by clicking on them and can be exported as SVG. An export option is also provided for downloading the raw dataset per exon. In addition, the entire MySQL database is available as an SQL dump via this link (compressed size ~2.5 GB, uncompressed size ~12.5 GB). Please see Figure S1 in our publication for the database schema.
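The search described above can be pictured as a small filter over result rows. The following is only an illustrative sketch in Python, not the server's PHP/MySQL code: the class name, function name, row keys and the "quartile" method label are all hypothetical, and treating the threshold as inclusive is an assumption. Only the default values and the four cancer types come from this documentation.

```python
from dataclasses import dataclass

# The four TCGA cohorts named in this documentation.
CANCER_TYPES = {"BRCA", "GBM", "KIRP", "LIHC"}
# Hypothetical labels for the two classification methods.
METHODS = {"median", "quartile"}


@dataclass(frozen=True)
class ExonQuery:
    """Search parameters with the defaults stated in the Usage section."""
    gene: str                  # gene symbol or Ensembl gene ID
    p_threshold: float = 0.05
    cancer: str = "GBM"
    method: str = "median"

    def __post_init__(self):
        # Validation only; no field is modified.
        if not self.gene.strip():
            raise ValueError("gene must not be empty")
        if not 0.0 < self.p_threshold <= 1.0:
            raise ValueError("p_threshold must be in (0, 1]")
        if self.cancer not in CANCER_TYPES:
            raise ValueError(f"unknown cancer type: {self.cancer}")
        if self.method not in METHODS:
            raise ValueError(f"unknown classification method: {self.method}")


def filter_rows(rows, query):
    """Return the rows matching the query, most significant first.

    Each row is a dict with the hypothetical keys 'gene_symbol',
    'gene_id', 'cancer', 'method' and 'hr_p_value'.
    """
    wanted = query.gene.strip().upper()
    hits = [
        row for row in rows
        if wanted in (row["gene_symbol"].upper(), row["gene_id"].upper())
        and row["cancer"] == query.cancer
        and row["method"] == query.method
        and row["hr_p_value"] <= query.p_threshold  # inclusive: an assumption
    ]
    return sorted(hits, key=lambda row: row["hr_p_value"])
```

Calling `ExonQuery("TP53")` with no other arguments reproduces the form's defaults (threshold 0.05, GBM, Median). An empty list from `filter_rows` corresponds to the case where no results are available under the chosen conditions.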

Datasets

We downloaded the raw TCGA RNA-seq datasets from the Database of Genotypes and Phenotypes (dbGaP) for BRCA (breast invasive carcinoma), GBM (glioblastoma multiforme), KIRP (kidney renal papillary cell carcinoma) and LIHC (liver hepatocellular carcinoma). BRCA has 1040 samples, GBM has 174 samples, KIRP has 287 samples and LIHC has 368 samples. Please see Table 1 in our publication for more statistics.

Preprocessing

The raw RNA-seq reads are downloaded from dbGaP and aligned against human genome build 38 using HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts). The alignment files are then passed to StringTie for exon expression quantification. Once the expression values are obtained, for each exon we classify the samples into "High" and "Low" expression groups using the "Median" approach: samples with expression above the exon's median are called "High" and those below are called "Low". We also provide a classification based on quartiles, in which the top quartile is called "High" and the bottom quartile is called "Low". Survival analysis is then performed separately for each of the two classifications in R (survival package), using the Kaplan–Meier estimator with the log-rank test as well as Cox proportional hazards regression. Since the significance values from the log-rank test and the Cox model were very similar, we show only the p-values and q-values (obtained with the Benjamini-Hochberg correction) from the Cox proportional hazards analysis. The final result tables are imported into a MySQL database.
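The grouping and multiple-testing steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used for ExSurv. It reads the grouping as a per-exon split of samples, which is an interpretation. The function names are hypothetical, and the handling of values exactly at the median and the quartile convention are assumptions the text does not specify.

```python
from statistics import median, quantiles


def classify_median(expression):
    """Split samples by one exon's median expression.

    expression maps sample ID -> expression value of that exon.
    Samples exactly at the median are left out (an assumption).
    """
    cut = median(expression.values())
    return {
        sample: ("High" if value > cut else "Low")
        for sample, value in expression.items()
        if value != cut
    }


def classify_quartiles(expression):
    """Keep only the top quartile ("High") and bottom quartile ("Low").

    The quartile convention (method="inclusive") is an assumption.
    """
    q1, _, q3 = quantiles(expression.values(), n=4, method="inclusive")
    groups = {}
    for sample, value in expression.items():
        if value >= q3:
            groups[sample] = "High"
        elif value <= q1:
            groups[sample] = "Low"
    return groups


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

The two group assignments feed the survival comparison between "High" and "Low" samples. The q-value function is applied to the full list of per-exon p-values, so its result depends on how many exons are tested together.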

Web server

The web server consists of a frontend and a backend. The frontend interacts with the user through a simple form in which the user enters a gene symbol or an Ensembl gene ID, a p-value threshold, a cancer type and a classification method. It also displays the results after the search and provides export options for them. The backend is responsible for querying the database, running R to generate the survival plots for the query results, and sending the results to the frontend.