RNA-Seq Results Frequently Asked Questions

1. What does each of the result folders delivered to me contain?

Bam: Mapping .bam files

DEG: Differential gene expression analysis results for each comparison

Differential_splice_variant_expression: Differential splice variant expression analysis results for each comparison

Fastq: Raw .fastq files

GO: Gene ontology analysis results for each comparison

Hit-counts: Gene hit counts results

Report: Contains “RNASeq_report.html” which is the master report file

Stats: Detailed mapping statistics

2. What do the columns mean in the mapping statistics table in section 3.5?

Sample ID: Your sample name

Total Reads: Total number of trimmed reads

Total Mapped Reads: Total number of trimmed reads mapped to the reference genome, including multi-mapped reads (reads mapped to more than one location)

% Total Mapped Reads: Number of Total Mapped Reads divided by Total Reads, expressed as a percentage

Unique Mapped Reads: Number of trimmed reads mapped uniquely to only one location in the reference genome. Reads mapped to more than one location are not included in this statistic.

% Unique Mapped Reads: Number of Unique Mapped Reads divided by Total Reads, expressed as a percentage
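
As a quick illustration, the two percentage columns follow directly from the read counts. The sketch below uses made-up numbers and a hypothetical helper name, `mapping_percentages`; it is not part of the delivered pipeline.

```python
def mapping_percentages(total_reads, total_mapped, unique_mapped):
    """Return (% Total Mapped Reads, % Unique Mapped Reads) for one sample."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    pct_total = 100.0 * total_mapped / total_reads
    pct_unique = 100.0 * unique_mapped / total_reads
    return pct_total, pct_unique


# Hypothetical sample: 20 M trimmed reads, 19 M mapped, 17 M mapped uniquely.
pct_total, pct_unique = mapping_percentages(20_000_000, 19_000_000, 17_000_000)
print(f"{pct_total:.1f}% mapped, {pct_unique:.1f}% uniquely mapped")
```

Because multi-mapped reads count toward Total Mapped Reads but not toward Unique Mapped Reads, the second percentage can never exceed the first.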

3. What do the columns mean in the Differential Gene Expression output Excel files?

ID: Gene ID

Log2FoldChange: The Log2 fold change of the normalized mean hit counts. The formula is: Log2(Group 2 mean normalized counts/Group 1 mean normalized counts) = Log2FoldChange

Group 1: First group listed in the testCondition.txt file
Group 2: Second group listed in the testCondition.txt file

pvalue: The Wald test p-value

Padj: The Benjamini-Hochberg adjusted p-value

[Sample Name]: (For each sample) the normalized hit counts for the gene

Gene.name: (If applicable) The gene symbol corresponding to the gene listed in the "ID" column.
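
The Log2FoldChange formula above can be sketched for a single gene as follows. The counts are invented for illustration and `log2_fold_change` is a hypothetical helper. The value in your file comes from the DESeq2 model fit, so it may differ slightly from this direct calculation.

```python
import math


def log2_fold_change(group1_counts, group2_counts):
    """log2(Group 2 mean / Group 1 mean) for one gene's normalized counts."""
    mean1 = sum(group1_counts) / len(group1_counts)
    mean2 = sum(group2_counts) / len(group2_counts)
    if mean1 <= 0 or mean2 <= 0:
        raise ValueError("both group means must be positive")
    return math.log2(mean2 / mean1)


# Hypothetical normalized hit counts for one gene, three replicates per group.
group1 = [100.0, 110.0, 90.0]   # mean 100
group2 = [390.0, 410.0, 400.0]  # mean 400
lfc = log2_fold_change(group1, group2)
print(lfc)  # positive: the gene is higher in Group 2
```

A positive value means higher expression in Group 2, a negative value means higher expression in Group 1, and each unit corresponds to a two-fold change.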

4. Which columns are most important in the Differential Gene Expression output?

Log2FoldChange and Padj. Log2FoldChange quantifies the expression change between the two groups, while Padj indicates whether that change is statistically significant after correcting for multiple testing.
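
To show what the Benjamini-Hochberg adjustment behind Padj does, here is a minimal sketch on a handful of invented p-values. The function name `benjamini_hochberg` is hypothetical, and the values DESeq2 reports may differ because it can filter genes before adjusting.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw_p)
print(padj)
```

An adjusted value is never smaller than the raw p-value it came from, which is why a gene can have a small pvalue but a Padj above your cutoff.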

5. Why and how are the gene hit counts data normalized?

The raw gene hit counts, if used directly for differential gene expression, would lead to many incorrect conclusions because of factors such as differences in read depth between samples and within-group variability. To draw appropriate conclusions, we normalize the gene hit counts within each sample using DESeq2, which divides them by a sample-specific normalization factor that reflects the sample's sequencing depth. Samples with more total gene hit counts will have their values decreased, while samples with fewer total gene hit counts will have their values increased.

Within-group variability is the difference in hit counts for a specific gene between replicate samples. We control for this by shrinking each gene's variability estimate across replicates toward a common variability estimate. This shrinkage stabilizes the estimates, which matters most when there are few replicates.

After removing these sources of noise, the distribution of gene hit counts will be comparable across each sample.
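
The sample-specific normalization factors (DESeq2 calls them size factors) are estimated by a median-of-ratios approach. The sketch below is a simplified pure-Python illustration of that idea on invented counts, not DESeq2's own code; `size_factors` and `normalize` are hypothetical names.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: one row per gene, one raw count per sample in each row.

    Returns one median-of-ratios size factor per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample so logs are defined.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Log of each usable gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(gene[j]) - m for gene, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide every raw count by its sample's size factor."""
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]


# Hypothetical raw counts: sample 2 was sequenced twice as deeply as sample 1.
raw = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # skipped when estimating factors because of the zero
]
factors = size_factors(raw)
normalized = normalize(raw, factors)
print(factors)
print(normalized[0])
```

In this toy case the deeper sample receives a size factor twice as large, so after division the first three genes have matching values in both samples.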

6. How do I interpret the box plots?

The box plots indicate whether normalization had the intended effect. The raw box plot shows the differences in average gene expression values and ranges across samples. The normalized box plot should show highly similar expression values and ranges across samples.

7. How do I interpret the sample distances plot?

This plot indicates which samples have similar expression values for their genes. You can use it to assess reproducibility among replicate samples. In the dendrogram, samples that are closer together are more similar than those that are far apart.

8. How do I interpret the principal component analysis plot?

This plot serves a similar function to the sample distances plot. Samples that cluster together are more similar to each other than to samples in other clusters.

9. How do I interpret the differentially-expressed genes bi-clustering heat map?

This plot serves a similar function to the sample distances plot. It clusters both the samples and the genes for the top 30 differentially expressed genes. Yellow indicates higher relative expression, while blue indicates lower relative expression.

10. How do I interpret the volcano plot?

The volcano plot maps fold changes against p-values and highlights the set of significantly differentially expressed genes. Upregulated significant genes are red, downregulated significant genes are green, and non-significant genes are gray.
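
As a rough sketch of how a gene ends up as a red, green, or gray point, the code below assigns plot coordinates and a color. The cutoffs (`PADJ_CUTOFF`, `LFC_CUTOFF`), the use of -log10(p-value) on the vertical axis, and the function name are assumptions made for illustration; the report's actual settings may differ.

```python
import math

PADJ_CUTOFF = 0.05  # assumed significance cutoff, not taken from the report
LFC_CUTOFF = 1.0    # assumed |log2 fold change| cutoff, not taken from the report


def volcano_point(log2fc, pvalue, padj):
    """Return (x, y, color) for one gene, or None if a value is NA (None)."""
    if pvalue is None or padj is None:
        return None  # genes without a p-value cannot be placed on the plot
    # A p-value reported as 0 would make log10 undefined, so apply a floor.
    y = -math.log10(max(pvalue, 1e-300))
    if padj < PADJ_CUTOFF and log2fc > LFC_CUTOFF:
        color = "red"    # significant and upregulated
    elif padj < PADJ_CUTOFF and log2fc < -LFC_CUTOFF:
        color = "green"  # significant and downregulated
    else:
        color = "gray"   # not significant under these cutoffs
    return log2fc, y, color


# Hypothetical genes: (name, Log2FoldChange, pvalue, Padj).
genes = [
    ("geneA", 2.5, 1e-8, 1e-6),
    ("geneB", -1.8, 1e-4, 0.003),
    ("geneC", 0.2, 0.4, 0.7),
]
for name, lfc, p, padj in genes:
    print(name, volcano_point(lfc, p, padj))
```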

11. What do the columns mean for the Gene Ontology analysis results?

Genes: Matched genes for the corresponding functions

Process_name: Name of the matching GO function

Significant_genes_count: Number of hits in the functional database

Total_genes_group_count: Total number of genes involved in corresponding functions

Percent_significant_genes: Percentage of functional genes covered by user gene list

P-value: Probability of enrichment using Fisher’s exact test

Padj-value: Corrected P-values (False Discovery Rate)
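
To make the enrichment columns concrete, the sketch below computes a one-sided Fisher's exact test p-value from four counts. Whether the pipeline uses a one-sided test, and the gene list size and background size it uses, are not stated here, so those are assumptions, and `enrichment_pvalue` is a hypothetical name.

```python
from math import comb


def enrichment_pvalue(k, K, n, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k: significant genes annotated to the GO term (Significant_genes_count)
    K: all genes annotated to the GO term (Total_genes_group_count)
    n: size of the significant gene list (assumed input)
    N: size of the background gene set (assumed input)
    """
    total = comb(N, n)
    # Probability of seeing k or more annotated genes in a random list of size n.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Toy example: all 3 genes of a 3-gene term appear in a 3-gene list
# drawn from a 10-gene background.
k, K, n, N = 3, 3, 3, 10
p = enrichment_pvalue(k, K, n, N)
percent_significant = 100.0 * k / K  # Percent_significant_genes
print(p, percent_significant)
```

A small p-value means the overlap between your gene list and the GO term is larger than expected by chance under these assumptions.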

12. Which GO columns are most important?

Genes, Process_name, and Padj-value are the most important. Sorting from lowest to highest on the Padj-value column will give you a list of the most statistically significant GO processes.

13. How do I view my Splice Variant Expression analysis results?

Within the differential splice variant expression folders for each comparison, there will be a subfolder named “DEXSeqReport.” Within this report, you can click on the “testForDEU.html” file to open an interactive splice variant expression report. Here, you can navigate to your gene of interest and observe its splicing profile by clicking through the different tabs.

14. Why do you use hit counts instead of TPM/RPKM/FPKM?

DESeq2's normalization of hit counts is a well-established and reliable basis for determining differentially expressed genes. It is becoming more standard to use the normalized hit counts generated by this method than TPM values. For each differential gene expression comparison, we also include the TPM values in case they are needed.

15. Why was my gene of interest not called significantly differentially expressed?

Many factors contribute to whether a gene is called significant, including the number of biological replicates, how well the replicates cluster together, and how highly the gene is expressed. The cutoffs we set for statistically significant, differentially expressed genes are only recommendations. You can apply your own cutoffs to the main Differential_expression_analysis_table.csv file included in your results. This file contains the full list of genes after filtering out those with average hit counts <10 across all samples, since these are considered noise.
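
One way to apply your own cutoffs is sketched below. It assumes the CSV uses the column names described in question 3 (ID, Log2FoldChange, Padj) and uses a small inline table in place of the real file; `significant_genes` and its default cutoffs are hypothetical.

```python
import csv
import io


def significant_genes(rows, padj_cutoff=0.05, min_abs_log2fc=1.0):
    """Return IDs of genes passing both cutoffs; rows with NA are skipped."""
    hits = []
    for row in rows:
        try:
            padj = float(row["Padj"])
            lfc = float(row["Log2FoldChange"])
        except ValueError:
            continue  # "NA" or blank cells cannot be compared with a cutoff
        if padj < padj_cutoff and abs(lfc) >= min_abs_log2fc:
            hits.append(row["ID"])
    return hits


# Invented stand-in for the real table. With the delivered file you would use:
#   with open("Differential_expression_analysis_table.csv", newline="") as fh:
#       rows = list(csv.DictReader(fh))
sample_csv = """ID,Log2FoldChange,pvalue,Padj
geneA,2.3,0.00001,0.0004
geneB,-1.4,0.002,0.03
geneC,0.4,0.2,0.6
geneD,3.1,NA,NA
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))

default_hits = significant_genes(rows)
strict_hits = significant_genes(rows, padj_cutoff=0.01, min_abs_log2fc=2.0)
print(default_hits)
print(strict_hits)
```

Tightening the cutoffs shrinks the list and relaxing them grows it, so it is worth recording which values you used alongside any gene list you report.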

16. How many group comparisons can you perform in the RNA-Seq package?

We can perform as many pair-wise comparisons as you’d like.

17. Will my deliverables change if I have no replicates?

We are unable to perform the differential splice variant analysis if you don’t have at least two replicates per group to compare. This is a requirement set by the package we use to perform this analysis.

18. If an outlier is identified in the PCA or Sample Distances plot, can we remove it and re-run?

Yes, we can remove the outliers and regenerate the differential gene expression analysis results to see if the clustering improves.

19. Why are my results from qRT-PCR different from RNA-Seq?

The results from qRT-PCR will not always correlate with the results from RNA-Seq due to inherent differences in the two workflows. qRT-PCR can be useful to confirm RNA-Seq results but will not always match.

20. Why do some of my p-values have a value of NA? Why do some of them have a value of 0?

Genes containing count outliers in one or more groups, as identified using Cook’s distance, are assigned p-values of NA. A p-value of 0 means the value was too small to be represented at floating-point precision and was rounded down to 0. Such values can be treated as highly significant.
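
If you process the p-value columns yourself, both cases need a little care. The sketch below shows how a very small number becomes 0 in double precision and one possible way to handle NA and 0 before taking a logarithm; the helper names and the choice of floor are assumptions.

```python
import math
import sys


def parse_pvalue(text):
    """Convert a p-value cell to a float, or None when the cell is NA or blank."""
    text = text.strip()
    if text in ("", "NA"):
        return None
    return float(text)


def neg_log10(p, floor=sys.float_info.min):
    """-log10(p), replacing exact zeros with a floor so the log is defined."""
    return -math.log10(max(p, floor))


# A number smaller than any representable double is stored as exactly 0.0.
tiny = float("1e-400")
print(tiny == 0.0)

# Hypothetical cells from a pvalue column.
cells = ["0.03", "NA", "0"]
parsed = [parse_pvalue(c) for c in cells]
scores = [None if p is None else neg_log10(p) for p in parsed]
print(parsed)
print(scores)
```

Keeping NA as a missing value, rather than converting it to 0 or 1, avoids treating outlier-flagged genes as either highly significant or clearly non-significant.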

21. Is the differential gene expression analysis reliable if I don’t have any biological replicates?

We cannot guarantee the results of the differential gene expression analysis if you do not have any biological replicates. We recommend at least 3 biological replicates per group to draw more accurate conclusions from the data. Fewer replicates than this will likely result in more false positives and false negatives.