12 GSEAPreranked

In this script, we will go through the process of preparing the required input for GSEAPreranked²⁰ (which is part of the web-based tool GSEA²¹ ).
Note that in contrast to the remaining tools considered in this work, it is recom- mended for GSEAPreranked to provide the input with the genes ID as HGNC (HUGO) gene symbols.²² The reason for this is described in the supplement of the paper.

Therefore, our starting point is the pre-filtered gene expression data set with the gene IDs in the initial (Ensembl) ID format.

We proceed in the following order:

Conversion of the gene IDs to HGNC (HUGO) symbols and removal of resulting duplicated gene IDs
Differential expression analysis using voom/limma, DESeq2, and edgeR
Generation of the gene ranking from each results table generated in (ii)
Export of gene rankings to text file
Link to information on how to further process the gene rankings in Excel

12.1 Libraries

All necessary packages are available on Bioconductor, and should be installed from there if not already available on your machine.

install.packages("BiocManager")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("tweeDEseqCountData")
BiocManager::install("limma")
BiocManager::install("edgeR")
BiocManager::install("DESeq2")
BiocManager::install("dplyr")

Load libraries:

library(clusterProfiler)
library(org.Hs.eg.db) 
library(tweeDEseqCountData)
library(limma)
library(edgeR)
library(DESeq2)
library(dplyr)

Description of the libraries:

clusterProfiler: To convert Ensembl ID format to HGNC (HUGO) gene symbols
org.Hs.eg.db: To indicate that we work with human gene expression data (must be adapted when working with a different organism)
tweeDEseqCountData: To obtain the conditions of the samples in the gene expression data set
limma: For differential expression analysis (based on which the gene ranking is generated)
edgeR: For differential expression analysis (based on which the gene ranking is generated)
DESeq2: For differential expression analysis (based on which the gene ranking is generated)
dplyr: To unify the relevant columns in the results of differential expression analysis across the three methods

12.2 Load data

We will proceed with the gene expression data set which was pre-filtered using edgeR’s filterByExpr().

load("./data/Results_PreFiltering/expression_data_filterByExpr.Rdata")

Alternatively, you can also load the gene expression data set which was manually pre-filtered (as proposed by DESeq2):

# load("./data/Results_PreFiltering/expression_data_filterDESeq2.Rdata")

For a simplified readability, we will proceed with the pre-filtered gene expression data set under a different name:

expression_data_filtered <- expression_data_filterByExpr

12.3 Step 1: Convert Ensembl IDs to HGNC symbols

(i) Obtain the mapping of initial (Ensembl) gene IDs to required HGNC gene symbols

bitr_EnsToSymb <- bitr(geneID = rownames(expression_data_filtered),
                       fromType = "ENSEMBL",
                       toType = "SYMBOL",
                       OrgDb = org.Hs.eg.db)
#> 'select()' returned 1:many mapping between keys and
#> columns
#> Warning in bitr(geneID =
#> rownames(expression_data_filtered), fromType = "ENSEMBL", :
#> 3.84% of input gene IDs are fail to map...

Function arguments:

geneID: gene IDs to be converted (stored in the rownames of the gene expression data set)
fromType: initial gene ID format that is to be converted
toType: gene ID format to be converted to
OrgDb: annotation data base of organisms from which the gene expression measurements originate. The corresponding argument must be loaded as a library (see above)

Note that the arguments “fromType” and “toType” must be set as one of the following, depending on the given and the required gene ID format:

ACCNUM, ALIAS, Ensembl, EnsemblPROT, EnsemblTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT.

(ii) Concatenate the initial gene expression data set with the mapping from initial (Ensembl) to required HGNC symbols

Merge by row names of expression data set and ENSEMBL ID of gene ID mapping.

expression_data_merged <- merge(x = expression_data_filtered,
                                             y = bitr_EnsToSymb,
                                             by.x = 0, 
                                             by.y = "ENSEMBL",
                                             sort=TRUE)

Description of function arguments:

by.x and by.y: specify by which columns in expression_data_filterByExpr and bitr_EnsToEntr, respectively, both data sets are concatenated:
- by.x = 0: use row names of expression_data_filterByExpr (the row names contain the gene IDs)
- by.y = “ENSEMBL”: use the column that contains the Ensembl gene IDs

Analogous Chapter ‘Gene ID conversion and removal of the resulting duplicated gene IDs’, we work with three separate ways to remove the duplicate gene IDs that result fromvgene ID conversion. We will apply all three of these approaches to removal.

12.4 Step 2: Take a closer look at the duplicated gene IDs

Case 1: single ENSEMBL ID is mapped to multiple HGNC symbols

In this case, we denote the corresponding ENSEMBL IDs as the duplicated IDs

# obtain number of cases in which an ENSEMBL gene ID was converted to several HGNC symbols, i.e. the number of
# times an Ensembl ID appears more than once in the mapping
sum(duplicated(bitr_EnsToSymb$ENSEMBL))
#> [1] 34
# determine all duplicated ENSEMBL gene IDS (i.e. all ENSEMBL gene IDs that were mapped to multiple distinct HGNC gene symbols):
dupl_ensembl <- unique(bitr_EnsToSymb$ENSEMBL[duplicated(bitr_EnsToSymb$ENSEMBL)])
# number of ENSEMBL IDs that have at least one duplicate
length(dupl_ensembl)
#> [1] 34

In the following, we can inspect the conversion pattern for each Ensembl ID that was mapped to multiple distinct HGNC symbols:

# get subset conversion pattern of all duplicated Ensembl IDs 
duplicated_conversion_ens<-bitr_EnsToSymb[bitr_EnsToSymb$ENSEMBL %in% dupl_ensembl,]
# take a look at conversion pattern for some of the duplicated Ensembl gene IDs:
head(duplicated_conversion_ens, n = 6)
#>             ENSEMBL          SYMBOL
#> 303 ENSG00000063587          ZNF275
#> 304 ENSG00000063587    LOC105373378
#> 615 ENSG00000088298           EDEM2
#> 616 ENSG00000088298 MMP24-AS1-EDEM2
#> 683 ENSG00000090857            PDPR
#> 684 ENSG00000090857    LOC124907803

We can see perfectly that these Ensembl ID are each mapped to multiple individual HGNC symbols.

Case 2: multiple ENSEMBL IDs are mapped to the same HGNC symbol

In this case, we denote the corresponding HGNC symbols as the duplicated IDs.

Obtain number of cases in which multiple distinct Ensembl IDs are converted to the same HGNC gene symbol

# in these cases, the corresponding HGNC symbols appear repeatedly in the mapping:
sum(duplicated(bitr_EnsToSymb$SYMBOL))
#> [1] 3

# determine all HGNC gene symbols affected by this duplication
dupl_symbol <- unique(bitr_EnsToSymb$SYMBOL[duplicated(bitr_EnsToSymb$SYMBOL)])

# inspect the first few HGNC symbols affected by a duplication
head(dupl_symbol, n = 10)
#> [1] "H4C15" "OPN3"

# get number of HGNC symbols that have at least one duplicate
length(dupl_symbol)
#> [1] 2

In the following, we want to take a look at the conversion pattern of some of the duplicated HGNC symbols.

# subset the conversion pattern to that of the duplicated HGNC symbols
duplicated_conversion <- bitr_EnsToSymb[bitr_EnsToSymb$SYMBOL %in% dupl_symbol,]

# for visibility: order the conversion pattern by the HGNC symbols
duplicated_conversion <- duplicated_conversion[order(duplicated_conversion$SYMBOL),]

# take a look at conversion pattern of the first few duplicated HGNC symbols:
head(duplicated_conversion, n = 6 )
#>              ENSEMBL SYMBOL
#> 3605 ENSG00000158406  H4C15
#> 5675 ENSG00000197061  H4C15
#> 5736 ENSG00000197837  H4C15
#> 250  ENSG00000054277   OPN3
#> 5875 ENSG00000203668   OPN3

12.5 Step 3: Remove the duplicated gene IDs

For the subsequent analysis (whether differential expression analysis of gene set analysis), we need to deal with the duplicated gene IDs and remove them in a suitable way such that only one unique gene expression measurement among the duplicates remains. There is no recommended way to proceed, i.e. no common approach presented in official scientific puplicatios, but instead several approaches suggested by users in corresponding user platforms. We have observed that the manner of duplicate gene ID removal seems to be chosen at the discretion of the user. We therefore present three approaches to duplicate gene ID removal.

12.5.1 Option 1: Keep the duplicate with the lowest subscript

This is the simplest approach among the three.

1. Remove duplicated HGNC gene symbols

# indicate which HGNC gene symbols are NO duplicate of any other gene ID
ind_nodupl <- !duplicated(expression_data_merged$SYMBOL)

# filter the gene expression data set to the genes that are NO duplicate of any other gene ID 
exprdat_symbol_dupl1 <- expression_data_merged[ind_nodupl,]

2. Remove duplicated ENSEMBL gene IDs

# indicate which Ensembl IDs are NO duplicate of any other gene ID 
ind_nodupl <- !duplicated(exprdat_symbol_dupl1$Row.names)

# filter the gene expression data set to the genes that are NO duplicate of any other gene ID 
exprdat_symbol_dupl1 <- exprdat_symbol_dupl1[ind_nodupl,]

3. Set HGNC gene symbols as rownames

rownames(exprdat_symbol_dupl1) <- exprdat_symbol_dupl1$SYMBOL

#Remove columns containing ENSEMBL and HGNC symbols
exprdat_symbol_dupl1 <- subset(exprdat_symbol_dupl1, select=-c(Row.names,SYMBOL))

# Inspect the dimension of the resulting gene expression data set
dim(exprdat_symbol_dupl1)
#> [1] 6006   69
# we particularly see that the number of columns is now at its initial number

We now want to inspect a small part of the gene expression data set:

exprdat_symbol_dupl1[1:10, 1:10]
#>        NA18486 NA18498 NA18499 NA18501 NA18502 NA18504
#> DPM1        22     105      40      55      67      37
#> SCYL3       22     100     107      53      72      38
#> FIRRM        5      23      10      18      15       8
#> FGR         36      70      41      33      59      29
#> FUCA2       29      79      33      31      29      21
#> NFYA       301     351     344     176     340     170
#> LAS1L       32      22       4       8      15       5
#> ENPP4       52     419     173     137     127      72
#> BAD         48      81      65      19      59      57
#> HS3ST1       8      15       8      18       4      27
#>        NA18505 NA18507 NA18508 NA18510
#> DPM1        88     127      70      43
#> SCYL3       98      69      66      43
#> FIRRM       11      16      18       7
#> FGR         22      71      12      43
#> FUCA2       42      62      41      25
#> NFYA       238     247     226     227
#> LAS1L       15      16      19      15
#> ENPP4      267     247     154      81
#> BAD         44      79      71      79
#> HS3ST1       3      11       5       3

12.5.2 Option 2: Keep the (rounded) mean expression value of all duplicated gene IDs

Here, we remove duplicated HGNC gene symbols (case 2) before removing the duplicated Ensembl gene IDs (case 1). The reason for this is elaborated below.

1. Remove the duplicated HGNC gene symbols

#generate matrix to contain (rounded) mean expression values of all rows that have same HGNC gene symbol
#ncol=ncol(expression_data_filterByExpr)-2 since data set contains 2 columns with IDs at this point
mean_symbol <- matrix(nrow=0, ncol=ncol(expression_data_merged)-2)


# for each duplicated HGNC gene symbol separately, we gather all rows with the corresponding gene expression data and then extract the (rounded) mean expression value of all rows
for(i in 1:length(dupl_symbol)){

# go through each HGNC symbols which occurs multiple times
# determine all rows whose HGNC symbols correspond to currently considered HGNC symbol
counts_dupl <- expression_data_merged[expression_data_merged$SYMBOL %in% unique(dupl_symbol)[i],]

#compute the mean expression value of all rows that contain to the given HGNC gene symbol
dupl_id <- round(colMeans(counts_dupl[,c(2:(ncol(expression_data_merged)-1))]))

# store rounded mean expression value in matrix
# this matrix is extended by a single row of gene expression data which corresponds to the (rounded) mean expression data that corresponds to the given HGNC gene symbol
mean_symbol <- rbind(mean_symbol,dupl_id)

}

#set corresponding HGNC gene symbols as rownames
rownames(mean_symbol) <- unique(dupl_symbol)

# after completing the for-loop, mean_symbol contains the mean expression measurements of each HGNC gene symbol which contains duplicates resulting from gene ID conversion
# We want to take a look at a part of the data set 
mean_symbol[, 1:9]
#>       NA18486 NA18498 NA18499 NA18501 NA18502 NA18504
#> H4C15      27     109      31      50      57      18
#> OPN3       50     182      82     118      60      52
#>       NA18505 NA18507 NA18508
#> H4C15      56      34      52
#> OPN3      102      80      78


# test whether the number of rows in mean_symbol corresponds to the number HGNC symbols
# that occur more than once
# result should be TRUE
nrow(mean_symbol) == length(dupl_symbol)
#> [1] TRUE

# remove all rows from the expression data whose HGNC symbol has at least one duplicate
# intuition: we have just dealt with the corresponding rows and do not want them to be considered in the second step (which deals with case 2
exprdat_symbol_dupl2 <- expression_data_merged[!expression_data_merged$SYMBOL %in% dupl_symbol,]

# test whether the number of rows in resulting data set equals nrow of inital data set minus number of genes with at least one duplicate
nrow(exprdat_symbol_dupl2) == nrow(expression_data_merged)-nrow(duplicated_conversion)
#> [1] TRUE

2. Remove the duplicated ENSEMBL IDs

Caution: Single ENSEMBL IDs that are mapped to multiple HGNC symbol naturally generate identical count data for all corresponding HGNC symbols. It is therefore pointless to compute mean expression values. This is verifiable by looking at data set only containing those ENSEMBL IDs that are mapped by multiple HGNC symbols:

test_dupl_ensembl <- exprdat_symbol_dupl2[exprdat_symbol_dupl2$Row.names %in% dupl_ensembl,]

# take a look at a few entries of the data set:
# note that we additionally include the last column of the data set which corresponds to the HGNC symbols
test_dupl_ensembl[1:6, c(1:5, ncol(test_dupl_ensembl))]
#>           Row.names NA18486 NA18498 NA18499 NA18501
#> 299 ENSG00000063587     143     116     244      76
#> 300 ENSG00000063587     143     116     244      76
#> 610 ENSG00000088298     202     384     368     207
#> 611 ENSG00000088298     202     384     368     207
#> 678 ENSG00000090857      59      41     102      96
#> 679 ENSG00000090857      59      41     102      96
#>              SYMBOL
#> 299          ZNF275
#> 300    LOC105373378
#> 610           EDEM2
#> 611 MMP24-AS1-EDEM2
#> 678            PDPR
#> 679    LOC124907803

We see that those rows that originate from the same Ensembl ID contain identical count data.

We therefore proceed as in option 1 and use the HGNC symbol that occurs first while we remove the rest.

# Keep HGNC symbol that occurs first
exprdat_symbol_dupl2<-exprdat_symbol_dupl2[!duplicated(exprdat_symbol_dupl2$Row.names),]

3. Set HGNC symbol as rownames

rownames(exprdat_symbol_dupl2)<-exprdat_symbol_dupl2$SYMBOL

# Remove the columns containing ENSEMBL and HGNC symbols
exprdat_symbol_dupl2<-subset(exprdat_symbol_dupl2,select= -c(Row.names,SYMBOL))

# Add those rows to the data set that contain mean expression values of duplicate HGNC symbols
exprdat_symbol_dupl2 <- rbind(exprdat_symbol_dupl2,mean_symbol)


#Inspect the dimension of the remaining gene expression data set
dim(exprdat_symbol_dupl2)
#> [1] 6007   69

12.5.3 Option 3: Among the duplicates, keep row with highest overall expression values (i.e highest counts across all samples)

The intuition behind this approach is that the row with highest counts values has highest power of being detected as differentially expressed. As in option 2, this applies only to the duplicates that result from multiple ENSEMBL IDs being mapped to the same HGNC symbol.

1. Remove duplicated HGNC symbols

# Generate a matrix to later contain row with highest count values among ID duplicates this data set is to be filled gradually and with each iteration of the follwing for-loop
highest_count_symbol<-matrix(, nrow=0, ncol=ncol(expression_data_filterByExpr))


# For each duplicated HGNC gene symbol separately, we gather all rows with the corresponding gene expression data and then extract the row with the highest overall magnitude of counts

for(i in 1:length(dupl_symbol)){

# Go through each HGNC symbols which occurs multiple times
# determine all rows whose HGNC symbols correspond to currently considered HGNC symbol
counts_dupl<-expression_data_merged[expression_data_merged$SYMBOL %in% unique(dupl_symbol)[i],]

# Order rows in decreasing manner by their number of read counts across all samples
order_rowsums<-order(rowSums(counts_dupl[,2:(ncol(counts_dupl)-1)]),decreasing=TRUE)
  
# Detect row with highest number of read counts across all samples (i.e. row with rank 1)
dupl_id<-counts_dupl[order_rowsums==1,]
  
#store corresponding expression
highest_count_symbol<-rbind(highest_count_symbol,dupl_id)
  
}


#Remove all initial values with HGNC duplicates from the dataset initial gene expression data set
exprdat_symbol_dupl3<-expression_data_merged[! expression_data_merged$SYMBOL %in% unique(dupl_symbol),]

2. Remove duplicated ENSEMBL IDs

As in option 2, it is pointless to detect the row with highest count values as all rows that correspond to the same ENSEMBL ID as these rows naturally contain identical count data. We therefore remove the duplicate ENSEMBL ID that occurs first (such as in option 1).

# Keep the corresponding ENSEMBL ID that occurs first
exprdat_symbol_dupl3<-exprdat_symbol_dupl3[!duplicated(exprdat_symbol_dupl3$Row.names),]

# Add the gene expression rows of all HGNC gene symbols that were initially duplicated
exprdat_symbol_dupl3<-rbind(exprdat_symbol_dupl3,highest_count_symbol )

3. Set HGNC symbols as rownames

rownames(exprdat_symbol_dupl3)<-exprdat_symbol_dupl3$SYMBOL
#Remove any column that contains information on gene IDs
exprdat_symbol_dupl3<-subset(exprdat_symbol_dupl3, select=-c(Row.names,SYMBOL))

# Inspect the dimension of the resulting gene expression data set 
dim(exprdat_symbol_dupl3)
#> [1] 6007   69
# we see that the sample size (number of columns) is now back at its initial number

12.6 Intermediate step: Choose which converted gene expression data sets to proceed with

In this illustration, we will proceed with only one of the three gene expression data sets that result from the conversion of the gene IDs and the removal of the duplications. Here, we will proceed with the first approach to the removal of duplicated gene IDs. However, you can easily switch to another approach at your discretion.

We will proceed with the gene expression data set stored as object “expression_data_filt_symbol”.

expression_data_filt_symbol <- exprdat_symbol_dupl1

12.7 Step 3: Differential expression analysis

From the corresponding results table of differential expression analysis, we will later transform certain metrics (i.e. columns) into a gene-level statistic for each gene based on which the gene ranking is generated.

We will proceed analogously to Chapter ‘Differential Expression Analysis’, in which we perform differential expression analysis using three established methods, namely

voom/limma
DESeq2
edgeR

While we already have the count data at hand that we will be working with, we additionally need the conditions of the samples. Since we work with the Pickrell data set in these illustrations, we have to load the corresponding data.

# Load pickrell data 
data(pickrell)

# Store the sample conditions in the object sample_conditions 
sample_conditions <- pickrell.eset$gender

12.7.1 Option 1: Differential expression analysis using limma

Note that the code illustrations are based on the following user manual: (https://bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf)

1. Generate the required input object for limma

As described in “Instructions_Differential_Expression_Analysis”, the gene expression data set expression_data_filt_symbol is currently a data frame and needs to be converted to a DGEList object.

# store expression data with corresponding sample conditions in object of class DGEList
y <- DGEList(counts = expression_data_filt_symbol,
             group = sample_conditions)

Required function arguments:

counts: matrix that contains RNA-Seq data (i.e. count data)

Optional function arguments:

group: indicates the condition of each sample/library
- this argument must be specified sooner or later (such as in subsequent functions) so we just specify it at this point.

Note that we leave the remaining arguments in their default state since the corresponding info will be added through the following pipeline of functions.

2. Normalization

The following piece of code generates a normalization factor for each sample. This accounts for sample-specific effects (such as differences in library sizes and effects of compositionality). If not accounted for, these effects prevent a comparison between the samples.

y <- calcNormFactors(y)

Note that this function does not transform the count data, but rather generates a normalization factor for each sample which is incorporated into the subsequent analysis separately.

3. voom transformation

Note that the voom transformation facilitates the use of the subsequent functions that were initially developed for microarray data.

# design matrix (rows correspond to samples, columns indicate which coefficients are to be estimated)
design_matrix <- model.matrix(~sample_conditions)

# voom transformation (transformation of count data to log-cpm values, computation of observation weight for each
# gene based on mean-variance relationship)
y <- voom(y, design = design_matrix)

4. Differential expression analysis

# Fit a linear model for each gene
y <- lmFit(y)

# Calculate the statistics for the assessment of differential expression
y <- eBayes(y)

# Get the results table for each gene whose differential expression was assessed
DE_results_limma <- topTable(y, number =  nrow(y))
#> Removing intercept from test coefficients
# Note that number = nrow(y) ensures that all genes are displayed in results, not just a subset

5. Rename columns in results table

This step is not required for differential expression analysis or for the subsequent use of gene set analysis IN GENERAL.

A renaming of some columns (such as adjusted p-value) is necessary in the context of these illustrations since the different methods for differential expression analysis typically differ in the column names of the results tables. A unification of the column names is required so that we can use the same code to illustrate further conduct of GSA in the subsequent code.

# First, we transform to results table to a data frame so that we see the results table directly when accessing it through the name "res"
DE_results_limma <- as.data.frame(DE_results_limma)

# Rename the column that contains the adjusted p-values: rename padj to p_adj
DE_results_limma <- dplyr::rename(DE_results_limma, p_adj = `adj.P.Val`)

6. Inspect the first 10 genes in the results table

head(DE_results_limma, n = 10)
#>              logFC  AveExpr         t      P.Value
#> RPS4Y1   9.2106424 1.986675 47.198797 1.005487e-55
#> TMSB4Y   5.0807298 0.513264 26.672932 6.011554e-39
#> PNPLA4  -0.9239961 5.333790 -8.777656 5.619670e-13
#> PUDP    -0.8543759 2.163068 -4.414608 3.506569e-05
#> CXorf38 -0.5507332 3.716678 -4.250337 6.330976e-05
#> TXLNG   -0.4042256 5.116432 -4.043565 1.310034e-04
#> G0S2    -1.4700846 5.192623 -3.855239 2.497623e-04
#> JUN     -0.5879335 8.273471 -3.919967 2.004610e-04
#> APAF1    0.4047167 5.968529  3.823276 2.782012e-04
#> IL1A    -1.4856099 3.543969 -3.666732 4.683127e-04
#>                p_adj           B
#> RPS4Y1  6.038958e-52 84.22462726
#> TMSB4Y  1.805270e-35 62.69727996
#> PNPLA4  1.125058e-09 19.21578210
#> PUDP    5.265113e-02  2.14888617
#> CXorf38 7.604769e-02  1.55522518
#> TXLNG   1.311344e-01  0.73590696
#> G0S2    1.856530e-01  0.15458026
#> JUN     1.719955e-01  0.11850631
#> APAF1   1.856530e-01 -0.07239398
#> IL1A    2.623670e-01 -0.24410772

12.7.2 Option 2: Differential expression analysis using DESeq2

The official DESeq2 vignette can be found through the following link:
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

1. Generate the required input object for DESeq2

DESeq2 operates on the format DESeqDataSet which contains information on the count data, the conditions of the samples and the design (for further information see below). Note that for gene expression data sets that must be imported to R, additional steps are necessary before the following code can be run.

Generate a data frame which contains the condition of each sample

Here, the information on the sample conditions is the only column in the data frame. However, if further variables (such as batch effects) are to be controlled for, the corresponding variables must additionally be added to coldata.

# Here, the information on the sample conditions is the only column in the data frame
# However, if further variables (such as batch effects) are to be controlled for, the corresponding variables must additionally be added to coldata

# The names of the samples are stored as the row names of the data frame
# Important: make sure that the order of the conditions in sample_conditions corresponds to the order of the
# samples in expression_data
coldata <- data.frame(sample_conditions,
                      row.names = colnames(expression_data_filt_symbol))

# Rename the column header to "condition"
colnames(coldata) <- "condition"

# Recode the variable "condition" as a factor
# Rename the sample conditions (in this case from "female" and "male") to "untreated" and "treated"
# Note: make sure that the control level in variable condition is coded as the first level (i.e. "untreated")
coldata$condition <- factor(coldata$condition,
                            labels = c("untreated","treated"))

Generate the DESeqDataSet

dds <- DESeqDataSetFromMatrix(countData = expression_data_filt_symbol,
                            colData = coldata,
                            design = ~ condition)

Relevant function arguments:

countData: count data from gene expression data set
colData: data frame that contains information on the samples (see above)
- conditions of the samples (required) and possibly further variables to correct for (such as batch effects)
design: indicates which variables from colData are used for modelling
- more detailed: the argument design is used to estimate the dispersions and the log2 fold changes of the model
- if more than one variable from colData are used in argument design (e.g. a second variable “batch”), the syntax changes to the following formula: design ~ batch + condition
- make sure that the variable of interest (here: variable that represents conditions of the samples) is placed at the end of the formula

2. Differential expression analysis

# 1. Perform default differential expression analysis
dds <- DESeq(dds)
# 2. Generate results table which provides
DE_results_DESeq2 <- results(dds)

Description of the functions:

DESeq(): Estimation normalization factors, estimation of dispersions, fitting of generalized linear model, Wald statistics

results(): Provides base mean across all samples, estimated log2 fold changes, standard errors, test statistics, p-values, adjusted p-values

3. Rename columns in results table

This step is not required for differential expression analysis or for the subsequent use of gene set analysis IN GENERAL.

# First, we transform to results table to a data frame so that we see the results table directly when accessing it through the name "res"
DE_results_DESeq2 <- as.data.frame(DE_results_DESeq2)

# Rename the column that contains the adjusted p-values: rename padj to p_adj
DE_results_DESeq2 <- dplyr::rename(DE_results_DESeq2, p_adj = "padj")

4. Inspect the first 10 genes in the results table

head(DE_results_DESeq2, n = 10)
#>         baseMean log2FoldChange      lfcSE        stat
#> DPM1    62.75664    0.039694123 0.11391433  0.34845593
#> SCYL3   70.48320    0.028962410 0.11169085  0.25930870
#> FIRRM   13.00620    0.165415797 0.15394501  1.07451223
#> FGR     43.59415    0.114144611 0.19558025  0.58362034
#> FUCA2   37.98066   -0.006462503 0.10902544 -0.05927518
#> NFYA   228.56697    0.051476343 0.09634771  0.53427676
#> LAS1L   11.53865    0.025240854 0.18770526  0.13447068
#> ENPP4  175.91248    0.300292682 0.16292585  1.84312480
#> BAD     66.39379   -0.264391459 0.10275324 -2.57307183
#> HS3ST1  10.12757   -0.199051398 0.31889483 -0.62419137
#>            pvalue     p_adj
#> DPM1   0.72749780 0.9221933
#> SCYL3  0.79539707 0.9423095
#> FIRRM  0.28259317 0.7199487
#> FGR    0.55947577 0.8668104
#> FUCA2  0.95273293 0.9859714
#> NFYA   0.59315007 0.8787295
#> LAS1L  0.89303039 0.9706589
#> ENPP4  0.06531079 0.4826632
#> BAD    0.01008003 0.2998836
#> HS3ST1 0.53250191 0.8522833

12.7.3 Option 3: Differential expression analysis using edgeR

The instructions for edgeR are based on the following vignette: (https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)

1. Generate the required input object for edgeR

Store the expression data with corresponding sample conditions in object of class DGEList

y <- DGEList(counts = expression_data_filt_symbol,
             group = sample_conditions)

Required function arguments:

counts: matrix that contains RNA-Seq data (i.e. count data)

Optional function arguments:

group: indicates condition of each sample/library
- This argument must be specified sooner or later (such as in subsequent functions) so we just specify it at this point

Note that we leave the remaining arguments in their default state since the corresponding info will be added through the subsequent pipeline of functions.

2. Normalization

y <- calcNormFactors(y)

Note: this function does not transform the count data, but rather generates a normalization factor for each sample which is incorporated into the subsequent analysis separately.

3. Estimation of the dispersion

Estimate the common and tagwise dispersion which quantify the variation of the true abundance of a given gene between different samples and is required to assess differential expression realistically.

y <- estimateDisp(y)
#> Using classic mode.

4. Differential expression analysis

# Test each gene for differential expression:
DE_results_edgeR <- exactTest(y)

# Extract pre-specified number (n) of genes
DE_results_edgeR <- topTags(DE_results_edgeR, n = nrow(DE_results_edgeR))

# Note: argument n specifies the number of top differentially expressed genes to be displayed in the results
# n = nrow(DE_results) ensures the results of all genes whose differential expression was assessed are displayed

5.Rename columns in results table

This step is not required for differential expression analysis or for the subsequent use of gene set analysis IN GENERAL.

# first, we transform to results table to a data frame so that we see the results
# table directly when accessing it through the name "res"
DE_results_edgeR <- as.data.frame(DE_results_edgeR)

# rename column that contains adjusted p-values: rename padj to p_adj
DE_results_edgeR <- dplyr::rename(DE_results_edgeR, p_adj = "FDR")

6. Inspect the first 10 genes in the results table

head(DE_results_edgeR, n = 10)
#>              logFC   logCPM        PValue         p_adj
#> RPS4Y1  10.1331218 6.155421  0.000000e+00  0.000000e+00
#> TMSB4Y   5.5792120 2.520038 6.284371e-121 1.887197e-117
#> PNPLA4  -0.9446936 5.485702  1.873415e-17  3.750577e-14
#> PUDP    -0.7768043 2.571376  3.072736e-06  4.613713e-03
#> SNAI1   -1.6460183 3.288600  4.708446e-06  5.469265e-03
#> GADD45G -1.5871184 2.680046  5.463801e-06  5.469265e-03
#> THEM5   -1.1256488 2.303307  1.949582e-05  1.672742e-02
#> CTSW     1.5020170 3.366261  2.908599e-05  2.183631e-02
#> TXLNG   -0.4252126 5.208043  3.558631e-05  2.271802e-02
#> TCFL5   -0.6027710 5.756923  3.782554e-05  2.271802e-02

Note that one gene has an adjusted p-value of 0. We will deal with this issue when generating the gene ranking from edgeR.

12.8 Step 4: Generate gene ranking

We will now apply a gene-level statistic to the results table of differential expression analysis to generate a gene ranking which will serve as input to GSEAPreranked.

As we have performed differential expression analysis with - voom/limma - DESeq2 - edgeR, we will also generate three gene rankings, one for each of these three methods.

The formula of the gene-level ranking metric is \[ (-1) * \log_{10}(p\text{-value}) * \text{sign}(\text{log fold change})\]. Note that by “p-value”, we refer to the non-adjusted p-value.

12.8.1 Option 1: Generate gene ranking from limma/voom results

# 1. Subset the gene expression data set to those genes that have a p-value (i.e.
# which have been NOT been excluded from differential expression analysis)

# indicate those genes WITH a p-value
ind_nonNA_pvalue_limma <- !is.na(DE_results_limma$P.Value)

# subset gene expression data set to those genes with a p-value
DE_results_noNA <- DE_results_limma[ind_nonNA_pvalue_limma, ]

# 2. create vector that contains the value of the gene-level ranking metric for each gene
rankvec_limma <- sign(DE_results_noNA$logFC)*(-1)*log10(DE_results_noNA$P.Value)

# 3. assign respective gene ID to each value in the vector
names(rankvec_limma) <- rownames(DE_results_noNA)

# 4. sort the vector in descending order
rankvec_limma <- sort(rankvec_limma, decreasing=TRUE)

Inspect the first 10 genes from the gene ranking

head(rankvec_limma, n = 10)
#>    RPS4Y1    TMSB4Y     APAF1      LPXN     TRAM2   VIPAS39 
#> 54.997623 38.221013  3.555641  3.201891  3.165236  3.072060 
#>    MAP3K1    SLC9A6      PHKB     HCFC2 
#>  3.019964  2.925713  2.875909  2.816500

12.8.2 Option 2: Generate gene ranking from DESeq2 results

1.Shrink estimated log fold change values

Before generating the gene ranking from the DESeq2 results, we need to perform an additional shrinkage that is specific to DESeq2. The intuition behind the shrinkage of log fold change values is as follows: As mentioned in the paper, RNA-Seq data consists (in its rare form) of count data which is inherently heteroscedastic, i.e. the variance of the count data depends on the mean count of the count data. It is observable that ratios between counts are considerably noisier in the lower magnitudes of counts compared to higher magnitudes, i.e.the log fold changes between both conditions are higher if the overall magnitude of counts is lower, independent of the actual extent to which the gene is differentially expressed between both conditions. DESeq2 addresses this issue by shrinking the estimated log fold changes in the direction of 0. The magnitude of shrinkage is higher if the available information for a gene is lower (which may be because of a low magnitude of counts, a high dispersion or few degrees of freedom.) A more detailed description is provided in the DESeq2 paper by Love et al. (2014).

The shrinkage is performed on object dds, which is a result of the function DESeq():

if (!any(installed.packages() %in% "apeglm")) {
  BiocManager::install("apeglm", update = FALSE, ask = FALSE)
}
DE_results_DESeq2_shrink <- lfcShrink(dds,
                                      coef = "condition_treated_vs_untreated",
                                      type="apeglm")

Function arguments:

type: method to perform shrinkage
- we opt for the default “apeglm” but you can choose from two alternative options as well
coef: indicate the coefficients to be shrunk
- we can obtain the right argument from the following function call:

resultsNames(dds)
#> [1] "Intercept"                     
#> [2] "condition_treated_vs_untreated"

This shows us that we can either shrink the intercept or the “condition_treated_vs_untreated”. Since we do not want to shrink the intercept but the log fold changes, we opt for the second option “condition_treated_vs_untreated”.

Finish up the results table:

# transform results table to data frame
DE_results_DESeq2_shrink <- as.data.frame(DE_results_DESeq2_shrink)

# Inspect the first 10 genes from the results table: 
head(DE_results_DESeq2_shrink, n = 10)
#>         baseMean log2FoldChange      lfcSE     pvalue
#> DPM1    62.75664    0.007288891 0.04890425 0.72749780
#> SCYL3   70.48320    0.005422660 0.04847084 0.79539707
#> FIRRM   13.00620    0.018798782 0.05470876 0.28259317
#> FGR     43.59415    0.008038882 0.05231455 0.55947577
#> FUCA2   37.98066   -0.001246862 0.04796503 0.95273293
#> NFYA   228.56697    0.012392873 0.04812269 0.59315007
#> LAS1L   11.53865    0.001891725 0.05141019 0.89303039
#> ENPP4  175.91248    0.035023088 0.06667768 0.06531079
#> BAD     66.39379   -0.153104170 0.13823651 0.01008003
#> HS3ST1  10.12757   -0.005389391 0.05306809 0.53250191
#>             padj
#> DPM1   0.9221933
#> SCYL3  0.9423095
#> FIRRM  0.7199487
#> FGR    0.8668104
#> FUCA2  0.9859714
#> NFYA   0.8787295
#> LAS1L  0.9706589
#> ENPP4  0.4826632
#> BAD    0.2998836
#> HS3ST1 0.8522833

Note that the estimated log fold change values (column log2FoldChange) are smaller (in absolute terms!) compared to the regular DESeq2 results, while the (adjusted and non-adjusted) p-value remain unchanged.
Also, we have not performed shrinkage in Chapter ‘Differential Expression Analysis’ as there, the goal was to get the list of differentially expressed genes which are detected solely based on the adjusted p-value.

2. Generate the gene ranking

# 1. Subset the gene expression data set to those genes that have a p-value (i.e.
# which have been NOT been excluded from differential expression analysis)

# indicate those genes WITH a p-value
ind_nonNA_pvalue <- !is.na(DE_results_DESeq2_shrink$pvalue)

# subset gene expression data set to those genes with a p-value
DE_results_noNA <- DE_results_DESeq2_shrink[ind_nonNA_pvalue, ]

# 2. create vector that contains the value of the gene-level ranking metric for each gene
rankvec_DESeq2 <- sign(DE_results_noNA$log2FoldChange)*(-1)*log10(DE_results_noNA$pvalue)

# 3. assign respective gene ID to each value in the vector
names(rankvec_DESeq2) <- rownames(DE_results_noNA)

# 4. sort the vector in descending order
rankvec_DESeq2 <- sort(rankvec_DESeq2, decreasing=TRUE)

Inspect the first 10 genes from the gene ranking

head(rankvec_DESeq2, n = 10)
#>     RPS4Y1     TMSB4Y       CTSW      APAF1     SLC9A6 
#> 204.393649  63.913902   4.462277   3.748896   3.567843 
#>       LPXN      HCFC2     MAP3K1    VIPAS39      TRAM2 
#>   3.558986   3.341867   3.281130   3.272801   3.175153

12.8.3 Option 3: Generate gene ranking from edgeR results

# 1. Subset the gene expression data set to those genes that have a p-value (i.e. 
# which have been NOT been excluded from differential expression analysis)

# indicate those genes WITH a p-value
ind_nonNA_pvalue <- !is.na(DE_results_edgeR$PValue)

# subset gene expression data set to those genes with a p-value 
DE_results_noNA <- DE_results_edgeR[ind_nonNA_pvalue, ]

# 2. create vector that contains the value of the gene-level ranking metric for each gene 
rankvec_edgeR <- sign(DE_results_noNA$logFC)*(-1)*log10(DE_results_noNA$p_adj)

# 3. assign respective gene ID to each value in the vector 
names(rankvec_edgeR) <- rownames(DE_results_noNA)

# 4. sort the vector in descending order 
rankvec_edgeR <- sort(rankvec_edgeR, decreasing = TRUE)

Inspect the first 10 genes from the gene ranking

head(rankvec_edgeR, n = 10)
#>      RPS4Y1      TMSB4Y        CTSW       APAF1        LPXN 
#>         Inf 116.7241828   1.6608208   1.0328005   1.0016012 
#>      SLC9A6      MAP3K1       TRAM2        ATP8      MICAL2 
#>   0.9854321   0.8638818   0.8499991   0.8447814   0.8258597

Note that gene RPS4Y1 has a ranking value of “Inf” (infinity). The reason behind this is that the gene has an adjusted p-value of \(0\) in the edgeR results of differential expression analysis (as already observed above). We have to solve this issue, however, there is no common default approach. For the illustration purposes here, we replace the value Inf by the second biggest ranking value in the ranking.

# replace value Inf by second biggest value in the ranking 
rankvec_edgeR[rankvec_edgeR == Inf] <- max(rankvec_edgeR[rankvec_edgeR != Inf])

# inspect ranking again
head(rankvec_edgeR, n = 10)
#>      RPS4Y1      TMSB4Y        CTSW       APAF1        LPXN 
#> 116.7241828 116.7241828   1.6608208   1.0328005   1.0016012 
#>      SLC9A6      MAP3K1       TRAM2        ATP8      MICAL2 
#>   0.9854321   0.8638818   0.8499991   0.8447814   0.8258597

12.9 Step 5: Export gene ranking to text file

For the purpose of simplicity, we only export one of the gene rankings to a text file. Here, we export the ranking generated with limma.

# this path indicates that we store the gene ranking as the text file 
# "gene_ranking.txt" in folder "Input_Objects_GSEAPreranked"
path_conditions <- "./data/Input_Objects_GSEAPreranked/gene_ranking.txt"

write.table(x = rankvec_limma,
            file = path_conditions,
            quote = FALSE,
            row.names = TRUE,
            col.names = TRUE)

Function arguments:

x: object to be exported to a text file
file: location (i.e. folder and file) in which the object is to be stored
quote = FALSE ensures that none of the characters (in this case gene and sample identifiers) are surrounded by double quotes
row.names = TRUE ensures that the gene IDs are included in the export
col.names = TRUE ensures that no gene information is in the first row since the web-based tool ignores whatever is in the first row

12.10 Step 6: Further preparation in Excel

For this step, follow the instructions on the very bottom of the following link:
(http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats)

Then, scroll to section “RNK: Ranked list file format (*.rnk)“.

Carlson, Marc. Org.hs.eg.db: Genome Wide Annotation for Human, 2022.

Cunningham, Fiona, James E Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Olanrewaju Austine-Orimoloye, et al. “Ensembl 2022.” Nucleic Acids Research 50, no. D1 (2022): D988–95.

Gonzalez, Juan R, and Mikel Esnaola. tweeDEseqCountData: RNA-Seq Count Data Employed in the Vignette of the tweeDEseq Package, 2022. http://www.creal.cat/jrgonzalez/software.htm.

Huang, Da Wei, Brad T Sherman, and Richard A Lempicki. “Bioinformatics Enrichment Tools: Paths Toward the Comprehensive Functional Analysis of Large Gene Lists.” Nucleic Acids Research 37, no. 1 (2009): 1–13.

———. “Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.” Nature Protocols 4, no. 1 (2009): 44–57.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12, no. 2 (2015): 115–21.

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biology 15, no. 2 (2014): 1–17.

Love, Michael I, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (2014): 550.

Mootha, Vamsi K, Cecilia M Lindgren, Karl-Fredrik Eriksson, et al. “PGC-1\(\alpha\)-Responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes.” Nature Genetics 34, no. 3 (2003): 267–73.

Pickrell, Joseph K, John C Marioni, Athma A Pai, Jacob F Degner, Barbara E Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, and Jonathan K Pritchard. “Understanding Mechanisms Underlying Human Gene Expression Variation with RNA Sequencing.” Nature 464, no. 7289 (2010): 768–72.

Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43, no. 7 (2015): e47–47.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26, no. 1 (2010): 139–40.

Robinson, Mark D, and Alicia Oshlack. “A Scaling Normalization Method for Differential Expression Analysis of RNA-seq Data.” Genome Biology 11, no. 3 (2010): 1–9.

Sayers, Eric W, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, et al. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 50, no. D1 (2022): D20–26.

Seal, Ruth L, Bryony Braschi, Kristian Gray, Tamsin EM Jones, Susan Tweedie, Liora Haim-Vilmovsky, and Elspeth A Bruford. “Genenames. Org: The HGNC Resources in 2023.” Nucleic Acids Research 51, no. D1 (2023): D1003–9.

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102, no. 43 (2005): 15545–50.

Tarca, Adi L, Gaurav Bhatti, and Roberto Romero. “A Comparison of Gene Set Analysis Methods in Terms of Sensitivity, Prioritization and Specificity.” PloS One 8, no. 11 (2013): e79217.

Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, Meijun Chen, Pingfan Guo, Zehan Dai, Tingze Feng, et al. “clusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data.” The Innovation 2, no. 3 (2021): 100141.

Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. “Gene Ontology Analysis for RNA-seq: Accounting for Selection Bias.” Genome Biology 11, no. 2 (2010): 1–12.

11 PADOG