10 clusterProfiler’s ORA

In this script, we will

  1. run clusterProfiler’s ORA tool based on the results table of differential expression analysis for the gene set databases
  • KEGG

  • GO (with default subontology “MF”)

  1. go through all meaningful researchers’ degrees of freedom

Note that clusterProfiler’s ORA accepts the gene IDs in the NCBI (Entrez) ID format, independent of the chosen gene set database.

10.1 Libraries

All necessary packages are available on Bioconductor, and should be installed from there if not already available on your machine.

install.packages("BiocManager")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")

Load Libraries

Description of the packages

  • clusterProfiler: We here use clusterProfiler’s implementation of ORA.

  • org.Hs.eg.db: Provides genome-wide annotation for humans. When working with a different organism, the user has to provide a different package (see Chapter 1 ‘About’). Note that this library is required when running clusterProfiler’s ORA with gene set database GO.

10.2 Load data

load("./data/Results_Differential_Expression_Analysis/DE_results_limma_Entrez.Rdata")
load("./data/Results_Differential_Expression_Analysis/DE_results_DESeq2_Entrez.Rdata")
load("./data/Results_Differential_Expression_Analysis/DE_results_edgeR_Entrez.Rdata")

We arbitrarily set the object DE_results we will work with resulting from DESeq2. However, you can switch around at your discretion.

DE_results <- DE_results_DESeq2_Entrez

# alternatively: 
#DE_results <- DE_results_limma_Entrez
#DE_results <- DE_results_edgeR_Entrez

10.3 Run clusterProfiler’s ORA

10.3.1 step 1: Preparation of required input object

clusterProfiler’s ORA requires as input a list of differentially expressed genes. We extract such list from the results table of differential expression analysis.
For each gene from the results of differential expression analysis, we indicate whether it is differentially expressed (TRUE) or not differentially expressed (FALSE) based on the following two criteria:

  1. it was tested for differential expression, i.e. has a non-missing adjusted p-value AND

  2. it has an adjusted p-value < 0.05

# indicate which gene fulfills BOTH criteria: 
ind_differentially_expressed <- ((!is.na(DE_results$p_adj)) & (DE_results$p_adj<0.05))

# using this indicator, we extract the vector of differentially expressed genes from the results
# of differential expression analysis 
DEG_vec <- rownames(DE_results[ind_differentially_expressed,])

inspect the first few genes from the input list of differentially expressed genes

head(DEG_vec, n = 10)
#>  [1] "8228"   "23136"  "55787"  "10732"  "412"    "219699"
#>  [7] "6615"   "6192"   "8226"   "10912"

10.3.2 step 2: Run clusterProfiler’s ORA

ORA can be run with the common geneset databases KEGG and GO as well as user-defined gene set databases. Here, we focus on the two common gene set databases GO and KEGG.

10.3.2.1 option 1: gene set database GO

Here, we work the the subontology specified by default, namely Molecular Function (“MF”)

ORA_results_GO <- enrichGO(gene = DEG_vec, 
                           OrgDb = org.Hs.eg.db, 
                           ont = "MF") 

Inspect the results table

head(ORA_results_GO , n = 10)
#> [1] ID          Description GeneRatio   BgRatio    
#> [5] pvalue      p.adjust    qvalue      geneID     
#> [9] Count      
#> <0 rows> (or 0-length row.names)

Function arguments:

  • gene: vector of differentially expressed genes

  • OrgDb: annotation package for organism at hand (here: human)

  • ont: subontology (“MF” by default, alternatives: “BP” and “CC”)

10.3.2.2 option 2: gene set database KEGG

ORA_results_KEGG <- enrichKEGG(gene = DEG_vec, 
                               organism = "hsa")
#> Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
#> Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

# inspect results table 
head(ORA_results_KEGG , n = 10)
#> [1] ID          Description GeneRatio   BgRatio    
#> [5] pvalue      p.adjust    qvalue      geneID     
#> [9] Count      
#> <0 rows> (or 0-length row.names)

Function arguments:

  • gene: vector of differentially expressed genes

  • organism: organism from which gene expression measurements are obtained

    • by default, it is set ‘organism = “hsa”’.

    • must be adapted for other organisms (such as organism = “mmu” for mice)

10.3.3 step 4: Interpretation of the results

Description of columns in results table:

  • GeneRatio: number of genes from the input list that are members of the given gene set divided by the number of genes from the input list that are NOT members of the given gene set

  • BgRatio: number of genes from the universe that are members of the gene set divided by the total number of genes in the universe

  • pvalue: p-value of enrichment of the given gene set

  • p.adjust: p-value of enrichment ADJUSTED for multiple testing

  • qvalue: p-value of enrichment ADJUSTED for multiple testing

    • note: p.adjust and qvalue are adjusted using slightly different approaches
  • geneID: list of genes from the input list that are members of the given gene set

  • count: number of genes from the input list that are members of the given gene set

10.4 Researchers’ degrees of freedom

In this part, we will go through all parameters that can be adapted in the GOSeq workflow. It is important to note that the intention behind going through the researchers’ degrees of freedom is to give you an understanding of what you can do to adapt the given (parameter) setting to the research question. It is even more important to keep in mind that the intention behind going through these flexible parameters is NOT to change them in order to help you obtain the most preferable results by systematically changing these parameters since such behaviour would correspond to “cherry-picking”. Any changes in the parameter choice must be documented transparently.

10.4.1 change 1: Change universe

Here, we change the universe to all genes measured in the experiment.

Note that we do not change the universe to the interception between all genes from the experiment and the list of genes annotated to the given gene set database since we found no way to obtain the latter list of genes.
Also note that we want to restrict ourselves to all genes in the experiment that HAVE an adjusted p-value (i.e. whose expression was indeed measured). The intuition is that, e.g. for DESeq2, some genes are filtered out internally and therefore do not have an adjusted p-value. These genes therefore neither be detected as differentially expressed or not differentially expressed so it would not be feasible to include them in the universe.

Step 1: Set up alternative universe

# (i) indicate which genes have an adjusted p-value
ind_adjp <- !is.na(DE_results$p_adj)

# (ii) filter the genes from the experiment to those genes that do have an adjusted p-value 
alternative_universe <- rownames(DE_results)[ind_adjp]

# inspect the first few genes in the universe
head(alternative_universe, n = 10)
#>  [1] "8813"  "57147" "55732" "2268"  "2519"  "4800"  "81887"
#>  [8] "22875" "5893"  "572"

Step 2: Add alternative universe as a parameter to ORA

In both functions enrichGO() and enrichKEGG(), an alternative universe can be specified in argument universe.

(a) gene set database GO: specify parameter universe

ORA_results_GO_universe  <- enrichGO(gene = DEG_vec, 
                                     OrgDb = org.Hs.eg.db, 
                                     universe = alternative_universe)

Inspect the results table:

head(ORA_results_GO_universe, n = 10)
#> [1] ID          Description GeneRatio   BgRatio    
#> [5] pvalue      p.adjust    qvalue      geneID     
#> [9] Count      
#> <0 rows> (or 0-length row.names)

(b) gene set database KEGG: specify parameter universe

ORA_results_KEGG_universe <- enrichKEGG(gene = DEG_vec, 
                                        organism = "hsa",
                                        universe = alternative_universe)

Inspect the results table:

head(ORA_results_KEGG_universe, n = 10)
#> [1] ID          Description GeneRatio   BgRatio    
#> [5] pvalue      p.adjust    qvalue      geneID     
#> [9] Count      
#> <0 rows> (or 0-length row.names)