1 Pre-filtering

In this script, we will learn about two options to exclude all genes that do not have a sufficient number of read counts across all samples. We distinguish between these two approaches since the different methods for differential expression analysis (see Chapter ‘Differential Expression Analysis’) propose different methods for pre-filtering.

option 1: pre-filtering using a function provided by edgeR
option 2: pre-filtering proposed by DESeq2

1.1 Libraries

All necessary packages are available on Bioconductor, and should be installed from there if not already available on your machine.

install.packages("BiocManager")
BiocManager::install("tweeDEseqCountData")
BiocManager::install("edgeR")

Load Libraries

library(tweeDEseqCountData)
library(edgeR)

Description of the libraries:

tweeDESeqCountData: from this library we obtain the gene expression data set we will use for our illustrations.
edgeR: offers a function for pre-filtering we will use below.

1.2 Preparation of RNA-Seq data set used for illustration

For the purpose of simplicity and readability, we store the gene expression measurements and sample conditions from the Pickrell data sets in objects with neutral names, namely in ‘expression_data’and ’sample_conditions’, respectively.

# load pickrell data set 
data(pickrell)

# access and store gene expression measurements
expression_data <- Biobase::exprs(pickrell.eset)

# access and store sample conditions 
sample_conditions <- pickrell.eset$gender

Take a look at a few entries of the gene expression data set:

# inspect the read counts of the first 5 genes in the first 5 samples
expression_data[1:5, 1:5]
#>                 NA18486 NA18498 NA18499 NA18501 NA18502
#> ENSG00000000003       0       0       0       0       0
#> ENSG00000000005       0       0       0       0       0
#> ENSG00000000419      22     105      40      55      67
#> ENSG00000000457      22     100     107      53      72
#> ENSG00000000460       5      23      10      18      15

Take a look at the sample conditions:

sample_conditions
#>  [1] male   male   female male   female male   female male  
#>  [9] female male   female male   female male   female male  
#> [17] female female male   female male   female female male  
#> [25] female male   female female male   female female male  
#> [33] female male   female female female female male   female
#> [41] male   male   female female male   female female male  
#> [49] female female male   female male   male   female female
#> [57] male   female male   female male   female female male  
#> [65] female female female male   female
#> Levels: female male

Inspect the number of genes (rows) and the number of samples (columns) in the gene expression data set:

dim(expression_data)
#> [1] 52580    69

1.3 Pre-Filtering

1.3.1 Option 1: Pre-Filtering using edgeR’s builting function

This approach works with the function filterByExpr() from the package edgeR and is the proposed method for pre-filtering for the methods for differential expression analysis edgeR and voom/limma. This approach operates on the cpm-transformed count data (cpm: counts-per-million) and excludes all genes that do NOT have a certain number of counts-per-million in a certain number of samples. Note that this approach accounts for differences in the library size between the different samples.

step 1: Generate input object required by filterByExpr()

expression_data_filterByExpr <- DGEList(counts = expression_data, 
                                        group = sample_conditions)

Description of function:

DGEList(): object to contain RNA sequencing measurements and additional information

Description of arguments:

counts: matrix of RNA-Seq data
group: vector that contains the condition of each sample

step 2: Filter out lowly expressed genes

The function filterByExpr() creates an indicator which on which genes do and which do not have a sufficient amount of read counts across all samples. Based on this indicator, the gene expression data set is then filtered.

# (i) for each gene, indicate if it fulfils the requirements to be kept for the subsequent analysis 
indicator_keep <- filterByExpr(expression_data_filterByExpr)

# (ii) filter the gene expression data set such that only those genes are kept which fulfill the requirements
expression_data_filterByExpr <- expression_data_filterByExpr[indicator_keep,, keep.lib.sizes = FALSE]

Note that the index keep.lib.sizes = FALSE ensures that the library size of each sample is recalculated after pre-filtering.

step 3: Obtain final pre-filtered gene expression data set

At this point, we transform the gene expression measurements back to a data frame. The reason for this is that some subsequent steps (such as the conversion of gene IDs do not work with the DGEList format).

expression_data_filterByExpr <- as.data.frame(expression_data_filterByExpr$counts)

# inspect the number of genes and the number of samples in the final pre-filtered gene expression data set 
nrow(expression_data_filterByExpr)
#> [1] 6246

Observe that the number of genes has been reduced compared to the original (unfiltered) gene expression data set.

1.3.2 Option 2: Simpler Pre-filtering approach (proposed by DESeq2)

A simpler approach for pre-filtering has been proposed by the method for differential expression analysis DESeq2. In this approach, only those genes are kept for further analysis that have a pre-specified number of counts \(X\) (such as 10) across all samples. A higher value of \(X\) thereby leads to more genes being removed.
Note that DESeq2 proposes a stricter version of pre-filtering in which those genes are kept which have \(X\) number of counts in at least \(Y\) samples.
Note that none of the these two “simpler” approaches to pre-filtering take differences in library size into account.

step 1: Pre-Filtering

# indicate which genes have at least 10 read counts across all samples:
indicator_keep <- rowSums(expression_data) >= 10 

# alternative (and more strict) indicator:
# indicator_keep <- rowSums( expression_data >=10) >= 10

# subset gene expression data set accordingly 
expression_data_filterDESEq2 <- expression_data[indicator_keep,]

step 2: Inspect final pre-filtered gene expression data set

dim(expression_data_filterDESEq2)
#> [1] 10151    69

Note that the number of genes in the gene expression data set has decreased compared to the initial (unfiltered) gene expression data set.

About

2 Gene ID conversion and removal of the resulting duplicated gene IDs