Aggregate peak analysis with Hi-C data

Published: October 29, 2020


Hi-C data analysis is still a relatively new field in genomics. The data itself is large and expensive to generate, which means that datasets, and exploration of the data, are still immature compared to other technologies like RNA-seq. Here, I want to discuss aggregate peak analysis, a commonly-used and poorly-documented analytical technique for verifying identified features in Hi-C data.

The basics of Hi-C data

Hi-C is a molecular technique used to measure the 3D contacts between different segments of DNA within the nucleus. It is often combined with high throughput sequencing to get a measure of contacts across the entire genome. Roughly, it can be visualized like so (from Rao et al., Cell, 2014).

[Figure: Visualization of the Hi-C technique]

The sequencing data comes in the form of read pairs, where one mate corresponds to DNA from one locus and the other mate corresponds to DNA from another locus. For an in-depth discussion of what Hi-C data looks like, see papers describing the pre-processing of the data, such as HiCUP or Hi-C Pro.

Suffice to say, at the end of our pre-processing we end up with what’s called a “contact matrix”, \(M\). Each row and column of the matrix corresponds to a genomic region, and the \((i,j)\)-th element of the contact matrix, \(M_{ij}\), corresponds to the number of read pairs found to connect region \(i\) to region \(j\). These read counts are a proxy for the amount of contact between any two regions.

The most important parameter of a contact matrix is its resolution, i.e. the size of the genomic bins. These are most often fixed-width intervals, such as 40 kbp, or even as small as 1 kbp for extremely deeply-sequenced data. A discussion of what is an appropriate resolution for your matrix given your sequencing depth is a story for another time 1. But suffice to say, if we want to look at the relationship between gene promoters and enhancers, which can be 10s to 100s of kbp apart from each other, we probably want a contact matrix with a resolution of somewhere close to 10 kbp.
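
To make the binning concrete, here is a toy sketch (the function and names are mine, not from any package) of how genomic coordinates map to matrix bins at a 10 kbp resolution:

# toy example: map genomic coordinates to fixed-width bins at 10 kbp resolution
def coord_to_bin(position, resolution=10_000):
    """Map a 1D genomic coordinate to its fixed-width bin index."""
    return position // resolution

promoter_bin = coord_to_bin(1_234_567)  # bin 123
enhancer_bin = coord_to_bin(1_314_567)  # bin 131, ~80 kbp downstream
# the contact frequency between them is then the matrix entry M[123, 131]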

[Figure: Visual representation of contact matrices]

There are a few standardized formats in which this data can be stored.

| Format | Description | Pros | Cons |
| --- | --- | --- | --- |
| COO | A sparse matrix representation format, stored in plain text. | 1. Easy to parse in any language. 2. Human-readable. | 1. Deep sequencing means files dozens of gigabytes large, even when compressed. 2. Storing numbers as characters is inefficient and takes up more space than necessary. 3. No optimization for genomics applications, like chromosomes or sample metadata. |
| hic | An indexed binary format designed to permit fast random access to contact matrix heatmaps. | 1. One of the first file formats for working with Hi-C data; supported by many tools and the 4DNucleome Consortium. 2. Easily visualizable with Juicebox. 3. The data structure allows for fast access of any genomic region, which further improves visualization with Juicebox. | 1. Not human-readable. 2. Creating a .hic file from raw sequencing data requires working with Juicer and its `pre` command 2. |
| cooler | An efficient storage format for high-resolution genomic interaction matrices. | 1. A widely supported format, also adopted by the 4DNucleome Consortium. 2. Built on the HDF5 storage format for efficient storage. 3. Extremely well documented. 4. Natively ships with an intuitive CLI and Python API. | 1. A newer format that many tools published in 2009-2016 do not support. 2. Slow repeated matrix entry retrieval 3. |
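
To ground the COO row of the table: a COO file is just (row, column, value) triples, one per line. A minimal sketch of loading one into a sparse matrix, assuming a tab-separated bin1/bin2/count layout (the exact columns vary between tools, and the file name here is illustrative):

import pandas as pd
from scipy.sparse import coo_matrix

# assuming tab-separated triples: bin1_id, bin2_id, count (layouts vary by tool)
triples = pd.read_csv("matrix.coo.txt", sep="\t", names=["bin1", "bin2", "count"])
n_bins = int(max(triples["bin1"].max(), triples["bin2"].max())) + 1
M = coo_matrix(
    (triples["count"], (triples["bin1"], triples["bin2"])),
    shape=(n_bins, n_bins),
)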

For simplicity, we are going to make use of the Cooler file format. Its associated Python package, along with the cooltools Python package, will give us everything we need to work with Cooler files. Before we dive into the analytical stuff, let’s discuss where aggregate peak analysis is useful, and how we go about it.

The basics of loop calling and aggregate peak analysis

Loop calling is a method to identify pairs of genomic regions that are strongly in contact with each other. This is useful for identifying loops between promoters and enhancers, or cohesin-extruded loops. For a detailed analysis of loop calling techniques, see the Supplementary Materials in Rao et al., Cell, 2014 for the HICCUPS algorithm, and/or other papers such as FitHiC2, Mustache, and SIP.

The main idea of loop calling is analogous to that of “peak calling” from ChIP-seq data. We want to find biologically meaningful interactions in our data, and the way we do it is by identifying small regions of locally-enriched read counts. Peak calling, through tools such as MACS, identifies “peaks” in the 1D read pileups. Loop callers identify “peaks” in 2D data by comparing the observed read counts in the 2D read pileups (a.k.a. the contact matrices) to their local background.
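
To give a flavour of the 2D analogue, here is a deliberately naive pixel-versus-local-background score. This is a sketch of the idea only, not any published caller; real tools like HICCUPS use more careful backgrounds (e.g. donut-shaped neighbourhoods) and proper statistical models:

import numpy as np

def local_enrichment(M, i, j, w=5):
    """Naive 2D peak score: pixel (i, j) over the mean of its
    (2w+1)x(2w+1) neighbourhood, excluding the pixel itself.
    Assumes (i, j) is at least w bins away from the matrix edge."""
    neighbourhood = M[i - w:i + w + 1, j - w:j + w + 1].astype(float)
    neighbourhood[w, w] = np.nan  # mask the centre pixel
    return M[i, j] / np.nanmean(neighbourhood)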

[Figure: Loop calling as 2D peak calling]

Identifying these loops takes computational tools and some operational definitions of what loops look like in Hi-C data. As I have discussed before, operational definitions are not the same as biological definitions, so these identified loops require some validation. Wet lab validation of these loops is ideal, say, for validating whether a loop connects an enhancer to a gene promoter and regulates the expression of that gene. But doing this for the potentially thousands of identified loops becomes nearly impossible without accurate multiplexed experiments 4. So how can we attempt to validate these loop calls?

That’s what aggregate peak analysis (APA) is for. APA is a computational method for collecting loop calls and attempting to validate the loops as a set, not as individual loop calls. It does this by comparing the observed reads at the centre of the loop calls to the surrounding contact matrix. The process can be visualized like so.

[Figure: APA schematic]

How can we accomplish this computationally? There are many ways to do it. As alluded to earlier, because of the size of Hi-C datasets, it’s easy to do this inefficiently and have the calculations take a long time to complete. Below, we’re going to make use of the cooler file format and the cooler and cooltools Python packages to do this.

Code for computing aggregate peak analyses

First, we need to calculate the expected contact matrix from the original matrix of observed contacts. We can do this using the compute-expected subcommand of the cooltools command line interface.

# calculate the "expected" matrix from a cooler file
$ cooltools compute-expected -o path/to/sample.exp.cis.tsv path/to/sample.cool
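
For intuition about what compute-expected produces: in cis, the expected contact frequency is essentially the average of the observed matrix at each genomic distance, i.e. along each diagonal. A naive dense-matrix version of that idea (cooltools additionally handles balancing weights and masked bins):

import numpy as np

def naive_expected(M):
    """Average contact frequency at each genomic distance |i - j|,
    broadcast back into a full matrix."""
    n = M.shape[0]
    # mean of each diagonal, indexed by distance from the main diagonal
    diag_means = np.array([np.nanmean(np.diagonal(M, offset=d)) for d in range(n)])
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return diag_means[dist]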

Next, we’re going to extract contact matrix regions, not just from the observed contacts, but from the observed/expected contact matrix. We can load the data into Python using the cooler, negspy, and pandas packages.

# ==============================================================================
# Environment
# ==============================================================================
from cooler import Cooler       # interacting with Cooler files
from cooltools import snipping  # functions for extracting matrix values
import negspy.coordinates as nc # a package for easily dealing with genome coordinates
import pandas as pd

# resolution for contact matrix
RES = 10000

# +/- number of bps for aggregate peak analysis
SHIFT_SIZE = 300000

# genome coordinates
hg38 = nc.get_chrominfo("hg38")

# making a table of the chromosomes and their sizes
CHROM_SIZES = pd.DataFrame(
    {"size": list(hg38.chrom_lengths.values())},
    index=list(hg38.chrom_lengths.keys()),
)
CHRS = list(CHROM_SIZES.index)

# make a list of (chrom, start, end) tuples for each chromosome (needed later)
supports = [(chrom, 0, CHROM_SIZES.at[chrom, "size"]) for chrom in CHRS]

# ==============================================================================
# Data
# ==============================================================================
# load a BEDPE file containing the loop calls
loops = pd.read_csv(
    "path/to/loops.bedpe",
    sep="\t",
    header=None,
    names=["chr_x", "start_x", "end_x", "chr_y", "start_y", "end_y"],
)

# load the cooler file
obs = Cooler("/path/to/sample.cool")

# load the expected contact matrix from `cooltools compute-expected`
exp = pd.read_csv(
    "/path/to/sample.exp.cis.tsv",
    sep="\t",
    header=[0],
)

With our data loaded, we now want to calculate the observed/expected matrix. We can compute this for the entire genome, but we can save time and memory by only computing it where we need it. We can do this with the snipping submodule from the cooltools package.

# make a set of windows for the first region in each loop pair
windows_x = snipping.make_bin_aligned_windows(
    binsize=RES,                                            # resolution of the contact matrix
    chroms=loops["chr_x"],                                  # loop chromosomes
    centers_bp=(loops["start_x"] + loops["end_x"]) // 2,    # integer division to find centre of the loop
    flank_bp=SHIFT_SIZE,                                    # how far around the centres to extract
)
# repeat the above for the second region in each loop pair
windows_y = snipping.make_bin_aligned_windows(
    binsize=RES,
    chroms=loops["chr_y"],
    centers_bp=(loops["start_y"] + loops["end_y"]) // 2,
    flank_bp=SHIFT_SIZE,
)
# merge the two pairs of windows together
windows_merged = pd.merge(
    left=windows_x,
    right=windows_y,
    left_index=True,
    right_index=True,
    suffixes=(
        "1",
        "2",
    ),  # these suffixes are required, assign_regions doesn't work properly if not '1' and '2'
)
# assign loop-flanking regions to the low/high indices
windows = snipping.assign_regions(windows_merged, supports)
# remove any regions that are out of bounds
windows = windows.dropna()

This gives us our 2D windows (i.e. submatrices of the entire contact matrix) to extract values from. We can extract the observed/expected values directly as follows:

import numpy as np
# a helper object that extracts obs/exp values where we specify
snipper = snipping.ObsExpSnipper(obs, exp)

# extracting the actual values
# this produces an MxMxN tensor, where N is the number of loops and
# M = 2 * SHIFT_SIZE / RES + 1 (= 61 here), the width in bins of each loop-flanking window
# the obs/exp contact matrix around loop i is then stack[:, :, i]
# this step takes a little while to complete
stack = snipping.pileup(windows, snipper.select, snipper.snip)

# calculate the mean of all the flanking matrices, element-wise
# need nanmean here since some elements of the contact matrix may be empty
pileup = np.nanmean(stack, axis=2)

We’re now left with an M-by-M matrix that contains the element-wise mean over all of the loop-flanking matrices. We can compare the centres of the loop calls to the corners. If the flanking region is large enough, we can pick blocks at the corners of the same size as the centre block of the pileup.

For example, if we have a 10 kbp resolution matrix and take a 300 kbp flanking region, the resulting matrix will be 61-by-61. We can then take the 5x5 centre of the matrix with pileup[28:33, 28:33], and the corners as pileup[:5, :5], pileup[:5, -5:], pileup[-5:, :5], and pileup[-5:, -5:].
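
If you would rather not hard-code those indices, they can be derived from the flank size and resolution (a small hypothetical helper, equivalent to the slices above):

# derive the slices above from the analysis parameters instead of hard-coding
mid = SHIFT_SIZE // RES                     # centre bin index: 300000 // 10000 = 30
half = 2                                    # a (2 * half + 1) = 5 bin centre block
centre = slice(mid - half, mid + half + 1)  # 28:33
corner = slice(0, 2 * half + 1)             # :5, for the corner blocks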

If we have a large number of loops, we can use a paired two-sample t-test to compare the centre to one (or more) of the corners. Because we expect that the loops will have greater contact at the centres, we need to make this a one-sided hypothesis test. We can compute this manually, or we can feed the set of loop-flanking matrices to a function like scipy.stats.ttest_rel that will handle all the internal stuff for us.

import scipy.stats as stats

# extract the centres
# using np.array on the list makes the first index the new index for the loop
# i.e. (i, j, k) is the index for bin (i, j) in the loop-flanking matrix for loop k
# but the corresponding index for `centres` is (k, i, j)
centres = np.array([stack[28:33, 28:33, i] for i in range(stack.shape[2])])
# extract one corner block of the same size (rows 0-4, last 5 columns;
# this is the top-right corner as later plotted by matshow)
corners = np.array([stack[:5, -5:, i] for i in range(stack.shape[2])])
# perform hypothesis test
t, p = stats.ttest_rel(
    a=centres.flatten(),
    b=corners.flatten(),
    axis=0,
    nan_policy="omit",
)
# `ttest_rel` and other hypothesis tests in the scipy package only perform two-sided tests
# for the one-sided alternative mean(centre) > mean(corner), halve the two-sided
# p-value when t > 0 (and take the complement otherwise)
p = p / 2 if t > 0 else 1 - p / 2

We can use the pileup matrix to visualize this enrichment with the matshow function from the matplotlib package.

import matplotlib as mpl
mpl.use("agg")
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

ncols = 2
nrows = 1
# create grid specification
gs = GridSpec(nrows=nrows, ncols=ncols, width_ratios=[20] * (ncols - 1) + [1],)
# plotting options
plt.figure(figsize=(5 * (ncols - 1), 4 * nrows))
# make component plots
opts = dict(
    extent=[
        -(SHIFT_SIZE // 1000),  # values in kbp
        (SHIFT_SIZE // 1000),
        -(SHIFT_SIZE // 1000),
        (SHIFT_SIZE // 1000),
    ],
    cmap="bwr",
)
ax = plt.subplot(gs[0, 0])
img = ax.matshow(np.log2(pileup), **opts)
# add x axis labels to bottom-most subplots
ax.set_xlabel("Position")
ax.set_ylabel("Position")
ax.xaxis.set_visible(True)
ax.yaxis.set_visible(True)
ax.xaxis.tick_bottom()

# add colourbar
ax = plt.subplot(gs[:, ncols - 1])
ax.set_xlabel("log2(Obs / Exp)\nContact Frequency")
ax.yaxis.tick_right()
plt.colorbar(img, cax=ax)
plt.savefig("apa.png")
plt.close()

Putting it all together with some example data

We’ll get some example Hi-C data from the 4DNucleome Consortium. There is a Drosophila melanogaster dataset available (accession number 4DNESFOADERB), which we can download. This is a multi-resolution cooler file, i.e. a cooler file whose contacts have been binned at several different matrix resolutions. It won’t make much of a difference to what we’ve described above; we just have to pick a contact matrix at a single resolution before we proceed.
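
If you are unsure which resolutions a given .mcool file contains, the cooler package can list the single-resolution coolers stored inside it (the /resolutions/... group layout shown is what 4DN-style files typically use):

from cooler.fileops import list_coolers

# list every single-resolution cooler stored inside the multi-resolution file
print(list_coolers("4DNFIZ1ZVXC8.mcool"))
# e.g. ['/resolutions/1000', '/resolutions/5000', '/resolutions/10000', ...]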

# call loops with Mustache on the multi-resolution cooler file
# -f: input file, --chromosome: which chromosome,
# -r: resolution of the contact matrix, -o: output file
$ mustache \
    -f 4DNFIZ1ZVXC8.mcool \
    --chromosome chrX \
    -r 10000 \
    -o loops.tsv

# compute the expected matrix (be sure to pick the same resolution as above)
$ cooltools compute-expected -o exp.cis.tsv 4DNFIZ1ZVXC8.mcool::/resolutions/10000

This produces loops.tsv containing the loop calls and exp.cis.tsv with the expected contact frequencies. We can then work with this in a Python script as above.

# ==============================================================================
# Environment
# ==============================================================================
from cooler import Cooler       # interacting with Cooler files
from cooltools import snipping  # functions for extracting matrix values
import negspy.coordinates as nc # a package for easily dealing with genome coordinates
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib as mpl
mpl.use("agg")
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import matplotlib.patches as patches

# resolution for contact matrix
RES = 10000

# +/- number of bps for aggregate peak analysis
SHIFT_SIZE = 300000

# genome coordinates
dm6 = nc.get_chrominfo("dm6")

# making a table of the chromosomes and their sizes
CHROM_SIZES = pd.DataFrame(
    {"size": list(dm6.chrom_lengths.values())},
    index=list(dm6.chrom_lengths.keys()),
)
CHRS = list(CHROM_SIZES.index)

# make a list of (chrom, start, end) tuples for each chromosome (needed later)
supports = [(chrom, 0, CHROM_SIZES.at[chrom, "size"]) for chrom in CHRS]

# ==============================================================================
# Data
# ==============================================================================
# load a BEDPE-like file containing the loop calls
loops = pd.read_csv(
    "loops.tsv",
    sep="\t",
    header=[0],
)

# load the cooler file
obs = Cooler("4DNFIZ1ZVXC8.mcool::/resolutions/" + str(RES))

# load the expected contact matrix from `cooltools compute-expected`
exp = pd.read_csv(
    "exp.cis.tsv",
    sep="\t",
    header=[0],
)

# ==============================================================================
# Analysis
# ==============================================================================
# 1. Convert loop calls into paired windows
# ------------------------------------------------
# make a set of windows for the first region in each loop pair
windows_x = snipping.make_bin_aligned_windows(
    binsize=RES,                                                # resolution of the contact matrix
    chroms=loops["BIN1_CHR"],                                   # loop chromosomes
    centers_bp=(loops["BIN1_START"] + loops["BIN1_END"]) // 2,  # integer division to find centre of the loop
    flank_bp=SHIFT_SIZE,                                        # how far around the centres to extract
)
# repeat the above for the second region in each loop pair
windows_y = snipping.make_bin_aligned_windows(
    binsize=RES,
    chroms=loops["BIN2_CHROMOSOME"],
    centers_bp=(loops["BIN2_START"] + loops["BIN2_END"]) // 2,
    flank_bp=SHIFT_SIZE,
)
# merge the two pairs of windows together
windows_merged = pd.merge(
    left=windows_x,
    right=windows_y,
    left_index=True,
    right_index=True,
    suffixes=(
        "1",
        "2",
    ),  # these suffixes are required, assign_regions doesn't work properly if not '1' and '2'
)
# assign loop-flanking regions to the low/high indices
windows = snipping.assign_regions(windows_merged, supports)
# remove any regions that are out of bounds
windows = windows.dropna()

# 2. Extract submatrices around the loop calls
# ------------------------------------------------
# a helper object that extracts obs/exp values where we specify
snipper = snipping.ObsExpSnipper(obs, exp)

# extracting the actual values
stack = snipping.pileup(windows, snipper.select, snipper.snip)

# calculate the mean of all the flanking matrices, element-wise
# need nanmean here since some elements of the contact matrix may be empty
pileup = np.nanmean(stack, axis=2)

# 3. Perform hypothesis testing
# ------------------------------------------------
# extract the centres
centres = np.array([stack[28:33, 28:33, i] for i in range(stack.shape[2])])
# extract one corner block of the same size (rows 0-4, last 5 columns;
# this is the top-right corner as later plotted by matshow)
corners = np.array([stack[:5, -5:, i] for i in range(stack.shape[2])])
# perform hypothesis test
t, p = stats.ttest_rel(
    a=centres.flatten(),
    b=corners.flatten(),
    axis=0,
    nan_policy="omit",
)
# `ttest_rel` and other hypothesis tests in the scipy package only perform two-sided tests
# for the one-sided alternative mean(centre) > mean(corner), halve the two-sided
# p-value when t > 0 (and take the complement otherwise)
p = p / 2 if t > 0 else 1 - p / 2

# ==============================================================================
# Plots
# ==============================================================================
ncols = 2
nrows = 1
# create grid specification
gs = GridSpec(nrows=nrows, ncols=ncols, width_ratios=[20] * (ncols - 1) + [1],)
# plotting options
plt.figure(figsize=(5 * (ncols - 1), 4 * nrows))
# make component plots
opts = dict(
    extent=[
        -(SHIFT_SIZE // 1000),  # values in kbp
        (SHIFT_SIZE // 1000),
        -(SHIFT_SIZE // 1000),
        (SHIFT_SIZE // 1000),
    ],
    cmap="bwr",
)
ax = plt.subplot(gs[0, 0])
img = ax.matshow(np.log2(pileup), **opts)
# add a rectangle highlighting the 5x5 centre block used in the t-test
centre_half_kbp = 5 * RES // 2000   # half the block width, in kbp (25)
rect = patches.Rectangle(
    (-centre_half_kbp, -centre_half_kbp),  # lower-left corner of the block
    2 * centre_half_kbp,                   # width: 5 bins = 50 kbp
    2 * centre_half_kbp,                   # height: 5 bins = 50 kbp
    linewidth=1,
    edgecolor="black",
    facecolor="none",
)
ax.add_patch(rect)
# add a rectangle highlighting the corner block used in the t-test
# (rows 0-4, last 5 columns of the pileup, i.e. the top-right of the plot)
rect = patches.Rectangle(
    (SHIFT_SIZE // 1000 - 2 * centre_half_kbp, SHIFT_SIZE // 1000 - 2 * centre_half_kbp),
    2 * centre_half_kbp,
    2 * centre_half_kbp,
    linewidth=1,
    edgecolor="black",
    facecolor="none",
)
ax.add_patch(rect)

# add test statistics
plt.text(
    x=0,
    y=-SHIFT_SIZE // 2000,
    s="t = {:.2f}\np = {:.2f}".format(t, p),
    horizontalalignment="center",
)

# add x axis labels to bottom-most subplots
ax.set_xlabel("Position")
ax.set_ylabel("Position")
ax.xaxis.set_visible(True)
ax.yaxis.set_visible(True)
ax.xaxis.tick_bottom()

# add colourbar
ax = plt.subplot(gs[:, ncols - 1])
ax.set_xlabel("log2(Obs / Exp)\nContact Frequency")
ax.yaxis.tick_right()
plt.colorbar(img, cax=ax)
plt.savefig("calc-apa.png")
plt.close()

This calculates everything you need, and produces the following plot:

[Figure: Calculated APA plot from _Drosophila_ loops on chrX]

Conclusions

This gives the results of APA on a set of loops from a single sample. We can extend the above to incorporate multiple samples, or to look at CTCF binding sites, TAD boundaries, or any other paired features of Hi-C contact matrices.

I avoided discussing some of the inefficient ways of calculating this, such as repeated calls to a Cooler object’s .matrix() accessor, or dumping the matrices into plain text sparse matrices. These are equivalent ways of calculating the APA for contact matrices, and may be a bit conceptually simpler for a few loci. But when you need to consider thousands of loci or a dozen samples, this process is tedious and takes a lot of time. The approach above is simple, uses well-supported Python packages, and can be parallelized or done incrementally with intermediate data dumps using the pickle package.
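
For example, a minimal checkpointing pattern with pickle (the cache path is illustrative), so that the slow pileup step only ever has to run once:

import os
import pickle

STACK_CACHE = "stack.pkl"  # illustrative cache path

if os.path.exists(STACK_CACHE):
    # reuse the previously computed stack
    with open(STACK_CACHE, "rb") as f:
        stack = pickle.load(f)
else:
    # the slow step: extract all loop-flanking submatrices
    stack = snipping.pileup(windows, snipper.select, snipper.snip)
    with open(STACK_CACHE, "wb") as f:
        pickle.dump(stack, f)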

References & Footnotes

  1. In Rao et al., Cell, 2014, they give the following guidelines for an appropriate resolution: “We define the ‘matrix resolution’ of a Hi-C map as the locus size used to construct a particular contact matrix and the ‘map resolution’ as the smallest locus size such that 80% of loci have at least 1,000 contacts. The map resolution is meant to reflect the finest scale at which one can reliably discern local features”. 

  2. In my personal experience, I had an extremely difficult time working with Juicer to process my Hi-C data. It has a non-trivial setup, much of its processing requires GPUs, which are not commonly accessible for individuals or computational clusters, and the code for getting it to work felt very hacky, using custom scripts written in about 6 different languages. Others that I know personally have had success with it, but it required making assumptions about data locations, admin access on the cluster, and other assumptions that were incompatible with my environment. The cooler format and its associated tools, on the other hand, gave me almost all of the benefits of Juicer with few of the impositions on my workflow. 

  3. It is extremely easy to perform operations on an entire Cooler file or to extract data from individual genomic regions. However, extracting multiple loci from the same, or different, Cooler files can be slow. On average, extracting a 10x10 region took ~1 s of computational time. This is an effect of the out-of-core operations and I/O retrieval that Cooler uses to stay space- and memory-efficient. That is fine if you’re only looking at a few regions, but when doing certain tasks with thousands of regions, like the matrix stacking in aggregate peak analysis, it’s easy for simple operations to take hours to complete. 

  4. One way to accomplish this, biologically, may be with the help of massively parallel reporter assays (MPRAs). These are, however, technically difficult to get right in the wet lab and statistically difficult to work with. See this paper for some methods for dealing with MPRA data, and this paper from my colleagues that did this with mutations and cis-regulatory elements in prostate cancer.