Creating a custom genome annotation for HiGlass

Published: May 15, 2020   |   Read time:

Tagged:

Genomic visualization software is tricky for a variety of reasons, and there are lots of tools available for different tasks. Very few, tools, however, have enough depth to their use cases to be useful for multiple scenarios. Fewer still, are designed for working with Hi-C data, a type of genome sequencing data built around the idea of three-dimensional self-interactions of DNA molecules.

One tool that does reasonably well for a lot of cases, and is particularly well-tailored for working with Hi-C data, is called HiGlass. HiGlass uses a mixture of mapping software, a Django/Flask backend, and efficient genome sequencing databases to render genomic positions and data at those positions. It can be deployed on a server, run as a Docker image, or locally through a Jupyter notebook. And it has a key feature that is important for working with sequencing data: you can work with it both programmatically and interactively. Because of these features, and the fact that I’m working on a project involving Hi-C data, I’ve come to use this tool frequently.

HiGlass architecture -80%

HiGlass hosts a bunch of pre-built datasets that you can load from their servers and play around with. But many bioinformaticians will fall into one of the following categories:

  • have some work on a system that is isolated from the internet
  • work in a less-well-studied organism than humans and have to build a lot of materials themselves
  • are testing slight tweaks to a previously existing database, and want to see what those differences look like

For these cases, it’s useful or necessary to create your own data about a reference genome. In this post, I want to use the case example of building your own reference genome annotation and displaying it in HiGlass. This is a conceptually simple task that does take a good amount of fiddling around with tables. But by the end of this guide, you should be able to:

  1. download reference genome data for your genome of interest
  2. use that reference annotation to create a database that HiGlass can display
  3. display your reference genome and play around in it.

Getting a reference genome annotation

For simplicity, I’m going to build the human hg38 genome1. The important features that we want to display in any genome are the gene names, their genomic positions, their strand orientation, and the exon locations. HiGlass gets all this information from a headerless table, where each row is a single gene, with the following columns:

# Description
1 Chromosome
2 Start position
3 End position
4 Gene symbol
5 Score
6 Strand
7 Union ID
8 Gene ID
9 Gene type
10 Gene description
11 CDS start
12 CDS end
13 Comma-separated list of exons starts
14 Comma-separated list of exons ends

Most of these columns are pretty straightforward, but some may not be obvious at first.

  • Column 5 isn’t important for this example, but, one can use this as an importance column. This will help HiGlass determine what gene symbols to display when you’re zoomed out and the HiGlass view gets crowded.
  • Column 7 is something that we’ll get to later, as it’ll help remove some non-unique annotations.
  • We don’t need to worry about columns 9 and 10, unless we want our annotation to be overflowing with information. If we’re just displaying gene locations, we don’t need to spend time filling these in, we can just leave them blank.
  • Columns 13 and 14 can be tricky to compute, depending on your input data format.

We can arrive at a file like this with little effort by starting with the RefSeq annotation2. Because the trickiest thing about getting the above table is likely the comma-separated lists of exon positions, we can look to a genomics resource like UCSC’s annotations for pre-built databases3. For the hg38 genome, we can download refGene.txt.gz.

refGene.txt.gz has these columns:

# Column Name Required for HiGlass
1 bin  
2 name
3 chromosome
4 strand
5 txStart
6 txEnd
7 cdsStart
8 cdsEnd
9 exonCount  
10 exon starts
11 exon ends
12 score  
13 gene ID
14 cds start status  
15 cds end status  
16 exon frames  

These 10 columns will provide us with all the data we need for the HiGlass table.

Reshaping the reference annotation to work with HiGlass

We can re-order the columns so they fit the HiGlass spec with some simple awk code.

zcat refGene.txt.gz | awk '{FS=OFS="\t"}{print $3, $5, $6, $13, ".", $4, $2, $2, ".", ".", $7, $8, $10, $11}' | sort -k1,1 -V -k2,2n > refGene.sorted.tsv

If you have a number of scaffold contigs in the reference genome that you don’t want to include, you can filter them out in the awk command with a regex. For example, many human reference genome scaffolds contain a _ in their chromosome name. We can filter those out with awk '{if (!($3 ~ /_/)) print ...}'.

refGene.sorted.tsv contains the information HiGlass needs, and it looks something like this:

> head refGene.sorted.tsv
chr1	11873	14409	DDX11L1	.	+	NR_046018	NR_046018	.	.	14409	14409	11873,12612,13220,	12227,12721,14409,
chr1	14361	29370	WASH7P	.	-	NR_024540	NR_024540	.	.	29370	29370	14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320,	14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370,
chr1	17368	17436	MIR6859-1	.	-	NR_106918	NR_106918	.	.	17436	17436	17368,	17436,
chr1	17368	17436	MIR6859-2	.	-	NR_107062	NR_107062	.	.	17436	17436	17368,	17436,
chr1	17368	17436	MIR6859-3	.	-	NR_107063	NR_107063	.	.	17436	17436	17368,	17436,
chr1	17368	17436	MIR6859-4	.	-	NR_128720	NR_128720	.	.	17436	17436	17368,	17436,
chr1	30365	30503	MIR1302-10	.	+	NR_036267	NR_036267	.	.	30503	30503	30365,	30503,
chr1	30365	30503	MIR1302-11	.	+	NR_036268	NR_036268	.	.	30503	30503	30365,	30503,
chr1	30365	30503	MIR1302-2	.	+	NR_036051	NR_036051	.	.	30503	30503	30365,	30503,
chr1	30365	30503	MIR1302-9	.	+	NR_036266	NR_036266	.	.	30503	30503	30365,	30503,

In refGene.sorted.tsv, there may be some genes with the same gene ID. Here’s an example:

chr5	179522456	179653239	C5orf60	.	+	NM_001142306	NM_001142306	.	.	179522482	179652444	179522456,179523416,179524019,179651574,179652293,179652904,	179522704,179523474,179524082,179651914,179652457,179653239,
chr5	179641543	179645046	C5orf60	.	-	NM_001142306	NM_001142306	.	.	179642338	179645020	179641543,179642325,179642868,179643429,179644028,179644798,	179641878,179642489,179643210,179643490,179644086,179645046,

You can see in this case that one version of this gene is transcribed from the + strand, and the other from the - strand. To merge these together, we can borrow some code from the HiGlass developers.

import collections as col

class GeneInfo:
    def __init__(self):
        pass


def merge_gene_info(gene_infos, gene_info):
    """
    Add a new gene_info. If it's txStart and txEnd overlap with a previous entry for this
    gene, combine them.
    """
    merged = False

    for existing_gene_info in gene_infos[gene_info.geneId]:
        if (
            existing_gene_info.chrName == gene_info.chrName
            and existing_gene_info.txEnd > gene_info.txStart
            and gene_info.txEnd > existing_gene_info.txStart
        ):

            # overlapping genes, merge the exons of the second into the first
            existing_gene_info.txStart = min(
                existing_gene_info.txStart, gene_info.txStart
            )
            existing_gene_info.txEnd = max(existing_gene_info.txEnd, gene_info.txEnd)

            for (exon_start, exon_end) in gene_info.exonUnions:
                existing_gene_info.exonUnions.add((exon_start, exon_end))

            merged = True

    if not merged:
        gene_infos[gene_info.geneId].append(gene_info)

    return gene_infos

fh_in = open("refGene.sorted.tsv", "r")
fh_out = open("gene-annotations.bed", "w")

# initialize a list that will contain GeneInfo objects
gene_infos = col.defaultdict(list)

# parse the input file
for line in fh_in:
    words = line.strip().split("\t")
    gene_info = GeneInfo()

    try:
        gene_info.chrName = words[0]
        gene_info.txStart = words[1]
        gene_info.txEnd = words[2]
        gene_info.geneName = words[3]
        gene_info.score = words[4]
        gene_info.strand = words[5]
        gene_info.refseqId = words[6]
        gene_info.geneId = words[7]
        gene_info.geneType = words[8]
        gene_info.geneDesc = words[9]
        gene_info.cdsStart = words[10]
        gene_info.cdsEnd = words[11]
        gene_info.exonStarts = words[12]
        gene_info.exonEnds = words[13]
    except:
        print("ERROR: line:", line, file=sys.stderr)
        continue

    # for some reason, exon starts and ends have trailing commas, remove them
    gene_info.exonStartParts = gene_info.exonStarts.strip(",").split(",")
    gene_info.exonEndParts = gene_info.exonEnds.strip(",").split(",")
    # get set of exon start-end tuples
    gene_info.exonUnions = set(
        [
            (int(s), int(e))
            for (s, e) in zip(gene_info.exonStartParts, gene_info.exonEndParts)
        ]
    )

    # add this gene info by checking whether it overlaps with any existing ones
    gene_infos = merge_gene_info(gene_infos, gene_info)

# close input file handle
fh_in.close()

# write each to the output file
for gene_id in gene_infos:
    for contig in gene_infos[gene_id]:
        output = "\t".join(
            map(
                str,
                [
                    contig.chrName,
                    contig.txStart,
                    contig.txEnd,
                    contig.geneName,
                    contig.score,
                    contig.strand,
                    "union_" + gene_id,
                    gene_id,
                    contig.geneType,
                    contig.geneDesc,
                    contig.cdsStart,
                    contig.cdsEnd,
                    ",".join([str(e[0]) for e in sorted(contig.exonUnions)]),
                    ",".join([str(e[1]) for e in sorted(contig.exonUnions)]),
                ],
            )
        )
        fh_out.write(output + "\n")

# close output file handle
fh_out.close()

That will take the C5orf60 gene from before, and merge it into this new line:

# before merging
chr5	179522456	179653239	C5orf60	.	+	NM_001142306	NM_001142306	.	.	179522482	179652444	179522456,179523416,179524019,179651574,179652293,179652904,	179522704,179523474,179524082,179651914,179652457,179653239,
chr5	179641543	179645046	C5orf60	.	-	NM_001142306	NM_001142306	.	.	179642338	179645020	179641543,179642325,179642868,179643429,179644028,179644798,	179641878,179642489,179643210,179643490,179644086,179645046,

# after merging
chr5	179522456	179653239	C5orf60	.	+	union_NM_001142306	NM_001142306	.	.	179522482	179652444	179522456,179523416,179524019,179641543,179642325,179642868,179643429,179644028,179644798,179651574,179652293,179652904	179522704,179523474,179524082,179641878,179642489,179643210,179643490,179644086,179645046,179651914,179652457,179653239

Now, our output file gene-annotations.bed contains all the information we need. Finally, we use clodius to aggregate this BED-like file into a database that’s readable by HiGlass. To do this, we’ll need to sort out gene annotations, and download a genome reference file that contains all the chromosomes in the reference, as well as their sizes. For the hg38 genome, for example, you can download this file from UCSC.

# sort by position, prior to aggregating
sort -k1,1 -V -k2,2n gene-annotations.bed > gene-annotations.sorted.bed

# download chromosome sizes
curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
# remove chromosomes with "_" in their name, like earlier, and sort them by name instead of size
grep -V "_" hg38.chrom.sizes | sort -k1,1 > hg38.chrom.sizes.sorted.tsv

# aggregate with clodius
clodius aggregate bedfile \
    --max-per-tile 10 \
    --chromsizes-filename hg38.chrom.sizes.sorted.tsv \
    --output-file gene-annotations.beddb \
    --delimiter $'\t' \
    gene-annotations.sorted.bed

Finally, our new file, gene-annotations.beddb, is the new reference annotation we’ve needed!

Installing HiGlass

For simplicity’s sake, we’re going to run HiGlass in a Jupyter notebook. We can install all the dependencies needed for HiGlass by making a blank conda environment and following HiGlass’ install instructions.

conda create -n HiGlass
conda activate HiGlass

# HiGlass has packages on PyPI, but I've created conda recipes for them as well
conda install -c jrhawley -y higlass-python clodius jupyter

# link HiGlass notebook extension to Jupyter
jupyter nbextension install --py --sys-prefix --symlink higlass
jupyter nbextension enable --py --sys-prefix higlass

# start jupyter
jupyter notebook

In Jupyter, create a new notebook, say higlass.ipynb.

Creating a new Jupyter notebook -80%

Loading the reference annotation in HiGlass

We can now load our gene-annotations.beddb into higlass.ipynb display it as we would any other reference genome. Following the Jupyter notebook example, we can put the following in a cell:

import higlass
from higlass.client import View, Track
from higlass.tilesets import beddb, chromsizes

# load the annotation
genes = beddb("Data/hg38/gene-annotations.beddb")
# load the chromosome sizes to use as a ruler
chrom_sizes = chromsizes("hg38.chrom.sizes.sorted.tsv")

# create a Track for the gene annotations
gene_annots = Track(
    track_type="horizontal-gene-annotations",
    tileset=genes,
    position="top",
    height=150,
)

# create a Track for the gene annotations
chrom_labels = Track(
    track_type="horizontal-chromosome-labels",
    tileset=chrom_sizes,
    position="top",
    height=150,
)

# create a HiGlass server to display the data
display, server, viewconf = higlass.display([
    View(
        tracks=[chrom_labels, gene_annots],
        initialXDomain=[
            nc.chr_pos_to_genome_pos("chr21", 20000000, hg38),
            nc.chr_pos_to_genome_pos("chr21", 26000000, hg38)
        ],
    )
])

# show the interactive display
display

HiGlass display with our new reference genome -100%

And that’s it! You can now move around the genome, zoom in and out, load data like ChIP-seq tracks, RNA-seq data, Hi-C contact matrices, and more, all using your new reference genome annotation as a guide.

Conclusion

HiGlass is a very useful tool for creating interactive visualizations of genomic data, particularly Hi-C maps. If you’re using Hi-C in a novel organism, have your own reference genome that you want to display, or don’t have constant internet access to use the pre-built reference online, it can take some work to create an annotation that can be displayed alongside your data. But with the steps above, hopefully it is clear how you can create and load your own genome annotation into HiGlass.

Footnotes

  1. I know the hg38 genome is very well-studied. But this also means a lot of data is either already known to many users or readily accessible. The purpose of this post is not to focus on the individual reference, but to focus on the process of how to build the annotation for HiGlass. 

  2. There are differences between different reference genome annotations, such as RefSeq (where we get RefGene from), GENCODE, HAVANA, and others. While they all use the same underlying reference genome build, each annotation may have different numbers of genes (e.g. only protein-coding, pseudo-genes, only accepting certain types of evidence for transcription start sites). Different annotations may also have different “Gene IDs”, but that’s a story for another time. If you want to use a specific reference annotation, you can, it just takes more work to get to the end result that we need. 

  3. This resource contains reference genomes and annotations for dozens of organisms, which should cover most use cases. If you are using your own reference, then you may have to calculate this yourself.