Creating simple genome annotation tables

Published: June 19, 2020   |   Read time:

Tagged:

Image Attribution:

Biological organisms have genomes of all shapes and sizes. Some genomes are relatively simple, like phi-X bacteriophage, whose genome is a single circular strand of DNA that is about 5000 bases long and codes for 11 proteins. Other genomes are much more complex. The human reference genome is 3 billion base pairs long, split across 23 pairs of chromosomes, and codes for ~ 20 000 proteins. The process of determining which parts of a genome code for proteins, produce RNA transcripts, form telomeric or other structural regions, etc, is called “annotation”, and is a difficult process. Much of modern bioinformatics relies on the reference annotations that consortia of scientists have built over the past few decades, and the complexity of this annotation process is reflected in the complexity of the annotations, themselves.

Annotations for these genomes have been created through extensive manual work and algorithmic annotation, and there are often language-specific packages built for interacting with these annotations. A common package that works with a large number of databases is BioMart (and its biomaRt R package), but I have always found working with this package confusing. Its complex nomeclature and non-intuitive functions make it hard to understand what you’re doing at times, even if you’re and experienced programmer.

In my analyses, I prefer to work with tabular data in most cases. The structure is intuitive and interacting with the data is often consistent across very different downstream applications. What I want to do with this post is to demonstrate how to create simple and understandable tables of reference annotations, starting from a source other than BioMart. This is the process I use in many of my projects to easily arrive at immediately usable tables.

Desired end result

For this post, I’m going to be using the GRCh38 human reference genome from GENCODE to produce two tables; one will contain gene-level information, and the other will contain transcript-level information.

For genes, I want a table with the following columns:

  1. chromosome
  2. start position
  3. end position
  4. strand
  5. ENSEMBL gene ID
  6. common gene name

For transcripts, I want something similar:

  1. chromosome
  2. start position
  3. end position
  4. strand
  5. ENSEMBL gene ID
  6. common gene name
  7. ENSEMBL transcript ID
  8. common transcript name

Having these columns makes it very easy to work with many downstream applications, and having the gene ID in each table makes it easy to map back and forth between genes and transcripts, depending on which level of granularity you want.

GENCODE reference annotation

GENCODE maintains annotations for human and mice genomes. It “merges the manual gene annotation produced by the Ensembl-Havana team and the Ensenbl-genebuild automated gene annotation”1, but the main reason I like it is its clarity. GENCODE has clear documentation on what it is, how to access data, what format the data is in, definitions for values in its annotations, and more.

To start creating our tables, we can download the latest comprehensive gene annotation. For most purposes, we only need the canonical reference chromosomes, so we can start there.

Downloading the GRCh38 annotation GTF file from GENCODE -100%

The structure of a GTF file

The “Data Format” page clearly details what’s in the annotation. We can also look at the first few lines of the annotation to see what’s going on.

> zcat gencode.v34.annotation.gtf.gz | head -n 18
##description: evidence-based annotation of the human genome (GRCh38), version 34 (Ensembl 100)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2020-03-24
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.1";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    12613   12697   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    12975   13052   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    13221   13374   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  exon    13453   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.1"; havana_transcript "OTTHUMT00000002844.1";
chr1    HAVANA  gene    14404   29570   .       -       .       gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; hgnc_id "HGNC:38034"; havana_gene "OTTHUMG00000000958.1";

Things that are important to know for processing are:

  • the first 5 lines are annotation metadata
  • the rows are tab-separated
  • the important metadata is listed in column 9 as key-value pairs
    • values are contained in quotes
    • pairs are separated by the ; characters

To efficiently process this text file, we can use awk to operate on the columns within in each row and extract the relevant information.

Building an Awk script to parse each row

awk works on each row for a text file. You can keep track of the row number using the NR variable (“record number”) and you can manipulate columns N with the $N variable (e.g. column 1 is $1, column 2 is $2, etc). As noted, the 9th column is separated by ; instead of tabs like the rest of the table, but we can use regular expressions to get awk to expand the 9th column, so we can treat every key-value pair as its own column. We can skip the first 5 lines and expand the 9th column by setting the field-separator variable, FS to accept tabs or ; :

# just read column 9, no alterations. awk infers all whitespace as column/field separators
> zcat gencode.v34.annotation.gtf.gz | awk '{if (NR > 5) print $9}' | head -n 1
gene_id
# read column 9, accounting for `; `
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5) print $9}' | head -n 1
gene_id "ENSG00000223972.5"

For genes, we can see from above that those are listed in the rows where column 3 is gene. Transcripts are similarly identified.

# no filtering, print columns 1, 3, 4, 5, 7, 9, and 10
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5) print $1, $3, $4, $5, $7, $9, $10}' | head -n 3
chr1 gene 11869 14409 + gene_id "ENSG00000223972.5" gene_type "transcribed_unprocessed_pseudogene"
chr1 transcript 11869 14409 + gene_id "ENSG00000223972.5" transcript_id "ENST00000456328.2"
chr1 exon 11869 12227 + gene_id "ENSG00000223972.5" transcript_id "ENST00000456328.2"

# filtering for genes
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5 && $3 == "gene") print $1, $3, $4, $5, $7, $9, $10}' | head -n 3
chr1 gene 11869 14409 + gene_id "ENSG00000223972.5" gene_type "transcribed_unprocessed_pseudogene"
chr1 gene 14404 29570 - gene_id "ENSG00000227232.5" gene_type "unprocessed_pseudogene"
chr1 gene 17369 17436 - gene_id "ENSG00000278267.1" gene_type "miRNA"

# filtering for transcripts
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5 && $3 == "transcript") print $1, $3, $4, $5, $7, $9, $10}' | head -n 3
chr1 transcript 11869 14409 + gene_id "ENSG00000223972.5" transcript_id "ENST00000456328.2"
chr1 transcript 12010 13670 + gene_id "ENSG00000223972.5" transcript_id "ENST00000450305.2"
chr1 transcript 14404 29570 - gene_id "ENSG00000227232.5" transcript_id "ENST00000488147.1"

To clean the data further, we can use the gsub command to remove the stuff around the values we want, which are surrounded by quotes.

# no cleaning
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5 && $3 == "gene") print $1, $3, $4, $5, $7, $9}' | head -n 3
chr1 gene 11869 14409 + gene_id "ENSG00000223972.5" gene_type "transcribed_unprocessed_pseudogene"
chr1 gene 14404 29570 - gene_id "ENSG00000227232.5" gene_type "unprocessed_pseudogene"
chr1 gene 17369 17436 - gene_id "ENSG00000278267.1" gene_type "miRNA"

# removing `gene_id "` before value in column 9
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5 && $3 == "gene") {gsub(/gene_id "/, "", $9); print $1, $3, $4, $5, $7, $9} }' | head -n 3
chr1 gene 11869 14409 + ENSG00000223972.5"
chr1 gene 14404 29570 - ENSG00000227232.5"
chr1 gene 17369 17436 - ENSG00000278267.1"

# removing `gene_id "` before value in column 9 and `"` after
> zcat gencode.v34.annotation.gtf.gz | awk 'BEGIN{FS="\t|; "}{if (NR > 5 && $3 == "gene") {gsub(/gene_id "/, "", $9); gsub(/"/, "", $9); print $1, $3, $4, $5, $7, $9} }' | head -n 3
chr1 gene 11869 14409 + ENSG00000223972.5
chr1 gene 14404 29570 - ENSG00000227232.5
chr1 gene 17369 17436 - ENSG00000278267.1

By repeating this cleaning process on each of the relevant columns, we can cleanly extract only the data we want.

Putting it all together

To filter genes and transcripts, we can create the following two awk scripts2:

Then, we can run them both on the GENCODE annotation file to produce our well-formatted output tables.

> zcat gencode.v34.annotation.gtf.gz | awk -f gene.awk > genes.bed
> zcat gencode.v34.annotation.gtf.gz | awk -f transcript.awk > transcripts.bed

> head genes.bed
chr1    11869   14409   +       ENSG00000223972.5       DDX11L1
chr1    14404   29570   -       ENSG00000227232.5       WASH7P
chr1    17369   17436   -       ENSG00000278267.1       MIR6859-1
chr1    29554   31109   +       ENSG00000243485.5       MIR1302-2HG
chr1    30366   30503   +       ENSG00000284332.1       MIR1302-2
chr1    34554   36081   -       ENSG00000237613.2       FAM138A
chr1    52473   53312   +       ENSG00000268020.3       OR4G4P
chr1    57598   64116   +       ENSG00000240361.2       OR4G11P
chr1    65419   71585   +       ENSG00000186092.6       OR4F5
chr1    89295   133723  -       ENSG00000238009.6       AL627309.1

> head transcripts.bed
chr1    11869   14409   +       ENSG00000223972.5       DDX11L1         ENST00000456328.2     DDX11L1-202
chr1    12010   13670   +       ENSG00000223972.5       DDX11L1         ENST00000450305.2     DDX11L1-201
chr1    14404   29570   -       ENSG00000227232.5       WASH7P          ENST00000488147.1     WASH7P-201
chr1    17369   17436   -       ENSG00000278267.1       MIR6859-1       ENST00000619216.1     MIR6859-1-201
chr1    29554   31097   +       ENSG00000243485.5       MIR1302-2HG     ENST00000473358.1     MIR1302-2HG-202
chr1    30267   31109   +       ENSG00000243485.5       MIR1302-2HG     ENST00000469289.1     MIR1302-2HG-201
chr1    30366   30503   +       ENSG00000284332.1       MIR1302-2       ENST00000607096.1     MIR1302-2-201
chr1    34554   36081   -       ENSG00000237613.2       FAM138A         ENST00000417324.1     FAM138A-201
chr1    35245   36073   -       ENSG00000237613.2       FAM138A         ENST00000461467.1     FAM138A-202
chr1    52473   53312   +       ENSG00000268020.3       OR4G4P          ENST00000606857.1     OR4G4P-201

Conclusion

It’s often much easier to work with simple tabular data than complex databases. And moving small text files between computers or clusters is typically easier than large database files. Using two simple awk scripts, we can parse out annotation data from a large GTF file into an easily readable file, that can be used in almost any downstream application. This simplicity keeps things simple for the end user3 and also makes it easy to incorporate into automated pipelines with known output formats.

I did the above with the GRCh38 genome from GENCODE, but in practice, you could swap this annotation with any reference annotation in GTF format, and the results should be similar. It should also be clear how generalize the above steps if you want different output columns or want them in a different order.

References & Footnotes

  1. GENCODE FAQs 

  2. GitHub Gist for gene.awk and transcript.awk files 

  3. Biological data is already complicated enough, it really helps to KISS