Don't trust your data at first glance

Published: November 23, 2018   |   Updated: December 04, 2018

Analyzing data, particularly large datasets, requires a lot of up-front work before you can even begin to ask the questions you wanted to ask in the first place. All this pre-processing (data normalization, batch correction, and bias removal) is required to reduce noise in your data so that you can focus on, hopefully, only the effect you’re trying to study.

But this approach of cleaning and validating your data doesn’t just apply to data you generate; it also applies to data you get from others. And it doesn’t matter whether those “others” are colleagues, your own work from a year ago, or industrial suppliers. Running a series of common-sense and validation checks on this data before you try to make use of it may be a bit annoying, but it’s worth it in the long run for all the headaches it saves.

Take one of the most notorious errors in genomics data, for example: the “off-by-1” error for genomic coordinates. For those unfamiliar with this problem, it is the “you’re missing a semicolon” issue of genomic data – it takes you hours to figure out why your code isn’t working, the fix is extremely simple, and your code magically works afterwards. It’s often not the fault of the data provider, but an error in interpretation: the person using the data expects coordinates to be 0-indexed when they’re actually 1-indexed, or vice versa, and because of that assumption, everything goes haywire. It’s only once you go back and look at your original data source that you realize where the mistake was.
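As a minimal sketch with made-up coordinates, here is the same single base written in both conventions, showing that converting between them is just a shift of the start coordinate:

```shell
# The same single base (the 7th nucleotide of chr1) in both conventions.
# BED-style: 0-based, half-open intervals [start, end)
bed_start=6
bed_end=7
# GFF/VCF-style: 1-based, closed positions; only the start shifts
gff_start=$((bed_start + 1))
gff_end=$bed_end
echo "BED:     chr1 $bed_start $bed_end"   # BED:     chr1 6 7
echo "GFF/VCF: chr1 $gff_start $gff_end"   # GFF/VCF: chr1 7 7
```

Reading a 1-based position as if it were 0-based (or vice versa) silently shifts every feature by one base, which is exactly why this bug is so hard to spot by eye.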

The validations you need to run are often context- and data-dependent. Different file formats have different specifications, gene IDs are dependent on the database that they come from, date formats may vary by locale, etc.
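For instance, a couple of cheap, format-level sanity checks on a BED file (the file and its contents here are toy examples, just to illustrate) will catch the most common structural problems before any real analysis:

```shell
# build a toy BED file with one malformed record (start >= end)
printf 'chr1\t10\t20\nchr1\t30\t30\n' > toy.bed

# BED intervals are 0-based and half-open, so start must be strictly less than end
awk -F'\t' '$2 >= $3 { bad++ } END { print bad + 0, "invalid records" }' toy.bed
# prints: 1 invalid records

# check that the file is coordinate-sorted; sort -c exits non-zero if it is not
sort -c -k1,1 -k2,2n toy.bed && echo "coordinate-sorted"
```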

I’m going to give a recent example of why you shouldn’t immediately trust data from others, even those you trust (frankly, it’s the entire reason for writing this post in the first place).

Case example: a list of targeted regions for DNA sequencing

I’m studying DNA methylation in cancer, and for this, a collaborator had generated some data using SeqCap Epi CpGiant Probes from Roche1. At the time of this post, this page provided files for the hg19 genome that listed the genomic regions that this kit targeted, as well as a list of all the cytosines in C-G dinucleotides (CpGs) within these target regions.

I used this list of CpGs as a common basis to map methylation values onto, so that I could compare samples against each other. I did it this way, rather than intersecting samples directly, because I had many samples, and intersecting coordinates across many samples is a computationally intensive process.
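The idea can be sketched with toy data (these file names and values are made up, and a real pipeline might use bedtools map instead of join): key every CpG in the target list by chrom:position, then left-join each sample’s calls onto that fixed key list, so every sample ends up expressed over the same set of sites.

```shell
# fixed list of target CpG keys (join requires sorted input)
printf 'chr1:100\nchr1:200\nchr1:300\n' > cpg_keys.txt
# one sample's methylation calls, keyed the same way
printf 'chr1:100\t0.80\nchr1:300\t0.10\n' > sample1.txt

# left-join the sample onto the key list; target CpGs without a call get NA
join -t "$(printf '\t')" -a 1 -e NA -o 0,2.2 cpg_keys.txt sample1.txt
```

With every sample mapped onto the same scaffold of sites, comparing two samples becomes a cheap row-wise operation instead of a coordinate intersection.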

What I found, though, was that almost none of my samples had methylation calls overlapping the > 5.5M CpGs listed in Roche’s target list. This, obviously, was extremely odd. I checked that the genome version was correct, I checked the target BED file for the “off-by-1” error (their coordinates correctly abided by the BED file format specification2), and I checked that the strands were properly labelled. Nothing was incorrect here, so I still didn’t know what was wrong.

I opened IGV to see how my samples lined up against the target list, and why they didn’t match. Below is a figure of what I saw.

[Figure: IGV screenshot of Roche’s target regions against the hg19 reference sequence]

Notice how the regions on the + strand (blue rectangles with arrows to the right) overlap G’s in CpG dinucleotides, and the regions on the - strand are adjacent, but overlap a mixed bag of nucleotides.

This list of CpGs isn’t doing what I assumed it did – overlap CpGs.

Now, IGV is 1-indexed and BED files are 0-indexed. IGV has been around a long time and knows how to handle BED files properly, so I trusted this. But I wanted to be extra extra sure that Roche had an error in their target list.

To do this, I explicitly wanted to check how many of these “cytosines” actually mapped to cytosines in the hg19 genome.

# reference hg19 genome used and validated by our genomics facility
$ REF="/path/to/hg19/genome.fasta"
# BED file from Roche that lists all the CpGs
$ BED="130912_HG19_CpGiant_4M_EPI_CpG.bed"

# command to extract the sequence in the FASTA file at the coordinates listed in the BED file
$ bedtools getfasta -fi $REF -bed $BED -fo check-off-by-1.original.fasta -s

# count how many of the extracted single-base sequences are each nucleotide
$ grep "^[Aa]" check-off-by-1.original.fasta | wc -l
509777
$ grep "^[Cc]" check-off-by-1.original.fasta | wc -l
954897
$ grep "^[Gg]" check-off-by-1.original.fasta | wc -l
3701597
$ grep "^[Tt]" check-off-by-1.original.fasta | wc -l
458825

This confirms on a global scale what I saw in that screenshot. Namely, that this list of “cytosines” doesn’t actually map to cytosines.

From that screenshot, it appears that these positions are 1 nt too large: the sites on the + strand cover the G in the CpG, not the C, and the - strand sites sit immediately after it. I can shift each of these coordinates back by 1, and check how the new coordinates map.

# shift each coordinate back by 1, using awk
$ awk -v FS='\t' -v OFS='\t' '{print $1, $2 - 1, $3 - 1, $4, $5, $6}' $BED > hg19_CpGs.corrected.bed

# extract sequences from hg19 reference
$ bedtools getfasta -fi $REF -bed hg19_CpGs.corrected.bed -fo check-off-by-1.corrected.fasta -s

# count how many of the extracted single-base sequences are each nucleotide
$ grep "^[Aa]" check-off-by-1.corrected.fasta | wc -l
0
$ grep "^[Cc]" check-off-by-1.corrected.fasta | wc -l
5625096
$ grep "^[Gg]" check-off-by-1.corrected.fasta | wc -l
0
$ grep "^[Tt]" check-off-by-1.corrected.fasta | wc -l
0

# check how many lines there are in the target list of CpGs
$ wc -l hg19_CpGs.corrected.bed
5625096 hg19_CpGs.corrected.bed

This shows that every single position in the corrected list maps to a cytosine (a proper validation for this file). Not one maps to a non-cytosine, which is an extremely strong indicator that Roche’s original list is off by one.

And just like that, my problems were solved. My code started working, my samples had methylation data that covered all 5.6M of these sites, and everything was good again.

Conclusion

It wasn’t obvious that the data was wrong when I started looking at it. The original list passed all the other validation checks I ran when I first started running into problems, so it wasn’t clear what was up – none of the obvious culprits were guilty. It took me a few hours to finally figure out what was wrong and why my code wasn’t working as I thought it was supposed to. I’ve since contacted Roche’s technical support about this, to make sure this error is fixed, so hopefully they get on that quickly.

I really want to highlight here that my only mistake was that I assumed the data provided by the manufacturer was correct.

To be fair, you can’t run every single possible validation step to make sure your data is completely correct before trying something with it. That’s madness, and you’ll never get anything done. But it’s important to keep in mind that you shouldn’t trust any data as perfectly valid. There are always ways in which it can be wrong (some more likely than others), and the weirder the problems you encounter, the more outlandish and unlikely the failure modes your validation checks have to cover.

So be careful out there. The data is dark and full of errors.

Update: 2018-12-04

I’ve made contact with Roche’s tech support, and we’ve resolved the issue. They haven’t replaced the manifest files on the website yet, but I do have their fixed files now. They were pretty helpful and everything went smoothly, but my overall point still remains: you can’t just trust what others have done without checking for yourself.

References & Footnotes