Genomic epidemiology: it's all about the data...
Getting data from the sequencer to the epidemiologists takes many steps, and requires careful planning from the policy and data perspectives.
Thinking about integrating sequencing data into the epidemiology work at your agency? There are some critical decisions that need to be made:
Where the data will live
What data elements are important
How it will get there
Under what authority it will be collected
Where the data will live
One key prerequisite to performing data analysis is…data! When working on genomic epidemiology in a public health agency, bringing together the sequencing data with the epidemiologic data is often one of the key challenges. Sequence data is often stored in cloud environments and then eventually in public databases. Typically these environments are not appropriate for bringing in clinical or epidemiologic data for security reasons. So how do we bring these two critical datasets together in a routine manner in a public health agency?
Disease surveillance systems house critical information such as patient demographics, exposure information, case classification, hospitalization and death information, and laboratory results. Basically all the data that you would like to see alongside your sequencing data! Disease surveillance systems also have the additional advantage of being highly access controlled and typically very customizable for field permissions. This means that you can use existing access controls to manage things like who can view sequencing information for what cases, and also who is allowed to add/edit/delete information.
For these reasons, disease surveillance systems are an excellent place to store sequence-associated information. I chose my words carefully there - because they are not a place to store actual sequencing data! These files can be very large, and they are not what these systems were designed to hold. So if we’re not storing sequence files in the disease surveillance system, what should we be storing?
What data elements are important
There are two main pieces of information that should be stored in the disease surveillance system:
Key genomic-derived information
Things like Salmonella serotype, norovirus genotype, key AMR genes, etc.
Ideally these should be things that are not subject to frequent changes
SARS-CoV-2 Pango lineages are an example of a result that can change over time as bioinformatics tools are updated
If there are expected to be changes, there needs to be a plan in place for how to update and how frequently those updates would happen
Where possible, track the version/dates so users are aware
Sequence database accession number
I think this is the most important piece!
Having this ID sent to the disease surveillance database unlocks all further analyses and allows you to export and merge data from your epi database and sequencing database
It’s easiest if the sequences can be put in an open and public database like NCBI, but it can point to a closed sequence repository if needed
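As a rough sketch of what these two pieces might look like as a record, here is one possible shape in Python. Every field name, ID, and version string below is a hypothetical placeholder, not the schema of any real surveillance system. The point is only that the accession number travels with the result, and that a changeable result such as a lineage carries a tool version and date.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SequenceResult:
    """Sequence-associated fields attached to a case (all names hypothetical)."""
    case_id: str                       # case ID in the disease surveillance system
    lab_accession: str                 # LIMS accession for the clinical specimen
    sequence_accession: Optional[str]  # ID in the public or closed sequence repository
    result_name: str                   # e.g. "pango_lineage" or "serotype"
    result_value: str
    tool_version: str                  # version of the tool that produced the call
    assigned_on: date                  # date the call was made


def update_result(record: SequenceResult, new_value: str,
                  tool_version: str, assigned_on: date) -> SequenceResult:
    """Return an updated copy; the old record is left untouched for history."""
    return replace(record, result_value=new_value,
                   tool_version=tool_version, assigned_on=assigned_on)


# Illustrative values only.
original = SequenceResult("CASE-1", "LAB-001", "SEQ-A", "pango_lineage",
                          "B.1.1.529", "v1", date(2021, 12, 1))
updated = update_result(original, "BA.1", "v2", date(2022, 1, 15))
```

Keeping the earlier record rather than overwriting it is one way to let users see when and why a result changed.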
Each agency’s system is going to look a little different, but in my example below, samples come into the public health laboratory, are entered in the Laboratory Information Management System (LIMS), and are given an accession number. That accession, along with the initial clinical results (for example, SARS-CoV-2 detected), is sent via electronic laboratory report to the disease surveillance system. In parallel, sequencing is performed and analyzed, and the data is submitted to a public sequence repository. Information from these processes is entered back into the LIMS - this may include data from the bioinformatics analysis, such as AMR genes detected, as well as a database ID for the sequence database. Typically the actual clinical accession number is not what is entered into the public sequence databases, so it’s important to capture both numbers. Once all this data is back in the LIMS, it can be transmitted by electronic laboratory report to the disease surveillance system.
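To make the "capture both numbers" point concrete, here is a minimal sketch of the linkage using plain Python dictionaries. The field names and IDs are invented for illustration. In a real agency this would happen through ELR messages and the surveillance system itself, not a script.

```python
# LIMS rows carry BOTH the clinical accession and the sequence repository ID.
lims_rows = [
    {"lab_accession": "LAB-001", "sequence_id": "SEQ-A", "lineage": "BA.1"},
    {"lab_accession": "LAB-002", "sequence_id": "SEQ-B", "lineage": "BA.2"},
]

# Surveillance cases already hold the clinical accession from the initial report.
cases = [
    {"case_id": "C-1", "lab_accession": "LAB-001"},
    {"case_id": "C-2", "lab_accession": "LAB-002"},
    {"case_id": "C-3", "lab_accession": "LAB-003"},  # not sequenced
]


def attach_sequence_info(cases, lims_rows):
    """Add sequence_id and lineage to each case, matched on the lab accession."""
    by_accession = {row["lab_accession"]: row for row in lims_rows}
    linked = []
    for case in cases:
        seq = by_accession.get(case["lab_accession"], {})
        linked.append({**case,
                       "sequence_id": seq.get("sequence_id"),
                       "lineage": seq.get("lineage")})
    return linked


linked_cases = attach_sequence_info(cases, lims_rows)
```

Cases without a matching LIMS row keep their place in the output with empty sequence fields, which makes it easy to see which cases have not been sequenced.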
Once the key data pieces are in the disease surveillance system, this unlocks a world of possibilities! Pat yourself on the back, and appreciate that you’ve put in the hard prep work and can now move on to the fun part - analysis! You can build phylogenetic trees and match them up with your exported epi metadata, build visualizations of sequence types over time, run analyses on hospitalization status for different variants, etc. Whatever genomic epi questions you have, you now have the ability to start exploring them.
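As one small example of that kind of analysis, the sketch below tallies hospitalizations by lineage from an exported line list. The rows and field names are made up for illustration, and a real analysis would also need to account for things like age, vaccination status, and who gets sequenced.

```python
from collections import defaultdict

# Hypothetical export: one row per case, with epi and sequence fields side by side.
export_rows = [
    {"case_id": "C-1", "lineage": "BA.1", "hospitalized": True},
    {"case_id": "C-2", "lineage": "BA.1", "hospitalized": False},
    {"case_id": "C-3", "lineage": "BA.2", "hospitalized": False},
    {"case_id": "C-4", "lineage": None, "hospitalized": True},  # no sequence
]


def hospitalization_by_lineage(rows):
    """Return {lineage: (hospitalized_count, total_count)} for sequenced cases."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        lineage = row.get("lineage")
        if not lineage:
            continue  # skip cases with no lineage recorded
        counts[lineage][1] += 1
        if row.get("hospitalized"):
            counts[lineage][0] += 1
    return {lineage: (hosp, total) for lineage, (hosp, total) in counts.items()}


summary = hospitalization_by_lineage(export_rows)
```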
This exact setup may not be what works in your agency, but the exercise of mapping out the data systems, looking for the data keys that link disparate systems, and working through routes of moving data around is always a great first step.
How it will get there
So, you’ve established that you want to put some sequence information into your disease surveillance system. Now how do you get it there? The most streamlined method of getting the data in is through established electronic laboratory reporting (ELR) - CDC even has guidelines for this in the SARS-CoV-2 context. This has several advantages - it avoids creating a unique process, it makes it easier for laboratories that are already familiar with ELR, and it is a scalable solution that can be adopted nationwide rather than each state setting up a unique process.
While this happy path is ideal, there are some barriers. One is that not all sequencing labs are clinical labs, and those that are not may not be established ELR submitters. At least for the medium term, other options such as secure file transfer systems will need to be maintained.
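For labs submitting flat files over secure file transfer, it helps to check each file before anything is loaded. The sketch below shows one way to do that for a CSV file. The required column names are assumptions for illustration, not a CDC or state specification.

```python
import csv
import io

# Hypothetical required columns for a non-ELR submission.
REQUIRED_COLUMNS = {"lab_accession", "sequence_id", "organism", "result"}


def validate_submission(text):
    """Return (valid_rows, errors) for the CSV content of one submission."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [], [f"missing columns: {sorted(missing)}"]
    valid, errors = [], []
    for row_no, row in enumerate(reader, start=1):
        # A short row leaves trailing fields as None, so treat None as blank.
        blank = sorted(c for c in REQUIRED_COLUMNS if not (row[c] or "").strip())
        if blank:
            errors.append(f"data row {row_no}: blank {blank}")
        else:
            valid.append(row)
    return valid, errors


sample = (
    "lab_accession,sequence_id,organism,result\n"
    "LAB-001,SEQ-A,SARS-CoV-2,BA.1\n"
    "LAB-002,,SARS-CoV-2,BA.2\n"
)
valid_rows, problems = validate_submission(sample)
```

Rejecting rows that lack an identifier at intake is far easier than trying to repair the linkage after the data is in the surveillance system.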
The policy angle
As important as the mechanics of setting up data transfer are, the policy angle is just as critical. It’s one thing to work internally on having the public health laboratories transmit sequencing information, but getting sequencing data from clinical, commercial, and academic laboratories is another matter - because it’s generally not required. Each state maintains disease reporting regulations, which specify what data is required to be reported for each condition, and which physical specimens have to be submitted to the public health laboratory.
As sequencing becomes more common, and not only in public health laboratories, there are many scenarios that need to be considered. What happens if an academic laboratory starts sequencing an organism that is also required to be submitted to the public health lab, like Salmonella? Will that sequencing data be shared with public health? Will public health also sequence the same isolate (leading to duplication in public sequence repositories)? Can the clinical/academic lab submit sequence data in lieu of an isolate? Regulation updates will need to be forward-thinking and written to be as flexible as possible to cover evolving technologies. For some states, updating these regulations can be done by the health commissioner or secretary, but for many states it requires a legislative process, which can be many years in the making.
During COVID, California took the step of requiring reporting of sequencing results. Taking this step required not only the regulatory piece, but also setting up a mechanism for the labs to report. Before changing regulations to require reporting, it is critical that how and what labs should submit is well thought out and clearly communicated.
A laboratory that performs genetic sequencing of SARS-CoV-2 shall submit sequence data to the Department in an electronic format specified by the Department. In addition, a laboratory that identifies a SARS-CoV-2 strain designated as a variant of public health importance by the Department shall transmit the report in a format specified by the Department to the state electronic reporting system or local electronic reporting system that is linked to the state electronic reporting system. The sequence data submission and the strain report shall include the information specified under the HOW TO REPORT section on page 3 of this document and if applicable, the federal Clinical Laboratory Improvement Amendments (CLIA) certificate number.
- Title 17, California Code of Regulations (CCR), Section 2505
Take-home messages
Another important piece to remember in these discussions is that each of these isolates or samples comes from a person, and ensuring that the sequencing data is used for legitimate public health purposes and stored securely is critical. This has come up particularly with regard to sharing of HIV sequence information, where there are theoretical possibilities of legal action (see NY Times article). As public health sequencing efforts expand, this dimension needs to be part of the considerations.
Effective use of sequencing data in public health requires addressing a combination of policy, data, analysis, and communications questions. Because addressing some of these challenges can take a long time, early up-front planning to identify all relevant stakeholders, processes, and potential barriers is important for speeding up implementation. Which particular systems are used matters less than ensuring that all systems have compatible identifiers that allow for combined analysis. By bringing together these disparate data sources, sequencing data can be used to its fullest potential in protecting public health.
Note: my experience and this article are based on the US public health system. Some of the terminology, regulations, and systems may be different outside the US.
Join the conversation here in the comments below, or on LinkedIn.