We present here seven considerations of the role of public health informatics in the COVID-19 pandemic. These cover a broad range of topics: information systems for monitoring the disease and for disseminating accurate information to the public, leveraging the evidence already available in a huge corpus of virus infection- and pandemic-related research, building more realistic models of disease risk, spread, and the effects of societal interventions, and the as-yet poorly understood post-pandemic effects on public health.
Information systems for COVID-19 monitoring
A critical need for any strategy that addresses COVID-19 is adequate disease monitoring. At the level of cases and deaths, several efforts around the world have arisen to maintain and display official counts, including by researchers at Johns Hopkins University (https://coronavirus.jhu.edu/map.html) and reporters at the New York Times (https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html). These and other efforts rely on reports obtained from heterogeneous sources, many of which capture and store data differently, requiring that informaticians process and display data effectively. Case and death counts are helpful and widely used by healthcare systems, policy makers, governmental institutions, and the general public. However, they are notoriously biased given the differing availability and use of lab-based tests to determine COVID-19 case status at various locations.
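As a minimal illustration of the data harmonization step involved, the following Python sketch combines case counts from two hypothetical source files whose column names and date formats differ; the file names and schemas are assumptions made purely for illustration.

```python
# Illustrative sketch: harmonizing case counts from two hypothetical source files
# whose column names and date formats differ, then producing combined daily totals.
import pandas as pd

# Hypothetical exports from two reporting sources (file names and columns are assumptions).
src_a = pd.read_csv("source_a_daily_cases.csv")   # columns: region, report_date, confirmed
src_b = pd.read_csv("source_b_counts.csv")        # columns: Location, Date, Cases

# Map each source onto a common schema.
a = src_a.rename(columns={"region": "location", "report_date": "date", "confirmed": "cases"})
b = src_b.rename(columns={"Location": "location", "Date": "date", "Cases": "cases"})

combined = pd.concat([a, b], ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"])  # normalize heterogeneous date formats

# Aggregate to daily totals per location for display on a dashboard or map.
daily_totals = (combined.groupby(["location", "date"], as_index=False)["cases"]
                        .sum()
                        .sort_values(["location", "date"]))
print(daily_totals.head())
```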
More comprehensive efforts to track the true impact of COVID-19 necessitate appropriate wide-scale testing for SARS-CoV-2. Knowing who carries the virus, regardless of symptom or disease status, enables efficient prevention of further transmission, proper identification of risk factors that lead to divergent symptoms, and adequate preparation of healthcare systems to treat patients who are carriers while minimizing risk to providers and to patients who have not been infected. Design and deployment of population-level testing should be a primary goal for the effective containment of COVID-19. In conjunction with apps developed by informaticists, contact tracing combined with case isolation can proceed effectively to control outbreaks [8]. Such efforts are thought to have curtailed the spread of COVID-19 in Singapore and South Korea. Because it is unlikely, in countries like the U.S., that federal or local governments, or many citizens, would use contact tracing without assurance that individual-level data are safeguarded, various informaticists are engaged in efforts to create privacy-preserving contact-tracing apps.
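The following sketch illustrates, in highly simplified form, the rolling-token idea underlying several decentralized, privacy-preserving contact-tracing proposals; it is a conceptual illustration under assumed parameters, not the protocol of any deployed app.

```python
# Conceptual sketch of decentralized, privacy-preserving proximity tokens, loosely
# inspired by decentralized contact-tracing proposals; not any app's actual protocol.
import hashlib
import os
import time

def daily_seed() -> bytes:
    """A device-local random seed, regenerated daily and never uploaded while healthy."""
    return os.urandom(32)

def rolling_token(seed: bytes, interval_index: int) -> str:
    """Derive a short-lived broadcast token; observers cannot link tokens without the seed."""
    return hashlib.sha256(seed + interval_index.to_bytes(4, "big")).hexdigest()[:32]

# A device broadcasts a new token every (say) 15 minutes and stores tokens it hears nearby.
seed = daily_seed()
interval = int(time.time() // (15 * 60))
broadcast = rolling_token(seed, interval)

# On a positive test, the user may consent to publish their seeds; other devices re-derive
# the tokens locally and check them against their own list of observed tokens.
observed_tokens = {broadcast}  # placeholder for tokens collected via Bluetooth
exposed = any(rolling_token(seed, interval + k) in observed_tokens for k in range(-4, 5))
print("possible exposure" if exposed else "no exposure recorded")
```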
SARS-CoV-2 containment was not successful in most countries, due in part to a lack of appropriate wide-scale testing, which contributed to undetected transmission. Ultimately, nothing can replace appropriate lab-based viral testing for understanding disease transmission, but informatics solutions can help partly overcome testing inadequacies. In the U.S., Canada, and Mexico, COVID Near You (https://covidnearyou.org/) is a citizen participation platform through which any person can contribute their current health status as it relates to COVID-19 symptoms and test results. Aggregation of these individual-level data is being used to track population-level health in real time. Other data that can be used to fill monitoring gaps include search engine data (e.g., Google queries for COVID-19-related terms) and, to a lesser extent, social media data (e.g., Twitter posts related to COVID-19). Informaticists are leading and contributing to such efforts around the world.
As results of SARS-CoV-2 tests, along with serological assays that detect seroconversion, become more widely available, retrospective studies can proceed to determine more accurately how COVID-19 spreads and how many true cases existed prior to widespread testing. Informaticians can participate in these efforts, which require accounting for test characteristics (sensitivity/specificity) and comparing the characteristics of patients who were actually tested with those of the underlying population. Ongoing retrospective analyses such as these are critical for gaining the knowledge necessary to avoid future resurgences of COVID-19.
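As one concrete example of accounting for test characteristics, the sketch below applies the Rogan-Gladen correction to an observed test-positive proportion; the numbers are illustrative only.

```python
# Minimal sketch: adjusting an observed test-positive proportion for imperfect test
# sensitivity and specificity (the Rogan-Gladen estimator). Numbers are illustrative only.
def corrected_prevalence(apparent: float, sensitivity: float, specificity: float) -> float:
    """Estimate true prevalence from the apparent (test-positive) proportion."""
    est = (apparent + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(est, 0.0), 1.0)  # clamp to the valid [0, 1] range

# Example: 8% of sampled individuals test positive on an assay assumed to be
# 90% sensitive and 98% specific.
print(corrected_prevalence(apparent=0.08, sensitivity=0.90, specificity=0.98))
```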
Systems for disseminating accurate information related to COVID-19 to the public
An emerging issue that concerns the prevention of COVID-19 is the widespread dissemination of speculation, rumors, half-truths, disinformation, and conspiracy theories on popular social media platforms. For policies, guidelines, and mandates, which may be updated on a weekly or even daily basis, to reach and be adopted by the general public, it is important that relevant, vetted information sources be clearly identified and potentially pointed to in response to misleading posts. In recent years there have been many exciting efforts to combine natural language processing (NLP), machine learning, and social media scraping to monitor clinical outcomes of interest such as foodborne illnesses [20]. There may be an opportunity to adapt such informatics approaches to monitor, and perhaps even combat, the dissemination of ‘bad’ information through automated responses that redirect individuals to sources identified as reliable within the scientific community. Rule-based systems such as ‘expert systems’ could be combined with NLP technologies to construct such monitoring and response frameworks. Equally important is the consumer health informatics task of developing clear, concise, and easily navigable informational resources for COVID-19 that summarize up-to-date information and guidelines, link summary information back to relevant primary sources, attempt to quantify the certainty/reliability of available information, and offer explanations of the reasoning whenever information or guidelines need to be updated.
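As a minimal sketch of the supervised NLP component such a monitoring framework might include, the following classifier flags posts resembling known misleading claims; the training examples and labels below are invented for illustration, and a real system would require carefully curated data and expert review.

```python
# Illustrative sketch of a bag-of-words classifier that flags posts resembling
# known misleading claims. The tiny training set is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "drinking hot water cures the virus",            # misleading (invented example)
    "miracle supplement stops infection overnight",  # misleading (invented example)
    "wash hands often and avoid close contact",      # consistent with public guidance
    "symptoms include fever cough and fatigue",      # consistent with public guidance
]
train_labels = [1, 1, 0, 0]  # 1 = likely misleading, 0 = likely reliable

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

new_post = "this herbal tea prevents the virus"
prob_misleading = model.predict_proba([new_post])[0][1]
print(f"estimated probability the post is misleading: {prob_misleading:.2f}")
```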
Data visualization and analysis systems for rapid assessment of COVID-19 spread
The spread of infectious diseases such as COVID-19 provides a unique opportunity to assess the regional spread and progression of disease at a population level. Differences in the pathogenic mechanisms of the diseases responsible for past pandemics imply that the spread of COVID-19 may not be completely predictable from historical rates of disease transmission. Data on the cumulative number of COVID-19 cases are available at country, regional, and city levels, and by studying the progression and spread of disease in regions affected close to the time of the initial outbreak, meaningful projections of infection rates can be made for areas that will be affected later. For example, by modeling daily regional cumulative COVID-19 cases, regional differences in trends can illuminate the comparative effectiveness of different policy decisions and can identify countries and policies that have succeeded in slowing the rate of COVID-19 spread, providing evidence for the adoption of effective public health policies by areas still in the early phases of the pandemic. Presenting this information to the public using data visualization methods is an important informatics activity.
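One common form of such a visualization aligns regional cumulative case curves by days since a fixed case threshold so trajectories can be compared; the sketch below uses fabricated counts purely for illustration.

```python
# Sketch: align regional cumulative case curves by "days since the 100th case"
# so trajectories can be compared across regions. The data below are fabricated.
import pandas as pd
import matplotlib.pyplot as plt

cases = pd.DataFrame({
    "region": ["A"] * 6 + ["B"] * 6,
    "date": pd.date_range("2020-03-01", periods=6).tolist() * 2,
    "cumulative_cases": [80, 120, 200, 340, 560, 900,
                         60, 90, 140, 210, 320, 480],
})

threshold = 100
for region, grp in cases.groupby("region"):
    grp = grp[grp["cumulative_cases"] >= threshold]
    days = range(len(grp))  # days since crossing the threshold
    plt.plot(days, grp["cumulative_cases"], marker="o", label=f"Region {region}")

plt.yscale("log")
plt.xlabel("Days since 100th case")
plt.ylabel("Cumulative cases (log scale)")
plt.legend()
plt.show()
```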
Synthesizing evidence to understand COVID-19 origins, spread, and prevention
As of April 11, 2020, more than 3700 manuscripts on COVID-19 have been published or posted at PubMed, BioRxiv, and MedRxiv by researchers all over the world (https://www.ncbi.nlm.nih.gov/research/coronavirus/). These manuscripts cover a wide spectrum of important topics that can help us understand critical aspects of the clinical and public health impacts of COVID-19, including disease mechanism, diagnosis, treatment, prevention, viral infection, replication, pathogenesis, transmission, viral host range, and virulence. At the same time, this amount of information is increasingly overwhelming for stakeholders, policymakers, researchers, and interested parties to comprehend. A systematic review, a type of literature review that uses systematic methods to collect secondary data and critically appraise research studies, can be useful in synthesizing the existing evidence from COVID-19-related research findings. In particular, meta-analysis plays a central role in systematic reviews by quantitatively synthesizing evidence from multiple scientific studies that address related questions.
Manual literature review is time consuming and, more importantly, it is challenging to keep up to date with the rapidly increasing volume of literature. Medical informatics tools can improve the efficiency and scalability of up-to-date evidence synthesis for COVID-19-related research. For example, clinical natural language processing (NLP) tools can be used for literature screening and information retrieval. Software such as Abstrackr [21,22,23] and DistillerSR (https://www.evidencepartners.com/) has been used to reduce manual effort in literature screening. Beyond literature screening, DistillerSR is also a useful tool for managing the multi-step workflow of the systematic review process. Recently, DistillerSR was made freely available to systematic reviewers and researchers conducting systematic reviews related to COVID-19. For meta-analysis, tools such as Comprehensive Meta-Analysis (CMA) (https://www.meta-analysis.com/), RevMan (https://training.cochrane.org/online-learning/core-software-cochrane-reviews/revman), and macros in Stata (https://www.stata.com/) are available for standard meta-analyses. However, for COVID-19-related research, more sophisticated methods are needed to address features unique to this topic. For example, the quality of the findings reported in the above-mentioned 3700 manuscripts is expected to be highly heterogeneous, especially for manuscripts that have not been peer reviewed. It is critically important to account properly for such heterogeneity across studies. Furthermore, the reported findings may be subject to more severe publication bias and outcome reporting bias [24], as the analysis of the data and the reporting of results are likely to be based on different protocols. Visualization tools, sensitivity analyses, and inference based on bias-correction models can be useful in evaluating the quality of the evidence [25,26,27,28,29,30,31]. In addition, novel visualization tools, such as the tornado plot in a cumulative meta-analysis [32], will be valuable for presenting how the cumulative evidence answering a COVID-19-related question evolves over time. R packages including ‘meta’, ‘metafor’, ‘metasens’, ‘netmeta’, ‘mvmeta’, ‘mada’, and ‘xmeta’ are useful for advanced meta-analyses with these needs. Finally, online platforms for meta-analysis, such as programs with Shiny interfaces, are greatly needed to make it convenient for COVID-19 researchers to summarize and synthesize results.
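As a stripped-down illustration of the core computation such tools perform, the following sketch computes a DerSimonian-Laird random-effects pooled estimate (shown here in Python rather than the R packages named above); the effect sizes and variances are invented.

```python
# Stripped-down sketch of a DerSimonian-Laird random-effects meta-analysis.
# Study-level effect estimates and variances below are invented for illustration.
import numpy as np

y = np.array([0.30, 0.12, 0.45, 0.20])   # effect estimates (e.g., log odds ratios)
v = np.array([0.04, 0.02, 0.09, 0.03])   # within-study variances

w = 1.0 / v                               # fixed-effect (inverse-variance) weights
y_fe = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fe) ** 2)           # Cochran's Q heterogeneity statistic
k = len(y)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_re = 1.0 / (v + tau2)                   # random-effects weights
y_re = np.sum(w_re * y) / np.sum(w_re)    # pooled random-effects estimate
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"pooled = {y_re:.3f}, 95% CI = ({y_re - 1.96*se_re:.3f}, {y_re + 1.96*se_re:.3f}), tau^2 = {tau2:.3f}")
```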
Advanced, more realistic models of disease spread to guide policymaking
Differential equation-based epidemiological models such as the Susceptible-Infected-Recovered (SIR) or Susceptible-Exposed-Infected-Recovered (SEIR) models and their variants are key workhorses for studying infectious disease dynamics. These models have been widely used to make projections and to inform policy-makers constructing mitigation strategies for the disease. One weakness of these models is that they treat individuals in a given population as homogeneous, with constant risk, exposure, infection, and recovery/death rates throughout the larger group. This is a gross oversimplification and a primary factor in the models' limited predictive accuracy. Statisticians have been engaging in COVID-19 efforts with statistical models that use functional data or time series modeling techniques. These models often use covariates or latent factors to account for population heterogeneity and provide uncertainty quantification, thus improving on a weakness of the SEIR models. However, they do not represent the dynamic infectious disease process, which may limit their interpretability and forecasting accuracy. One key area of quantitative research that can emerge from this COVID-19 crisis is hybrid epidemiologic-statistical models: models based on SIR or SEIR frameworks whose transition probabilities vary stochastically according to individual or environmental covariates, account for clustering effects, and effectively propagate uncertainty in forecasting. Such models can combine the strengths of each type, and given the broad availability of large-scale data on mobility, density, demographics, and other factors that vary across communities, they can produce much more realistic models and more accurate projections to guide policymaking.
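For concreteness, a minimal deterministic SEIR model of the kind described above can be written as a small system of ordinary differential equations; the parameters below are assumed purely for illustration and are not calibrated to COVID-19, and the hybrid models discussed would additionally let these rates vary with covariates and carry forecast uncertainty.

```python
# Minimal deterministic SEIR sketch; parameters are illustrative, not calibrated.
import numpy as np
from scipy.integrate import odeint

def seir(state, t, beta, sigma, gamma, N):
    S, E, I, R = state
    dS = -beta * S * I / N              # new exposures
    dE = beta * S * I / N - sigma * E   # incubation: exposed become infectious
    dI = sigma * E - gamma * I          # infectious individuals recover (or die)
    dR = gamma * I
    return dS, dE, dI, dR

N = 1_000_000
beta, sigma, gamma = 0.5, 1 / 5.2, 1 / 10   # transmission, incubation, recovery rates (assumed)
y0 = (N - 10, 0, 10, 0)                     # start with 10 infectious individuals
t = np.linspace(0, 180, 181)                # days

S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T
print(f"peak infectious prevalence: {I.max():.0f} on day {int(t[I.argmax()])}")
```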
Secondary effects of COVID-19 on public health and well-being
The COVID-19 pandemic has resulted in unprecedented disruption to the healthcare system. In addition to understanding the direct health impacts of the disease, there is a public health need to understand the secondary effects of COVID-19-related healthcare disruption on access to and timeliness of care for other urgent conditions, and the resultant effects on health outcomes. Prioritizing healthcare resources for COVID-19 patients and efforts to depopulate healthcare settings in order to reduce healthcare-related disease transmission have resulted in reduced access to care for patients across the spectrum of clinical need and severity, including delayed access to surgery for cancer patients, organ transplant recipients, and others with time-sensitive conditions. Public health informatics can play an important role in informing our understanding of how the effects of healthcare disruption propagate across a community, affecting access to care and population health. Answering questions about the effect of healthcare disruption on population health requires three components: (1) access to data on healthcare utilization and outcomes, (2) data on the timing and types of public health and hospital-level interventions, and (3) causal inference methodologies that support our ability to draw conclusions about the causal effects of these interventions. Data on healthcare utilization and outcomes can be obtained from a variety of sources, including individual and multi-institutional EHR data and claims databases. Data on public health interventions are already being compiled by researchers, including national and international databases of policy changes (https://is.gd/CQs6th, https://is.gd/LvvUiz, https://is.gd/mlCu2I). Finally, disentangling the causal impacts of COVID-19 itself; interventions at the local, state, and federal level; and interventions and innovation at the individual health system level requires the rigorous implementation of study designs and analytic methods for causal inference. A number of techniques in common use in health services and econometrics research can be harnessed for this purpose, including interrupted time series and difference-in-differences designs [33].
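As a minimal sketch of the difference-in-differences idea, the following example estimates the interaction term in an ordinary least squares model fit to fabricated utilization data; a real analysis would use the EHR, claims, and policy databases noted above and would examine the parallel-trends assumption.

```python
# Sketch of a two-group, two-period difference-in-differences estimate using an
# interaction term in OLS. The weekly utilization data below are fabricated.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    # weekly surgical volume (invented) for an exposed region and a comparison region,
    # before and after a COVID-19-related policy change
    "volume":  [100, 98, 102, 55, 52, 58,   95, 97, 96, 90, 92, 91],
    "treated": [1, 1, 1, 1, 1, 1,            0, 0, 0, 0, 0, 0],
    "post":    [0, 0, 0, 1, 1, 1,            0, 0, 0, 1, 1, 1],
})

# The coefficient on treated:post is the difference-in-differences estimate of the
# policy's effect on utilization, under the parallel-trends assumption.
model = smf.ols("volume ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```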
Real-time monitoring via social media
The total number of users of social media continues to grow worldwide, resulting in the generation of vast amounts of data. Popular social networking sites such as Facebook, Twitter, and Instagram dominate this sphere. About 500 million tweets and 4.3 billion Facebook messages are posted every day (https://www.gwava.com/blog/internet-data-created-daily). A Pew Research Report (http://www.pewinternet.org/fact-sheet/social-media/) states that nearly half of adults worldwide and two-thirds of all American adults (65%) use social networking. The report states that of the total users, 26% have discussed health information, and, of those, 30% changed behavior based on this information and 42% discussed current medical conditions. Advances in automated data processing, machine learning, and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, provided researchers address the methodological challenges unique to this medium. When events such as the COVID-19 pandemic sweep the world, the public turns to social media. While there is a general belief that most of the content is not useful, adequate collection, filtering, and analysis could reveal potentially useful information for assessing public sentiment. Furthermore, given the delay and shortage of available testing in the United States, social media could provide a near real-time monitoring capability (e.g., the Penn COVID-19 U.S. Twitter map, https://is.gd/L58ggA), giving insights into the true burden of disease. Preliminary work in this direction is under review; the archived version of the paper, with a training dataset and annotation guidelines as supplementary material, is available [34].
Although social media text mining research for health applications is still in its infancy, the domain has seen a surge of interest in recent years. Numerous studies have recently been published in this realm, including studies on pharmacovigilance [35], identifying user behavioral patterns [36], identifying user social circles with common experiences (such as drug abuse) [37], monitoring malpractice [38], and tracking infectious/viral disease spread [39, 40]. Population and public health topics are the most commonly addressed, although different social networks may be suitable for specific targeted tasks. For example, while Twitter data has been utilized for surveillance and content analysis, a significant portion of research using Facebook has focused on communication rather than lexical content processing [41, 42]. For health monitoring and surveillance research from social media, the most common topic has been influenza surveillance [43, 44]. From the perspective of informatics and NLP, proposed techniques have typically been in the areas of data collection (e.g., keywords and queries) [45, 46], text classification [47, 48], and information extraction [49]. While innovative approaches have been proposed, there is still much progress to be made in this domain.
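A small sketch of two of these technique areas, applied to invented example posts, is shown below: keyword-based collection filtering followed by simple pattern-based extraction of self-reported symptom mentions. Production systems would rely on far richer NLP models and on data collected through platform APIs.

```python
# Sketch: keyword-based collection filtering and simple pattern-based extraction of
# self-reported symptom mentions, applied to invented example posts.
import re

posts = [
    "day 3 of fever and a dry cough, waiting on my covid test results",
    "grocery stores are out of flour again",
    "lost my sense of smell this week, anyone else?",
]

collection_keywords = ("covid", "fever", "cough", "loss of smell", "sense of smell")
symptom_pattern = re.compile(r"\b(fever|dry cough|cough|shortness of breath|sense of smell)\b", re.I)

for post in posts:
    if not any(k in post.lower() for k in collection_keywords):
        continue                     # filtered out at the collection step
    symptoms = symptom_pattern.findall(post)
    print(post, "->", symptoms or "no symptom mention extracted")
```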
Effective utilization of the health-related knowledge contained in social media will require a joint effort by the research community, bringing together researchers from distinct fields including NLP, machine learning, data science, biomedical informatics, medicine, pharmacology, and public health. The knowledge gaps among researchers in these communities need to be reduced through community sharing of data and the development of novel applied systems.