

What is the major role of data scientists in this pandemic?

Data science can already provide ongoing, accurate estimates of health system demand, which is a requirement in almost all reopening plans. We need to go beyond that to a dynamic approach of data collection, analysis, and forecasting to inform policy decisions and optimize public health recommendations for re-opening. While most reopening plans propose extensive testing, contact tracing, and monitoring of population mobility, almost none consider setting up such a dynamic feedback loop. Having such feedback could determine what level of virus activity can be tolerated in an area, given regional health system capacity, and adjust population distancing accordingly.

By using existing technology and some data science, it is possible to set up that feedback loop, which would keep healthcare demand under the threshold of what is available in a region. This is an opportunity for the data and tech community to partner with healthcare experts and provide a level of public health planning that governments are unable to provide on their own. The question, therefore, is: How can data science help forecast regional health system resource needs given measurements of virus activity and suppression measures such as population distancing?

For the data science effort to work, first and foremost, we need to fix the delays in data collection and access introduced by existing reporting processes. Currently, most departments of public health collect and report metrics that are not helpful, and report them with 48-hour delays and often with errors. Because of time lags in confirming and reporting cases, and a failure to distinguish between current and cumulative hospitalizations, even regions that report hospitalization data often provide only a blurry picture of the burden on the regional health system. Regions should ideally report both suspected and confirmed hospital cases and indicate the date of admission, in addition to the date of report or confirmation.

Even with perfect reporting, there are fundamental delays in what such data can tell us. For example, new admissions to a hospital today reflect virus activity as of 9 to 13 days ago. Failing to factor in such delays has led to significant over-estimation of hospitalization needs nationwide. We therefore need to measure virus activity via proxy measures that are indicative early in the lifecycle of the virus. We must benchmark these against the number of new and total COVID-19 hospitalizations and, ideally, against the number of new infections, assuming it is accurately measured through large-scale testing. Available proxy measures include test positivity rates in health systems, case counts, deaths, and perhaps seropositivity rates. Ongoing symptom tracking via smartphone apps, daily web or phone surveys, or cough sounds can identify potential hotspots where virus transmission rates are high. Contact tracing, which currently requires significant human effort, can also help track potential cases if it can be scaled using technology under development by major American tech companies.
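
As a rough illustration of benchmarking a proxy measure against admissions, the sketch below scans candidate lags and picks the one at which a proxy series correlates best with later hospital admissions. The function names, the 21-day search window, and the synthetic series are assumptions made for this example, not anything prescribed above; real data would also need smoothing for day-of-week effects and reporting errors.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(proxy, admissions, max_lag=21):
    """Find the lag (in days) at which proxy[t] best correlates with admissions[t + lag].

    Returns (lag, scores) where scores maps each tested lag to its correlation.
    """
    if len(proxy) != len(admissions):
        raise ValueError("series must cover the same days")
    scores = {}
    for lag in range(max_lag + 1):
        xs = proxy[: len(proxy) - lag]   # proxy on day t
        ys = admissions[lag:]            # admissions on day t + lag
        if len(xs) >= 3:                 # too few points gives a meaningless score
            scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Synthetic example (assumed data): admissions echo the proxy 11 days later.
def wave(t):
    return 100 * math.exp(-((t - 20) / 6) ** 2)


proxy = [wave(t) for t in range(60)]
admissions = [0.1 * wave(t - 11) for t in range(60)]
lag, scores = best_lag(proxy, admissions)
print("estimated lag in days:", lag)
```

A high correlation at some lag only says the proxy moves ahead of admissions in this sample; it does not by itself establish that the proxy is a reliable early signal.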

With reliable tracking and benchmarking in place, we can calculate infection prevalence as well as daily growth and transmission rates, which are essential for determining whether policies are working. This is a problem not only of data collection but also of data analysis. Issues of sensitivity, daily variability, time lags, and confounding need to be studied before such data can be used reliably.
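
To make the daily growth rate concrete, here is a minimal sketch that fits a straight line to the logarithm of daily counts and converts the slope into a doubling time. It assumes roughly exponential growth over the chosen window and strictly positive counts; the function names are hypothetical, and it does not estimate a transmission rate, which would need additional assumptions about the interval between successive infections.

```python
import math


def daily_growth_rate(counts):
    """Least-squares slope of log(counts) against the day index.

    A positive value means exponential growth, a negative value means decline.
    """
    n = len(counts)
    if n < 2 or any(c <= 0 for c in counts):
        raise ValueError("need at least two strictly positive counts")
    logs = [math.log(c) for c in counts]
    mean_t = (n - 1) / 2
    mean_y = sum(logs) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(logs))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def doubling_time(rate):
    """Days for counts to double at a given daily growth rate (infinite if not growing)."""
    if rate <= 0:
        return math.inf
    return math.log(2) / rate


# Synthetic example (assumed data): counts that double every 5 days.
counts = [100 * 2 ** (t / 5) for t in range(14)]
rate = daily_growth_rate(counts)
print("growth rate per day:", rate, "doubling time in days:", doubling_time(rate))
```

Fitting over a sliding window of a week or two, rather than over all available days, is one way to see whether the rate changes after a policy takes effect.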

 We then need to estimate the regional effects of policy interventions such as shelter-in-place orders (via mobility reduction) and contact tracing (via reductions in new cases), first as simple forecasts and eventually maturing to what-if analyses.

 Once the ability to project from mobility to transmission to health system burden is constructed, we can “close the loop” by predicting how much mobility we can afford given measured virus activity and anticipated health system resources in the next two weeks. Researchers have already attempted to calculate “tolerable transmission” in the form of maximum infection prevalence in a given geography that would not overload health systems. Coupling such tolerable transmission estimates with daily assessments of a valid sample of the population (via testing, via daily surveys, via electronic health record-based surveillance) would allow monitoring of changes in transmission which can alert us to the need to intervene, such as by reducing mobility. As new measures such as contact tracing cut transmission rates, these same monitoring systems can tell us that it is safe to increase mobility further. Continuously analyzing current mobility as well as virus activity and projected health system capacity can allow us to set up “keep the distance” alerts that trade off tolerable transmission against allowed mobility. Doing so will allow us to intelligently balance public health and economic needs in real time.
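
The decision step of such a loop might look like the sketch below: compare projected hospital demand with anticipated capacity and recommend tightening, holding, or relaxing mobility. Everything specific here is an assumption for illustration, including the `RegionStatus` fields, the 90% and 60% utilization thresholds, the step size, and the mobility floor; real thresholds would have to come from healthcare experts and regional data.

```python
from dataclasses import dataclass


@dataclass
class RegionStatus:
    projected_beds_needed: float  # forecast demand roughly two weeks out
    beds_available: float         # anticipated capacity over the same horizon
    mobility_index: float         # current mobility, 1.0 = pre-pandemic baseline


def recommend_mobility(status, tighten_above=0.9, relax_below=0.6,
                       step=0.1, floor=0.2):
    """Return (action, target_mobility) from projected utilization.

    The gap between the two thresholds is a dead band that avoids
    flip-flopping between tightening and relaxing on small changes.
    """
    if status.beds_available <= 0:
        raise ValueError("beds_available must be positive")
    utilization = status.projected_beds_needed / status.beds_available
    if utilization > tighten_above:
        return "tighten", round(max(floor, status.mobility_index - step), 2)
    if utilization < relax_below:
        return "relax", round(min(1.0, status.mobility_index + step), 2)
    return "hold", round(status.mobility_index, 2)


# Hypothetical region: demand forecast close to capacity.
action, target = recommend_mobility(RegionStatus(950, 1000, 0.7))
print(action, target)
```

The rule is deliberately simple. In practice the projected demand fed into it would come from the mobility-to-transmission-to-burden projection described above, with its uncertainty taken into account.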

Concretely, then, the crucial “data science” task is to learn the counterfactual function that links last week’s population mobility and today’s transmission rates to hospital demand two weeks later. It is unclear how many days of data from each proxy measurement we need to reliably learn such a function, what mathematical form this function might take, and how we do this correctly with the observational data on hand while avoiding the trap of mere function-fitting. However, this is the data science problem that needs to be tackled as a priority.
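
The lag structure of that learning problem can be made concrete with a small data-alignment sketch. It only pairs each day's inputs (mobility 7 days earlier, transmission today) with the target (hospital demand 14 days later); it deliberately fits no model, since the form of the function and how to learn it causally are the open questions. The function name and the exact 7-day and 14-day offsets are assumptions read off the wording above.

```python
def build_lagged_rows(mobility, transmission, demand,
                      mobility_lag=7, horizon=14):
    """Align three daily series into (mobility[t - lag], transmission[t], demand[t + horizon]) rows.

    Days without a full set of values (the first `mobility_lag` days and
    the last `horizon` days) are dropped.
    """
    if not (len(mobility) == len(transmission) == len(demand)):
        raise ValueError("series must cover the same days")
    rows = []
    for t in range(mobility_lag, len(demand) - horizon):
        rows.append((mobility[t - mobility_lag], transmission[t], demand[t + horizon]))
    return rows


# Placeholder series (assumed data) chosen so the alignment is easy to check by eye.
days = 30
mobility = list(range(days))
transmission = [10 * d for d in range(days)]
demand = [100 * d for d in range(days)]
rows = build_lagged_rows(mobility, transmission, demand)
print(len(rows), "usable rows; first row:", rows[0])
```

Note how quickly the usable sample shrinks: 30 days of data leave only 9 aligned rows, which is one reason the question of how many days of data are needed matters.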

Adopting such technology and data science to keep anticipated healthcare needs under the threshold of availability in a region requires multiple privacy trade-offs, which will require thoughtful legislation so that the solutions invented for enduring the current pandemic do not lead to a loss of privacy in perpetuity. However, given the immense economic toll of the shutdown, as well as its hidden medical toll, we urgently need to construct an early warning system that tells us to enhance suppression measures if the next COVID-19 outbreak peak might overwhelm our regional healthcare system. It is imperative that we focus our attention on using data science to anticipate, and manage, regional health system resource needs based on local measurements of virus activity and the effects of population distancing.

    3AI is a not-for-profit association founded with an endeavor of creating the largest community of AI & Analytics professionals