Identifying Locations with Possible Undetected Imported Severe Acute Respiratory Syndrome Coronavirus 2 Cases by Using Importation Predictions


Cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection exported from mainland China could lead to self-sustained outbreaks in other countries. By February 2020, several countries were reporting imported SARS-CoV-2 cases. To contain the virus, early detection of imported SARS-CoV-2 cases is critical. We used air travel volume estimates from Wuhan, China, to international destinations and a generalized linear regression model to identify locations that could have undetected imported cases. Our model can be adjusted to account for exportation of cases from other locations as the virus spreads and more information on importations and transmission becomes available. Early detection and appropriate control measures can reduce the risk for transmission in all locations.

A novel coronavirus, later named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was identified in December 2019 in the city of Wuhan, capital of Hubei Province, China, where cases were first confirmed (1). During December 2019–February 2020, the number of confirmed cases increased drastically. Model estimates suggested that >75,000 persons were infected by January 25, 2020, and the epidemic had a doubling time of ≈6 days (2). By the end of January 2020, travel restrictions were implemented for Wuhan and neighboring cities. Nonetheless, the virus spread from Wuhan to other cities in China and outside the country. By February 4, 2020, a total of 23 locations outside mainland China reported cases, 22 of which reported imported cases; Spain reported a case caused by secondary transmission (3).

Most cases imported to other locations have been linked to recent travel history from China (3), suggesting that air travel plays a major role in exportation of cases to locations outside of China. To prevent other cities and countries from becoming epicenters of the SARS-CoV-2 epidemic, substantial targeted public health interventions are required to detect cases and control local spread of the virus. We collected estimates of air travel volume from Wuhan to 194 international destinations. We then identified 49 countries that had a score of >49.2/100 on category 2, Early Detection and Reporting of Epidemics of Potential International Concern, of the Global Health Security (GHS) Index (4). We assumed these locations would be proficient at detecting SARS-CoV-2 and reporting confirmed imported cases, which we refer to as imported-and-reported cases. We ran a generalized linear regression model on this subset; based on the results, we generated predictions for the remainder of the sample. Using these predictions, we identified locations that might not be detecting imported cases.


To identify locations reporting fewer than predicted imported SARS-CoV-2 infected cases, we fit a model to data from 49 locations outside mainland China with high surveillance capacity according to the GHS Index (4). Among these, 17 had high travel connectivity to Wuhan and 32 had low connectivity to Wuhan. We treated locations as countries without taking any position on territorial claims. We performed a Poisson regression by using the cumulative number of imported-and-reported SARS-CoV-2 cases in these 49 countries and the estimated number of daily airline passengers from the Wuhan airport. We then compared predictions from this model with imported-and-reported cases across 194 locations from the GHS Index, excluding China as the epicenter of the outbreak.

The model requires data on imported-and-reported cases of SARS-CoV-2 infection, daily air travel volume, and surveillance capacity. We obtained data on imported-and-reported cases aggregated by destination from the World Health Organization technical report issued February 4, 2020 (3). We assumed a case count of 0 for locations not listed. We used February 4 as the cutoff for cumulative imported-and-reported case counts because exported cases from Hubei Province dropped rapidly after this date (3), likely because of travel restrictions for the province implemented on January 23. We defined imported-and-reported cases as those with known travel history from China; of those, 83% had a travel history from Hubei Province and 17% traveled from unknown locations in China (3). We excluded reported cases likely caused by transmission outside of China or cases in which the transmission source was still under investigation (3). In addition, we excluded Hong Kong, Macau, and Taiwan from our model because locally transmitted and imported cases were not disaggregated in these locations.

We obtained data on daily air travel from a network-based modeling study (S. Lai et al., unpub. data) that reported monthly air travel volume estimates for the 27 locations outside mainland China that are most connected to Wuhan. These estimates were calculated from International Air Transport Association data from February 2018, which include direct and indirect flight itineraries from Wuhan. For these 27 locations, estimated air travel volumes are >6 passengers/day. We assumed that travel volumes for locations not among the most connected are censored by a detection limit. Following a common method for handling censored data in environmental sampling (5) and metabolomics (6), we set the daily air travel volume for these locations to half the minimum previously reported. Therefore, we used 3 passengers/day for estimated travel volumes for the 167 locations from the GHS Index not listed by Lai et al. We tested the robustness of our results by using a set of alternative values of 0.1, 1, and 6 passengers/day for the censored data.
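The half-minimum substitution described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R code: the location labels, the passenger volumes, and the helper name `impute_censored` are hypothetical placeholders, chosen so that the smallest listed volume is 6 passengers/day.

```python
# Hypothetical daily passenger estimates for "listed" locations.
# These are placeholder values, not the estimates from Lai et al.
listed_volume = {"A": 120.0, "B": 45.0, "C": 6.0}

# Full set of locations to analyze; "D" and "E" are not listed (censored).
all_locations = ["A", "B", "C", "D", "E"]


def impute_censored(listed, locations, fraction=0.5):
    """Give unlisted locations a fixed fraction of the smallest listed volume."""
    detection_limit = min(listed.values())
    fill_value = fraction * detection_limit
    return {loc: listed.get(loc, fill_value) for loc in locations}


volumes = impute_censored(listed_volume, all_locations)
print(volumes)
```

With a smallest listed volume of 6 passengers/day, the unlisted locations receive 3 passengers/day, which matches the substitution value stated in the text.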

We defined high surveillance locations as those with a GHS Index score for category 2 above the 75th quantile. We tabulated high and low surveillance locations by whether they had 0 imported-and-reported cases or case counts >1 (Table).
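As a rough illustration of the quantile rule, the sketch below classifies locations by whether their category 2 score exceeds the 75th percentile. The scores and the function name `high_surveillance` are hypothetical, and the interpolation rule used for the percentile (Python's "inclusive" method) is an assumption; the source does not say which quantile definition was used.

```python
import statistics

# Hypothetical category 2 scores (0-100); not actual GHS Index values.
scores = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0, "E": 50.0}


def high_surveillance(scores, percentile=75):
    """Return the cutoff and the locations scoring strictly above it."""
    # statistics.quantiles with n=100 returns the 99 percentile cut points.
    cuts = statistics.quantiles(scores.values(), n=100, method="inclusive")
    threshold = cuts[percentile - 1]
    selected = {loc for loc, score in scores.items() if score > threshold}
    return threshold, selected


threshold, high = high_surveillance(scores)
print(threshold, sorted(high))
```

Changing `percentile` to 50 or 95 gives the more lenient and more stringent criteria used in the robustness analyses.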

For our model, we assumed that the cumulative imported-and-reported case counts across 49 high surveillance locations follow a Poisson distribution from the beginning of the epidemic until February 4, 2020. Then the expected case count is linearly proportional to the daily air travel volume in the following formula:

$$C_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i = \beta x_i$$

where $i$ denotes location, $C_i$ denotes the imported-and-reported case count in a location, $\lambda_i$ denotes the expected case count in a location, $\beta$ denotes the regression coefficient, and $x_i$ denotes the daily air travel volume of a location. The Poisson model assumes cases are independent and that the variance is equal to the expected case count. Imported-and-reported cases likely meet the independence assumption because the value excludes cases with local transmission. We also checked the robustness of our results by using an overdispersed model with a negative binomial likelihood. We computed the p value of the overdispersion parameter as shown in Gelman and Hill (7).
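For a Poisson model whose mean is proportional to travel volume with no intercept, the maximum likelihood estimate has a closed form: setting the derivative of the log-likelihood, sum(C_i)/β − sum(x_i), to zero gives β = sum(C_i)/sum(x_i). The Python sketch below applies that formula to made-up data. It is not the authors' R code, and the function names and numbers are hypothetical. The Pearson dispersion ratio is included only as a simple check of the variance-equals-mean assumption, not as the exact test from reference 7.

```python
def fit_beta(cases, volumes):
    """Closed-form MLE for C_i ~ Poisson(beta * x_i): sum(C) / sum(x)."""
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("total travel volume must be positive")
    return sum(cases) / total_volume


def expected_cases(beta, volume):
    """Expected imported-and-reported case count at a given travel volume."""
    return beta * volume


def pearson_dispersion(cases, volumes, beta):
    """Sum of squared Pearson residuals divided by (n - 1 fitted parameter).

    Values well above 1 would suggest overdispersion relative to Poisson.
    Assumes every travel volume is positive.
    """
    total = 0.0
    for c, x in zip(cases, volumes):
        lam = expected_cases(beta, x)
        total += (c - lam) ** 2 / lam
    return total / (len(cases) - 1)


# Hypothetical data for four high surveillance locations (not from the study).
cases = [0, 1, 4, 10]
volumes = [3.0, 30.0, 100.0, 367.0]

beta_hat = fit_beta(cases, volumes)
dispersion = pearson_dispersion(cases, volumes, beta_hat)
print(beta_hat, dispersion)
```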

Figure 1. Regression plot of locations with possible undetected imported cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by air travel volume from Wuhan, China. Air travel volume measured in number of persons/day. No. cases refers to possible undetected imported SARS-CoV-2 cases. Solid line indicates the expected imported-and-reported case counts for locations. Dashed lines represent 95% prediction interval bounds smoothed for all locations.

We used R version 3.6.1 (https://www.r-project.org) to compute $\hat{\beta}$, the maximum likelihood estimate of β, and the expected imported-and-reported case count given high surveillance (Figure 1). We also computed the 95% prediction interval (PI) bounds under this model of high surveillance for all 194 values of daily air travel volume (Figure 1). First, we generated a bootstrapped dataset by sampling n locations with replacement among high surveillance locations. Then, we reestimated β by using the bootstrapped dataset. Finally, we simulated imported-and-reported case counts for all 194 locations under our model by using the estimate of β from the bootstrapped dataset. We repeated the 3 steps 50,000 times to generate 50,000 simulated imported-and-reported case counts for each of the locations, from which we computed the lower and upper PI bounds (PI 2.5%–97.5%). We smoothed the 95% PI bounds by using ggplot2 in R (8). We fit the imported-and-reported case counts of the 49 high surveillance locations to the model and plotted these alongside 145 locations with low surveillance capacity (Figure 1). We noted some overlap between high and low surveillance locations (Figure 1).
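The 3-step bootstrap procedure can be sketched as follows, in Python rather than the authors' R. The data, function names, seed, and simple percentile indexing are all assumptions made for illustration. The default number of replicates is kept small so the example runs quickly, whereas the text reports 50,000. The Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import math
import random


def fit_beta(cases, volumes):
    """Closed-form MLE for a no-intercept, identity-link Poisson model."""
    return sum(cases) / sum(volumes)


def poisson_draw(rng, lam):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def prediction_interval(cases, volumes, new_volume, n_boot=2000, seed=0):
    """Bootstrap 95% prediction interval for the count at new_volume."""
    rng = random.Random(seed)
    n = len(cases)
    draws = []
    for _ in range(n_boot):
        # Step 1: resample n high surveillance locations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: re-estimate beta on the bootstrapped dataset.
        beta = fit_beta([cases[i] for i in idx], [volumes[i] for i in idx])
        # Step 3: simulate a case count under the re-estimated model.
        draws.append(poisson_draw(rng, beta * new_volume))
    draws.sort()
    lower = draws[int(0.025 * (n_boot - 1))]
    upper = draws[int(0.975 * (n_boot - 1))]
    return lower, upper


# Hypothetical high surveillance data (not from the study).
cases = [0, 1, 4, 10]
volumes = [3.0, 30.0, 100.0, 367.0]

print(prediction_interval(cases, volumes, new_volume=100.0))
```

Because each replicate combines a re-estimated β with a fresh Poisson draw, the resulting interval reflects both uncertainty in the coefficient and sampling variability in the counts.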

Figure 2. Analyses of imported-and-reported cases and daily air travel volume using a model to predict locations with potentially undetected cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Air travel volume measured in number of persons/day. No. cases refers to possible undetected imported SARS-CoV-2 cases. Solid line shows the expected imported-and-reported case counts based on our model fitted to high surveillance locations, indicated by purple dots.

To assess the robustness of our results, we ran 8 additional regression analyses by implementing a series of changes to the analysis. The changes included the following: set the daily air travel volume to 0.1, 1, or 6 passengers/day for locations not listed by Lai et al. (unpub. data) (Figure 2, panels A–C); removed all locations not listed by Lai et al. before fitting (Figure 2, panel D); defined high surveillance locations by using a more lenient GHS Index criterion, 50th quantile (Figure 2, panel E), and a more stringent criterion, 95th quantile (Figure 2, panel F); excluded Thailand from the model because it is a high-leverage point (Figure 2, panel G); or replaced the Poisson likelihood with an overdispersed negative binomial likelihood (Figure 2, panel H). We provide code for these analyses on GitHub.
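One of these checks, varying the value substituted for censored travel volumes, can be illustrated with a short Python loop. The records and the helper `refit_with_fill` are hypothetical, and the sketch covers only the substitution-value sensitivity (panels A–C plus the baseline of 3 passengers/day), not the other robustness analyses.

```python
def fit_beta(cases, volumes):
    """Closed-form MLE for a no-intercept, identity-link Poisson model."""
    return sum(cases) / sum(volumes)


# Hypothetical records: (location, case count, listed volume or None if censored).
records = [
    ("A", 10, 300.0),
    ("B", 4, 120.0),
    ("C", 1, 40.0),
    ("D", 0, None),
    ("E", 0, None),
]


def refit_with_fill(records, fill_value):
    """Refit beta after substituting fill_value for censored volumes."""
    cases = [count for _, count, _ in records]
    volumes = [fill_value if vol is None else vol for _, _, vol in records]
    return fit_beta(cases, volumes)


# Baseline (3.0) plus the alternative values named in the text.
sensitivity = {fill: refit_with_fill(records, fill) for fill in (0.1, 1.0, 3.0, 6.0)}
for fill, beta in sensitivity.items():
    print(f"fill={fill:>4} passengers/day -> beta={beta:.4f}")
```

In this toy dataset the estimate of β changes little across substitution values because the censored locations contribute only a small share of total travel volume.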


We found that daily air travel volume positively correlates with imported-and-reported case counts of SARS-CoV-2 infection among high surveillance locations (Figure 1). We noted that increasing flight volume by 31 passengers/day is associated with 1 additional expected imported-and-reported case. In addition, Singapore and India lie above the 95% PI in our model; Singapore had 12 more imported-and-reported cases (95% PI 6–17 cases) than expected and India had 3 (95% PI 1–3 cases) more than expected. Thailand has a relatively high air travel volume compared with other locations, but it lies below the 95% PI, reporting 16 (95% PI 1–40 cases) fewer imported-and-reported cases than expected under the model. Indonesia lies below the PI and has no imported-and-reported cases, but the expected case count is 5 (95% PI 1–10 cases) in our model. Across all 8 robustness regression analyses, we consistently observed that Singapore lies above the 95% PI and Thailand and Indonesia lie below (Figure 2). India remains above the 95% PI in all robustness analyses except when we used the more stringent GHS Index, 95th quantile, for fitting; then India lies on the upper bound of the 95% PI (Figure 2, panel F).
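The reported slope and the above/below classification can be made concrete with a small Python sketch. The figure of 31 passengers/day per additional expected case comes from the text and implies β of roughly 1/31. The interval bounds passed to `classify` below are hypothetical inputs, not values from the study, and treating a count that sits exactly on a bound as "within" is an assumption consistent with how the text describes India in panel F.

```python
# From the text: about 31 additional passengers/day per additional expected case.
PASSENGERS_PER_CASE = 31.0


def expected_from_slope(volume, passengers_per_case=PASSENGERS_PER_CASE):
    """Approximate expected imported-and-reported cases at a travel volume."""
    return volume / passengers_per_case


def classify(observed, pi_lower, pi_upper):
    """Label an observed count relative to a 95% prediction interval."""
    if observed > pi_upper:
        return "above"   # more cases reported than the model predicts
    if observed < pi_lower:
        return "below"   # possible undetected imported cases
    return "within"      # includes counts exactly on a bound


# A location at the censored-data value of 3 passengers/day.
print(round(expected_from_slope(3.0), 3))

# Hypothetical interval bounds used only to exercise the classifier.
print(classify(observed=0, pi_lower=1, pi_upper=10))
print(classify(observed=12, pi_lower=1, pi_upper=10))
```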


We aimed to identify locations with likely undetected or underdetected imported cases of SARS-CoV-2 by fitting a model to the case counts in locations with high surveillance capacity and Wuhan-to-location air travel volumes. Our model can be adjusted to account for exportation of cases from locations other than Wuhan as the outbreak develops and more information on importations and self-sustained transmission becomes available. One key advantage of this model is that it does not rely on estimates of incidence or prevalence in the epicenter of the outbreak. Also, we intentionally used a simple generalized linear model. The linearity of the expected case count means that we have only 1 regression coefficient in the model and no extra parameters. The Poisson likelihood then captures the many 0-counts observed for less highly connected locations but also describes the slope between case-count and flight data among more connected locations. We believe this model provides the most parsimonious phenomenologic description of the data.

According to our model, locations above the 95% PI of imported-and-reported cases could have higher case-detection capacity. Locations below the 95% PI might have undetected cases because they report fewer cases than expected under high surveillance. Underdetection of cases could increase the international spread of the outbreak because the transmission chain could be lost, reducing opportunities to deploy case-based control strategies. We recommend rapid strengthening of outbreak surveillance and control efforts in locations below the 95% PI lower bound, particularly Indonesia, to curb potential local transmission. Early detection of cases and implementation of appropriate control measures can reduce the risk for self-sustained transmission in all locations.

Dr. De Salazar is a research fellow at Harvard T.H. Chan School of Public Health, working on multiscale statistical models of infectious diseases within host, population, and metapopulation models. His research interests include diagnostic laboratory methods and public health response.


We thank Pamela Martinez, Nicholas Jewel, and Stephen Kissler for valuable feedback.

This work was supported by US National Institute of General Medical Sciences (award no. U54GM088558). P.M.D. was supported by the Fellowship Foundation Ramon Areces. A.R.T. and C.O.B. were supported by a Maximizing Investigator’s Research Award (no. R35GM124715-02) from the US National Institute of General Medical Sciences.

The authors are solely responsible for this content and it does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.

Declaration of interests: Marc Lipsitch has received consulting fees from Merck. All other authors declare no competing interests.



  1. Zhou P, Yang XL, Wang XG, Hu B, Zhang L, Zhang W, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–3.
  2. Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet. 2020;395:689–97.
  3. World Health Organization. Coronavirus disease 2019 (COVID-19) situation report – 15, 4 Feb 2020 [cited 2020 Feb 14].
  4. Nuclear Threat Initiative and Johns Hopkins Center for Health Security. Global health security index [cited 2020 Feb 14]. https://www.ghsindex.org
  5. US Environmental Protection Agency. Data quality assessment: statistical methods for practitioners EPA QA/G9-S [cited 2020 Feb 14]. Washington: The Agency; 2006.
  6. Lamichhane S, Sen P, Dickens AM, Hyötyläinen T, Orešič M. An overview of metabolomics data analysis: current tools and future perspectives. In: Jaumot J, Bedia C, Tauler R, editors. Comprehensive analytical chemistry. Vol. 82. Amsterdam: Elsevier; 2018. p. 387–413.
  7. Gelman A, Hill J. Analytical methods for social research. In: Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2006. p. 235–236.
  8. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer; 2016.
