The health benefits of Internet data mining

Robin Wulffson, MD
drug side-effects, morbidity, mortality, hyperglycemia, paroxetine, pravastatin

Many Americans who are concerned about the side effects of a medication rely on information from the Food and Drug Administration (FDA); however, a new study has found a method that may detect those side effects earlier: mining Internet searches. Investigators affiliated with Microsoft Research (Redmond, Washington), Columbia University (New York, New York), and Stanford University (Stanford, California) published their results online on March 6 in the Journal of the American Medical Informatics Association.

The study authors noted that adverse drug events cause substantial morbidity and mortality; furthermore, these side effects are often discovered only after a drug comes to market. They theorized that Internet users may provide early clues about adverse drug events through their online information-seeking. Therefore, they conducted a large-scale study of Web search log data gathered during 2010. The study was based on data-mining techniques similar to those employed by services such as Google Flu Trends, which has been used to give the public early warning of the prevalence of influenza. The investigators paid particular attention to the specific drug pairing of paroxetine and pravastatin. Paroxetine is an antidepressant; pravastatin is a cholesterol-lowering drug. The researchers found evidence that the combination of the two drugs is associated with hyperglycemia (high blood sugar).

They also examined sets of drug pairs known to be associated with hyperglycemia and those not associated with hyperglycemia. The researchers noted that, using data derived from queries entered into the Google, Microsoft, and Yahoo search engines, they were able to detect evidence of unreported prescription drug side effects before they were found by the FDA’s warning system. They used automated software tools to examine queries by six million Internet users taken from Web search logs in 2010, looking for searches relating to paroxetine and pravastatin.

The FDA uses a system known as the Adverse Event Reporting System, which asks physicians to report drug side effects; however, its effectiveness is limited by the fact that data is generated only when a physician notices something and reports it. The study authors note that their approach was a refinement of work done by the laboratory of Russ B. Altman, the chairman of the Stanford bioengineering department. The group had explored whether it was possible to automate the process of discovering “drug-drug” interactions by using software to sift through the data found in FDA reports. In May 2011, the Stanford researchers reported that they were able to detect the interaction between paroxetine and pravastatin in this way. Their research determined that a patient’s risk of developing hyperglycemia was greater when taking both drugs than when taking either drug individually. This effect is known as synergism: two agents combined exert an effect much greater than either one separately.


Dr. Altman theorized that there might be a more immediate and more accurate way to obtain data similar to what the FDA collects. Therefore, he collaborated with Microsoft computer experts who created software for scanning anonymized data collected from a software toolbar installed in Web browsers by users who permitted their search histories to be collected. Anonymized data is information from which identifying details have been removed, so that it cannot be traced to an individual. The researchers were able to examine 82 million individual searches for drug, symptom, and condition information.

The investigators first identified individual searches for the terms paroxetine and pravastatin, as well as searches for both terms, in 2010. They then computed the likelihood that users in each group would also search for hyperglycemia as well as roughly 80 of its symptoms (i.e., words or phrases such as “high blood sugar,” “blurry vision,” or “dizziness”). They found that individuals who searched for both drugs during the 12-month period were significantly more likely to search for terms related to hyperglycemia than were those who searched for just one of the drugs (about 10% vs. 4-5% for just one pharmaceutical). They also found that individuals who searched for symptoms relating to both drugs tended to conduct those searches within a short time period: 30% did the search on the same day, 40% during the same week, and 50% during the same month.
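The grouping step described above can be illustrated with a minimal sketch. This is not the researchers' software; it only assumes a log of (user, query) pairs, a shortened and hypothetical list of symptom terms in place of the roughly 80 used in the study, and simple substring matching. The function name `group_rates` and the toy log are invented for illustration.

```python
from collections import defaultdict

# Hypothetical, shortened term list; the study used roughly 80 terms.
SYMPTOM_TERMS = {"hyperglycemia", "high blood sugar", "blurry vision", "dizziness"}
DRUG_A = "paroxetine"
DRUG_B = "pravastatin"


def group_rates(log):
    """Return, for each drug group, the fraction of users who also
    searched for at least one symptom term.

    `log` is an iterable of (user_id, query) pairs (an assumed format).
    """
    drugs_seen = defaultdict(set)  # user -> drugs that user searched for
    symptom_users = set()          # users who searched any symptom term

    for user, query in log:
        q = query.lower()
        if DRUG_A in q:
            drugs_seen[user].add(DRUG_A)
        if DRUG_B in q:
            drugs_seen[user].add(DRUG_B)
        if any(term in q for term in SYMPTOM_TERMS):
            symptom_users.add(user)

    totals = defaultdict(int)
    hits = defaultdict(int)
    # Only users who searched for at least one of the drugs are grouped.
    for user, seen in drugs_seen.items():
        group = "both" if len(seen) == 2 else next(iter(seen)) + " only"
        totals[group] += 1
        if user in symptom_users:
            hits[group] += 1

    return {group: hits[group] / totals[group] for group in totals}


# Toy data, invented for illustration only.
toy_log = [
    ("u1", "paroxetine side effects"),
    ("u1", "pravastatin dosage"),
    ("u1", "high blood sugar symptoms"),
    ("u2", "paroxetine"),
    ("u3", "pravastatin"),
    ("u3", "blurry vision"),
    ("u4", "pravastatin and alcohol"),
    ("u5", "dizziness"),  # searched no drug, so excluded from every group
]

rates = group_rates(toy_log)
for group in sorted(rates):
    print(group, rates[group])
```

In a comparison like the one the study reports, the rate for the "both" group would be set against the rates for the two single-drug groups. The toy numbers here carry no meaning beyond showing the calculation.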

The investigators explained that, compared with other sources such as electronic health records (EHR), search logs are inexpensive to collect and mine. They noted that their results demonstrate that logs of the search activities of populations of computer users can contribute to drug safety surveillance. They recommended that the FDA consider adding their technique to its current system for tracking adverse effects. They wrote: “There is a potential public health benefit in listening to such signals and integrating them with other sources of information.”

The researchers said that, going forward, they were considering the development of a method to add new sources of information, such as behavioral data and information from social media sources. They stressed that they would make every effort to ensure that their data mining preserved individual privacy.

The FDA is currently working to improve its data gathering. The agency has financed the Sentinel Initiative, which was launched in 2008 to assess the risks of pharmaceuticals already on the market. Eventually, that project plans to monitor drug use by as many as 100 million people in the nation. The system will be based on information collected by healthcare providers on a massive scale.

Reference: Journal of the American Medical Informatics Association