Neuroscience Information Framework

Slowly Maturing, Semantic Web Technologies Reach Pilot Project Stage in Pharma

March 5, 2010

Semantic technology appears to be slowly finding its way into pharma's informatics toolbox, as proponents of the technology reported at a recent conference that they are moving beyond prototypes and into pilot projects that put semantic technologies to work.
At last week's Conference on Semantics in Healthcare and Life Sciences in Cambridge, Mass., scientists from pharma, vendor firms, and academia discussed opportunities for semantic technologies in life science research, as well as barriers to widespread adoption and strategies for swaying skeptical colleagues.
Several pharma researchers in attendance told BioInform that they are shifting from a more abstract sell of semantic technologies to producing usable and beneficial pilot projects.

Unlike previous CSHALS meetings, where the presentations were more about smaller prototype projects, this year's meeting showed "serious projects," said Ted Slater, assistant director of systems and knowledge discovery at Boehringer Ingelheim.
Despite some progress, however, many speakers noted that they are still seeing opposition among upper management in pharmaceutical firms, as well as resistance from colleagues in IT who are struggling with tightening budgets and workforce cuts and are skeptical about the return on investment for semantic tools.
In one session, a show of hands made it clear that most pharma scientists face an "uphill battle" when proposing semantic web projects, yet many said that the power of the technology will make these battles worthwhile.
One pharma scientist, who did not wish to be identified, said that large firms don't change easily, which presents hurdles in development. In particular, IT departments tend to distrust semantic approaches because they appear to be a big departure from current tools.

While traditional relational databases are akin to different human languages, semantic technologies are "a schema-less way of saying things," Slater said. Expressing concepts through semantic approaches, in subject-predicate-object triples, lets scientists reason over concepts and pose complex queries, "allowing us to ask questions we couldn't ask" otherwise, said Michael McGlashen, executive director of knowledge management in worldwide licensing and external research at Merck.
"To me it seems that most pharma companies are using semantic web to some extent," Susie Stephens, director of biomedical informatics in J&J's Pharmaceutical R&D division, told BioInform. She noted that this interest is being driven by the trend toward translational research, which requires an awareness of many data sources and the ability to integrate, mine, and analyze data from different parts of the business and different industries.

Scientists applying semantic methods can save the time needed to buy a new server, design a database schema, fill it, and deploy it for every new dataset, Slater said. Once storage of subject-predicate-object triples is set up, adding a new dataset only requires loading an RDF file into that triple store. "You load it and you're done," Slater said. "Now your data integration problem is gone."
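
The "load it and you're done" idea can be illustrated with a minimal Python sketch. It assumes a toy in-memory store in which each fact is a (subject, predicate, object) tuple; the `ex:` identifiers, the `load` helper, and the `drugs_for_disease` query are all invented for illustration and do not come from any system described at the conference.

```python
# A toy in-memory triple store: each fact is a (subject, predicate, object) tuple.
# All identifiers below are hypothetical and exist only for this sketch.
store = set()

def load(triples):
    """Add a dataset to the store; no new schema or server is needed."""
    store.update(triples)

# Two independently produced datasets that happen to share an identifier.
drug_data = [
    ("ex:drugA", "ex:targets", "ex:proteinX"),
]
disease_data = [
    ("ex:proteinX", "ex:associatedWith", "ex:diseaseY"),
]

load(drug_data)
load(disease_data)

def drugs_for_disease(disease):
    """Follow targets -> associatedWith links that span both datasets."""
    proteins = {s for (s, p, o) in store
                if p == "ex:associatedWith" and o == disease}
    return sorted(s for (s, p, o) in store
                  if p == "ex:targets" and o in proteins)

print(drugs_for_disease("ex:diseaseY"))
```

The cross-dataset query only works here because both datasets use the same identifier for the protein, so in practice the ease of integration depends on sources agreeing on identifiers.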

Because semantic web technologies are standards-based, researchers can integrate disparate resources much faster and more cheaply than with standard development tools, said Lee Feigenbaum, vice president of technology and standards at semantic web consulting firm Cambridge Semantics.

Resource Description Framework, or RDF, is the data model used to represent information in the semantic web. RDF statements take the subject-predicate-object form of triples, which can be collected and searched as machine-readable graphs, with the components of a triple identified by Uniform Resource Identifiers (objects may also be literal values).
RDF can be written in several syntaxes, including RDF/XML, N-Triples, and N3. Turtle, or Terse RDF Triple Language, is frequently used by developers as a somewhat more user-friendly alternative to RDF/XML, Feigenbaum said. RDF graphs can be queried with the SPARQL query language. Many tools in this space are open source, such as D2R, a tool for mapping relational databases to RDF.
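
To make the triple and query ideas concrete, here is a small Python sketch. It is not a real RDF library: triples are plain tuples, the `http://example.org/` namespace and the `inhibits` and `name` predicates are assumptions made for this example, and the `match` helper only imitates a single SPARQL triple pattern.

```python
# Turtle form of the first two statements (for comparison only, not parsed here):
#   @prefix ex: <http://example.org/> .
#   ex:aspirin ex:inhibits ex:COX1 ;
#              ex:name "Aspirin" .

EX = "http://example.org/"  # hypothetical namespace for this sketch

triples = [
    (EX + "aspirin", EX + "inhibits", EX + "COX1"),
    (EX + "aspirin", EX + "name", "Aspirin"),  # the object here is a literal
    (EX + "ibuprofen", EX + "inhibits", EX + "COX1"),
]

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Roughly: SELECT ?drug WHERE { ?drug ex:inhibits ex:COX1 }
inhibitors = [t[0] for t in match(triples, p=EX + "inhibits", o=EX + "COX1")]
print(inhibitors)
```

A real SPARQL engine joins many such patterns and binds shared variables across them; this sketch shows only the single-pattern case.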

Jim Hendler, a computer scientist from Rensselaer Polytechnic Institute and former chief scientist of the information systems office at the US Defense Advanced Research Projects Agency, said in his talk that semantic technology is no longer "deep academic cogitation" about a future application but is "maturing" technology that can sit on top of web infrastructure, is extensible, and is oriented toward data-sharing.

Hendler cited signs of "commercial excitement" regarding semantic technology, including Microsoft's acquisition of semantic search engine Powerset for $100 million in 2008 and new semantic vendors such as Sandpiper, Intellidimension, Intellisophic, and Ontology Works that are emerging even though "this is not a friendly market for new companies."

Eric Neumann, director of the pharma consulting firm Clinical Semantics Group, highlighted that in many firms only a "small, small group" of people understand semantic approaches and enterprise IT groups remain skeptical. Hendler responded that perhaps not everyone in an IT group need be "deeply" involved in semantic projects, which, given the increasing availability of tools, can now be set up as a "show 'em process."

Living the Ecosystem

In his talk, Tim Schultz, senior analyst at J&J Pharmaceutical R&D IT systems engineering, presented the firm's semantic data pilot called knowIT, which has a focus on translational neuroscience and enables "novel translational queries" of available data.
J&J R&D has a "huge data ecosystem" that includes deep and shallow repositories and flat files on network shares, Schultz said. Stephens noted that when her team assessed J&J scientists' needs, they found that researchers "know about their favorite one or two data sources, but they didn't know the others."

The J&J team leveraged an existing semantic media wiki, used to catalog IT infrastructure at the company, and extended it to include metadata fields that describe data sources and can be accessed via SPARQL queries.

The wiki captures metadata about the data source, not the data itself, Stephens said. It includes five tabs that describe the data source: business owner, technical contact, licensing information, data source interface, and content, which is captured with keywords from the Neuroscience Information Framework, or NIF, ontology.

"If there is an RDF representation of the data available, it includes the URL for the SPARQL endpoint," she said.

Stephens and her team are adding more data that has an RDF representation. "That's what makes it easier for scientists to do a quick dig-down into the data," she said.
One data source is from ADNI, the Alzheimer's Disease Neuroimaging Initiative, a consortium that J&J sponsors. The data relates to MRI scans that have been analyzed with a variety of tools, Stephens explained. "There are different tools to measure the same thing, so that does lead to the data becoming complex."

The team has integrated the system into its in-house R&D informatics platform 3DX, which has analysis and visualization capabilities. "We've extended 3DX to speak SPARQL, so then you can issue a query from 3DX that goes to knowIT" to show information about data sources within 3DX, she said.
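
As a rough idea of what issuing a SPARQL query to an endpoint can look like, the Python sketch below builds (but does not send) an HTTP GET request in the style of the SPARQL protocol, where the query text travels in a `query` parameter. The endpoint URL, the `ex:` vocabulary, and the `build_request` helper are hypothetical; nothing here reflects how 3DX or knowIT is actually implemented.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint; the article does not give knowIT's address.
ENDPOINT = "http://knowit.example.org/sparql"

# Hypothetical catalog vocabulary: find data sources tagged with a keyword
# and return the SPARQL endpoint recorded for each one.
QUERY = """
PREFIX ex: <http://example.org/catalog#>
SELECT ?source ?endpoint
WHERE {
  ?source ex:keyword "hippocampus" ;
          ex:sparqlEndpoint ?endpoint .
}
"""

def build_request(endpoint, query):
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_request(ENDPOINT, QUERY)
print(req.get_method())
```

Passing `req` to `urllib.request.urlopen` would send the query, which is left out here because the endpoint is made up.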

The team used the open source tool D2R to map ADNI data captured in relational databases to RDF, which took about a month. Mapping can be straightforward, as with resources like DrugBank, which includes drug names and attributes, she said. "But, when you are looking at an experimental dataset, the mapping becomes more complicated." For example, a particular data set might involve an MRI on a particular date, with a machine of a particular type, on a patient whose brain has a particular hippocampal volume, measured with a particular analytical approach.
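
The MRI example can be sketched in Python to show why experimental data takes more modeling effort than a flat list of names and attributes. This is not D2R's mapping language: the namespace, the column names, the `row_to_triples` helper, and the sample values are all invented for illustration.

```python
EX = "http://example.org/adni#"  # hypothetical namespace

def row_to_triples(row):
    """Turn one relational row into triples centred on a measurement node."""
    scan = f"{EX}scan/{row['scan_id']}"
    # One measurement node per (scan, method) pair, so results from
    # different analysis tools on the same scan stay distinguishable.
    measurement = f"{EX}measurement/{row['scan_id']}/{row['method']}"
    return [
        (scan, EX + "patient", f"{EX}patient/{row['patient_id']}"),
        (scan, EX + "date", row["scan_date"]),
        (scan, EX + "scannerType", row["scanner_type"]),
        (measurement, EX + "derivedFrom", scan),
        (measurement, EX + "method", row["method"]),
        (measurement, EX + "hippocampalVolume", row["hippocampal_volume_mm3"]),
    ]

# Invented sample row; these are not real ADNI values.
row = {
    "scan_id": 17,
    "patient_id": "P001",
    "scan_date": "2009-06-01",
    "scanner_type": "1.5T",
    "method": "toolA",
    "hippocampal_volume_mm3": 3500.0,
}

triples = row_to_triples(row)
for t in triples:
    print(t)
```

A single row fans out into several linked statements, and the modeling choices (what gets its own node, which predicates to use) are where most of the mapping effort tends to go.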

 


Last updated: Friday, 30-Jul-2010 22:03:55 PDT
