
How to check if two entities are the same, for many entities?

Network analysis on Open PHACTS data poses an interesting problem. Suppose you start with, for example, a disease, find the genes associated with it, find the pathways in which those genes are involved, retrieve all other genes in those pathways, and then the diseases associated with those genes. This does not even include the drugs (drug-like compounds) that bind to the associated targets.

The API has no problem aggregating the information, and it is a no-brainer to make a (star) network out of this. However, we do not want to create a star network, but a network that is connected as much as possible. The various calls return a particular URI for each entity. Is it guaranteed that this URI comes from a particular resource, e.g. the ConceptWiki? If so, the URI could be used to determine whether the target is already present in the network, so that the proper edge is created even when the same entity comes back with a different URI (which may or may not be the case); that is, we basically want to create cycles in the network.

Of course, one can always use the mapping service in the API to do this, but that creates many calls, one for each (potential) node in the network.
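One way to keep the number of mapping calls down is to cache the result, so that every URI in a returned equivalence set is resolved by a single lookup. The following is only a minimal sketch of that idea: `fetch` stands in for whatever wrapper you write around the mapping service, `fake_fetch` and the `example.org` URIs are made up for illustration, and it assumes the service returns the complete, symmetric set of equivalent URIs.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class MappingCache:
    """Resolve each URI at most once; all URIs in a returned set share one entry."""

    def __init__(self, fetch: Callable[[str], Iterable[str]]) -> None:
        self._fetch = fetch          # hypothetical wrapper around the mapping call
        self._sets: Dict[str, FrozenSet[str]] = {}
        self.calls = 0               # number of lookups actually performed

    def equivalents(self, uri: str) -> FrozenSet[str]:
        if uri not in self._sets:
            self.calls += 1
            group = frozenset(self._fetch(uri)) | {uri}
            # Register the whole group so none of its members triggers a new call.
            for member in group:
                self._sets[member] = group
        return self._sets[uri]

    def canonical(self, uri: str) -> str:
        # Any deterministic choice works; here the lexicographically smallest URI.
        return min(self.equivalents(uri))


# Made-up mapping data standing in for the real service.
FAKE_TABLE = {
    "http://example.org/src1/T1": ["http://example.org/src2/T1"],
    "http://example.org/src2/T1": ["http://example.org/src1/T1"],
}


def fake_fetch(uri: str) -> Iterable[str]:
    return FAKE_TABLE.get(uri, [])


cache = MappingCache(fake_fetch)
first = cache.canonical("http://example.org/src1/T1")
second = cache.canonical("http://example.org/src2/T1")  # answered from the cache
```

With this, the number of calls grows with the number of distinct entities rather than with the number of URIs seen, which may or may not be enough depending on the size of the network.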

How are Open PHACTS API users approaching this situation?

I imagine that when you say a "network connected as much as possible", you are implying that you would like to start with a target and get something like all the pathways where it is included, all its disease associations, and all the tissues where it is expressed.

The various calls return the particular URI that is found in the dataset of that particular resource; I think this is what is guaranteed. There is no guarantee of only one preferred "flavor" of the URI, but that should not matter when performing other queries. For instance, you can start with a ConceptWiki URI for a target and use it in the Target Information call. You may then get back information about that target that comes from DrugBank, and that block of information contains the DrugBank URI. If you now want the tissue expression information for that target, you should be able to use either the ConceptWiki or the DrugBank URI (or any other Target URI you have) in the Tissues for Protein API call. Even though there is no tissue information in DrugBank, you can still use the DrugBank URI as a query. There is no real need to pass through the MapURL call unless you want to know all the URIs mapped to the specific URI that you are querying with.

The problem is basically that we use the IMS on the input (so, indeed, the SPARQLing is fine), but not on the output. That means that the user of the API will typically get an arbitrary URI for an entity, usually the URI from the data source where the information came from. That is fine. But when creating a network, one needs a way to see whether the "new" entity just returned by the last call was actually returned before. For this, consider the following sequence of calls:

  1. targets for a compound
  2. find pathways for those targets
  3. give all targets in those pathways (some already found)
  4. give all compounds for those targets (even more already found)
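The bookkeeping needed for such a chain can be sketched with a union-find structure: edges are stored with the URIs exactly as the API returned them, and nodes are merged whenever two URIs are known to denote the same entity. This is only an illustration under stated assumptions. The class `UriNetwork`, the `example.org` URIs, and the two `same_as` facts are hypothetical; in practice the equivalences would have to come from a mapping service such as the IMS or BridgeDb.

```python
from typing import Dict, Set, Tuple


class UriNetwork:
    """Undirected network whose nodes are equivalence classes of URIs."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}
        self._edges: Set[Tuple[str, str]] = set()

    def _find(self, uri: str) -> str:
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every URI on the way directly at the root.
        while self._parent[uri] != root:
            nxt = self._parent[uri]
            self._parent[uri] = root
            uri = nxt
        return root

    def same_as(self, a: str, b: str) -> None:
        """Record that two URIs denote the same entity."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[max(ra, rb)] = min(ra, rb)

    def add_edge(self, a: str, b: str) -> None:
        """Store the edge with the raw URIs; merging happens when reading."""
        self._find(a)
        self._find(b)
        self._edges.add((a, b))

    def nodes(self) -> Set[str]:
        return {self._find(u) for u in self._parent}

    def edges(self) -> Set[Tuple[str, str]]:
        return {tuple(sorted((self._find(a), self._find(b)))) for a, b in self._edges}


# Made-up results of the four calls; src1/src2 stand for two data sources.
net = UriNetwork()
net.add_edge("http://example.org/src1/compound/1", "http://example.org/src1/target/1")  # 1
net.add_edge("http://example.org/src1/target/1", "http://example.org/pathway/1")        # 2
net.add_edge("http://example.org/pathway/1", "http://example.org/src2/target/1")        # 3
net.add_edge("http://example.org/src2/target/1", "http://example.org/src2/compound/1")  # 4

# Equivalences as a mapping service might report them (assumed here).
net.same_as("http://example.org/src1/target/1", "http://example.org/src2/target/1")
net.same_as("http://example.org/src1/compound/1", "http://example.org/src2/compound/1")
```

Because edges are resolved only when `nodes()` or `edges()` is read, mappings that arrive late still merge nodes added earlier. In this toy example the five returned URIs collapse to three nodes and two edges.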

Now, in the course in Nov/Dec we had students build even longer chains of calls.

It might be possible to solve this like you suggested. But in general for network biology you will have to use not only the IMS on the output, but also either the IMS from Cytoscape or the BridgeDb Cytoscape app on the network itself. Assuming that you do not use only Open PHACTS for input, any other resource will otherwise cause the same kind of problem. Since you mention students, I am a bit hesitant to provide them with a solution that only works because we engineered it and that will not work in network biology in general.
