SureChEMBL Patent Annotation
The SureChEMBL patent corpus comprises full text from the EPO, WO and US patent offices plus English abstracts of Japanese patents. All patents are fed through the chemistry annotation pipeline regardless of their classification (e.g. IPC codes). The current SureChEMBL chemistry pipeline identifies mentions of chemicals in the full text, figures and mol file attachments of patent documents using automated entity-recognition, name-to-structure and image-to-structure conversion algorithms. The extraction methods do not distinguish between claimed compounds, intermediates, reagents or other chemicals. They extract only fully defined compounds, not Markush scaffolds.
Patent and compound filters
For the Open PHACTS subset, patent documents with classification codes not included in the list below were flagged as non life-science-relevant and were subsequently removed.
- A01, A23, A24, A61, A62B
- C05, C06, C07, C08, C09, C10, C11, C12, C13, C14
The IPC, ECLA, IPCR, and CPC hierarchical classification systems were checked for the above codes. For more information on these codes, please see here.
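The classification filter above can be sketched as a simple prefix check. This is only an illustration: the function name, the whitespace and case normalisation, and the use of prefix matching against the hierarchical codes are assumptions, not a description of the actual filtering code.

```python
# Class prefixes taken from the list above.
LIFE_SCIENCE_PREFIXES = (
    "A01", "A23", "A24", "A61", "A62B",
    "C05", "C06", "C07", "C08", "C09", "C10", "C11", "C12", "C13", "C14",
)


def is_life_science_relevant(classification_codes):
    """Return True if at least one code falls under a listed class.

    Assumption: a code such as "A61K 31/00" belongs to class "A61"
    when it starts with that prefix (hierarchical codes).
    """
    for code in classification_codes:
        normalised = code.replace(" ", "").upper()
        # str.startswith accepts a tuple of candidate prefixes.
        if normalised.startswith(LIFE_SCIENCE_PREFIXES):
            return True
    return False
```

A patent carrying the codes "G06F 17/30" and "C07D 401/04" would be kept under this sketch, because one of its codes falls under C07.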
Furthermore, a number of filtering rules are applied to remove annotations of trivial, common or non-drug-like chemicals. More specifically, compounds are retained only if they fulfil all of the criteria listed below:
- Must not be a radical
- Must have fewer than 4 components
- Must be organic
- Molecular weight must be between 100 and 6000
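As a rough sketch, the four criteria above can be combined into one predicate. The `CompoundSummary` record and its field names are hypothetical, the properties are assumed to have been computed beforehand by a cheminformatics toolkit, and treating the molecular weight bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CompoundSummary:
    """Hypothetical record of precomputed properties for one compound."""
    is_radical: bool
    component_count: int
    is_organic: bool
    molecular_weight: float


def passes_compound_filters(compound: CompoundSummary) -> bool:
    """Return True if the compound satisfies every listed criterion."""
    return (
        not compound.is_radical
        and compound.component_count < 4
        and compound.is_organic
        # Inclusive bounds are an assumption; the text says "between".
        and 100 <= compound.molecular_weight <= 6000
    )
```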
Annotations and relevance scores
To annotate the SureChEMBL patent corpus with target and disease information, the Termite text-mining engine from SciBite was used. Termite combines manually curated target and disease dictionaries with named-entity recognition methods. Termite was applied to all patents written in the English language and containing at least one chemical annotation. All identified targets and diseases were extracted for each patent and annotated with the sections in which they were found, their frequency and a relevance score.
The sections of a patent document containing chemical or biological annotations are the following:
- Image (applicable to chemical annotations only and patents published after 2007)
- CWU Attachment (applicable to chemical annotations only and US patents published after 2007)
The relevance score was developed to help identify the 'key' entities within a patent document. For example, many documents mention a large number of different proteins, but only one or two of these are actually the intended targets of the compounds claimed in the patent. By combining information regarding the reliability of the synonyms found, the document sections in which entities are identified and the frequency of occurrence, Termite is able to assign a relevance score between 0 and 3 to each target or disease annotation.
- A score of 0 (lowest relevance) is obtained when the target or disease is identified only by synonyms that Termite considers to be ambiguous. These results have a higher likelihood of being false positives and should be filtered out for most applications.
- A score of 1 is obtained when the entity is found with low frequency and does not appear in key sections of the patent.
- A score of 2 is obtained when the entity is mentioned in multiple document locations but still does not appear to be the main focus of the patent.
- A score of 3 (highest relevance) is obtained when the entity identified is likely to be a key subject of the patent, e.g. the key target or disease for which the compounds claimed in the patent are intended.
The concept of relevance scoring was extended to include compounds. The main components of the compound relevance score are the global frequency (i.e. how many times a compound is mentioned in the whole patent corpus), the 2D similarity to known approved drugs (i.e. the drug-likeness), and, most importantly, the number of 2D nearest neighbours (i.e. structural analogues) a compound has within the compounds extracted from the same family of patents. Given these components, we are able to assign a relevance score between 0 and 3 to each chemical entity.
- A score of 0 (lowest relevance) is obtained when the compound has a very high global frequency or has no structural analogues in the same family of patents. These results have a higher likelihood of being irrelevant and should be filtered out for most applications.
- A score of 1 is obtained when the chemical entity has few (<10%) structural analogues in the same family of patents.
- A score of 2 is obtained when the chemical entity has a substantial number of analogues.
- A score of 3 (highest relevance) is obtained for the chemical entities with the highest number of structural analogues.
N.B., given that any data or text-mining of patent documents is inherently noisy, these scores are only intended as a guide. A good starting point for most purposes would be to include only annotations scoring 2 or 3.
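The suggested starting point of keeping only annotations scoring 2 or 3 can be sketched as follows. The `Annotation` record and its field names are hypothetical and do not reflect the actual data model or API response format.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical record for one entity-patent association."""
    entity_id: str
    entity_type: str  # "compound", "target" or "disease"
    frequency: int
    relevance: int    # relevance score, 0 (lowest) to 3 (highest)


def filter_by_relevance(annotations, minimum=2):
    """Keep annotations whose relevance score is at least `minimum`."""
    if not 0 <= minimum <= 3:
        raise ValueError("minimum must be between 0 and 3")
    return [a for a in annotations if a.relevance >= minimum]
```

Lowering `minimum` to 1 keeps more annotations at the cost of more noise, while leaving out only score 0 removes the entries most likely to be false positives.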
Some initial testing of the accuracy of the annotation methods has been carried out, but this is not extensive due to the lack of an adequate gold-standard corpus. Comparison of chemical annotations with SciFinder and other tools has recently been published here.
Termite target annotations have been compared with manually assigned GVKBio targets for a set of 110 patents. In this set, 89% of the targets assigned by GVK were identified by Termite, 74% of them with high relevance (score 2 or 3).
Patent Annotation RDF
A custom data model was developed to represent the SureChEMBL patent annotation (SureChEMBL Core Ontology - scco). This is shown below. EBI URIs are assigned to the annotated molecules, targets, diseases and patents. The SureChEMBL data set is not currently hosted on the EBI-RDF platform, hence these URIs do not resolve at present. Each entity-patent association is represented by an association object which provides details of the document sections, frequency and relevance score associated with the annotation.
Compounds are assigned SureChEMBL identifiers as used in the SureChEMBL interface and download files. Please note these identifiers have no relation to ChEMBL identifiers, but the UniChem system can be used to cross-reference the two. The URIs provided take the following form:
Please note that SureChEMBL molecules are not yet loaded in the Open PHACTS chemical registry, so cannot currently be retrieved via OCRS IDs.
Targets are identified by HGNC symbols with URIs of the form:
Mappings from HGNC symbols to other gene/protein identifiers are available via the IMS through Ensembl linksets.
Diseases are identified by MeSH disease identifiers with URIs of the form:
Mappings to UMLS and Disease Ontology (DO) are available via DisGeNET link sets in the IMS. It should be noted that not all MeSH identifiers currently map to a disease in DO.
Patents are uniquely identified by patent numbers in a defined format. This should be the patent office code (e.g., EP, WO or US) followed by a hyphen, the patent number (no leading zeros), another hyphen and finally the kind code (e.g., A1, B2). The SureChEMBL interface provides a service to standardise and resolve other formats of patent numbers. For more information, see here.
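The patent number format described above can be checked with a regular expression. This is a minimal sketch under stated assumptions: the office code is taken to be two uppercase letters, the number to consist of digits only, and the kind code to be one letter optionally followed by one digit. The function names are hypothetical, and this sketch is not a substitute for the SureChEMBL standardisation service.

```python
import re

# Office code, hyphen, number without leading zeros, hyphen, kind code.
PATENT_NUMBER = re.compile(r"([A-Z]{2})-([1-9][0-9]*)-([A-Z][0-9]?)")


def parse_patent_number(text):
    """Split a patent number into its parts, or return None if malformed."""
    match = PATENT_NUMBER.fullmatch(text)
    if match is None:
        return None
    office, number, kind = match.groups()
    return {"office": office, "number": number, "kind": kind}


def format_patent_number(office, number, kind):
    """Build a patent number in the expected format, dropping leading zeros."""
    return f"{office.upper()}-{int(number)}-{kind.upper()}"
```

For instance, `format_patent_number("ep", "0123456", "a1")` produces `EP-123456-A1`, which `parse_patent_number` then accepts.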
URIs take the form:
The web service calls are available on API version 2.1: https://dev.openphacts.org/docs/2.1. These include:
- Patent Information - Retrieves bibliographic information for a patent document, e.g. title, publication date and classification codes.
- Patent Entities - Retrieves all annotations (compounds, genes and diseases) found in a patent document, along with their frequency of occurrence within the document, section and relevance score.
- Patent Entities: Count - Retrieves the number of entities mentioned in the patent specified.
- Patents for Compound: Count - Retrieves the number of patents a compound entity occurs in.
- Patents for Compound: List - Retrieves a list of patents a compound entity occurs in.
- Patents for Target: Count - Retrieves the number of patents a gene entity occurs in.
- Patents for Target: List - Retrieves a list of patents a gene entity occurs in.
- Patents for Disease: Count - Retrieves the number of patents a disease entity occurs in.
- Patents for Disease: List - Retrieves a list of patents a disease entity occurs in.
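A request to one of these calls could be assembled along the following lines. Every specific here is a placeholder: the base URL, the `/patent` path, the parameter names (`uri`, `app_id`, `app_key`) and the patent URI are assumptions chosen for illustration and should be checked against the API documentation linked above.

```python
from urllib.parse import urlencode


def build_request_url(base_url, path, params):
    """Join base URL, path and URL-encoded query parameters."""
    # Sorting keeps the parameter order stable between runs.
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}?{query}"


# Hypothetical values, not real endpoints or credentials.
example_url = build_request_url(
    "https://api.example.org/2.1",
    "/patent",
    {
        "uri": "http://example.org/patent/EP-123456-A1",
        "app_id": "YOUR_APP_ID",
        "app_key": "YOUR_APP_KEY",
    },
)
```

The resulting string can be passed to any HTTP client. The patent URI is percent-encoded by `urlencode`, so it can safely be carried as a query parameter.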
A number of KNIME workflows have been developed to demonstrate potential use-cases. These include API calls and subsequent filtering, processing and visualisation of the returned data. The workflows are available on request.