DO Patent: AI-Powered Tool to Convert Chemical Images into SMILES and Curate your Data

1. Overview

DO Patent is an AI tool that identifies and converts chemical images within patents and other PDF documents into SMILES strings (a convenient format for subsequent data processing).

2. Getting Started

  • Account Creation/Login: Sign up for an account here or log in to your existing account.
  • Select products of interest: After logging in, you will land on the Products page (the main page in Settings), where you can select which product you want to use.

image.png

  • Initial DO Patent Interface: Upon logging in, you’ll be presented with the DO Patent interface shown below, which is organized into distinct panels described in the following sections.

image.png

  • DO Patent selection: If the interface above is not what you see, click on DO Patent in the left menu to switch to the DO Patent app.

image.png


3. DO Patent - User Interface and Functionality

DO Patent offers a simple solution for converting chemical images within documents of interest (patents, journal articles) into SMILES strings, a universal format for encoding small molecules. The tool also provides the original images extracted from PDF documents and confidence scores for each recognition event to make data curation straightforward.

3.1 Jobs view

Jobs view consists of two segments:

  • Upload panel
  • Jobs list

The upload panel lets you submit documents for analysis, while the Jobs list provides key identifiers for locating jobs, shows each job's status, and lets you execute the necessary operations.

image.png

3.1.1 Upload panel and input parameters

You can upload PDF files either by dragging them into the upload panel or by clicking on the “Upload files” link and selecting the PDF document in the file browser.

image.png

Key input parameters are outlined in the list below:

  • Document type: Any document (patent or journal article) with chemical images
  • Input format: PDF only
  • Batch input: Supported
  • Size limit: 1 GB per document
  • Color support: DO Patent supports analysis of colored chemical images. The best results are achieved with patent-like documents, which are typically long and contain grainy black & white images
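
The input parameters above can be checked locally before uploading. The following is a minimal sketch, not part of DO Patent: the function name `check_pdf_for_upload` is hypothetical, and the size limit is assumed to mean 10**9 bytes because the documentation does not state it in bytes.

```python
from pathlib import Path

# Assumption: the "1 Gb per document" limit is read as 10**9 bytes.
MAX_SIZE_BYTES = 10**9


def check_pdf_for_upload(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} is not a file"]
    problems = []
    if p.suffix.lower() != ".pdf":
        problems.append("input format must be PDF")
    if p.stat().st_size > MAX_SIZE_BYTES:
        problems.append("file exceeds the size limit")
    with p.open("rb") as fh:
        # PDF files normally begin with the "%PDF-" signature.
        if fh.read(5) != b"%PDF-":
            problems.append("file does not start with a PDF header")
    return problems
```

Because batch input is supported, the same check can be run over every file in a folder before the files are dragged into the upload panel.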

3.1.2 Jobs list

The jobs list allows you to see a list of past and active jobs and execute certain operations with active and completed processes (see below). It is sorted by default in newest to oldest order.

image.png

“Uploaded File” and “Upload date” allow you to quickly identify documents that you need.

“Status” and “Page Analysis” give you parameters to monitor the process (see below for more details).

The “View results” button (which appears as soon as document processing is complete) opens the Results View (see below), where you can view all processed molecules and edit them if necessary.

The Cancel button is shown only for active processes. It aborts the active process.

Export All and Export Selected allow you to download the processed document results (see below).

3.1.3 Execution Parameters

Key execution parameters are outlined in the list below:

  • Time to completion: It takes about 30 min to process an average patent (120 pages, black & white). The actual completion time depends on document resolution, density of chemical images and presence of additional colors.
  • Parallel processing: Supported. Multiple patents can be uploaded for processing at once; however, the total time to finish may vary.
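
For planning purposes, the reference figure above (about 30 minutes for 120 pages) can be extrapolated linearly. This is only a rough sketch under that linearity assumption; the function name is hypothetical, and actual times depend on resolution, image density and color content as noted above.

```python
# Reference figure from the text: about 30 minutes for an average
# 120-page black & white patent.
REFERENCE_MINUTES = 30
REFERENCE_PAGES = 120


def estimate_minutes(page_count):
    """Rough linear estimate of processing time in minutes (an assumption)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * REFERENCE_MINUTES / REFERENCE_PAGES
```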

During document analysis, DO Patent can generate the following states:

  • In Queue: The file has been successfully uploaded and is waiting for analysis to start.
  • Processing: The file is being analyzed by our AI engine.
  • Completed: The document analysis was successful and the resulting Excel file is ready for download.
  • InsufficientFunds: This status is likely caused by reaching one of two limits:
  • Failed: Failed runs are rare. Please contact us at support@deeporigin.com if you experience a failed job. Errors are often transient flukes that are resolved by resubmitting the document. We do not count pages for failed runs.
  • Cancelled: Document processing has been cancelled manually by the user.
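
If you track jobs in your own scripts, the states listed above can be modeled as an enumeration. This is an illustrative sketch only: the class and helper names are hypothetical, and treating “In Queue” and “Processing” as the active states (the ones for which the Cancel button is shown) is an assumption.

```python
from enum import Enum


class JobStatus(Enum):
    """Job states as they are labeled in the Jobs list."""
    IN_QUEUE = "In Queue"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    INSUFFICIENT_FUNDS = "InsufficientFunds"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Assumption: a job is "active" while it is queued or being processed.
ACTIVE_STATES = frozenset({JobStatus.IN_QUEUE, JobStatus.PROCESSING})


def parse_status(label):
    """Map a status label to a JobStatus; raises ValueError if unknown."""
    return JobStatus(label.strip())


def can_cancel(status):
    """True if the job is in one of the assumed active states."""
    return status in ACTIVE_STATES
```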

image.png

3.1.4 Job cancellation

The cancel button is shown only for active processes. It aborts the active process. Partial charges will be posted to your account according to the number of processed pages. Partial results can be exported with Export All and Export Selected buttons or can be inspected and edited by clicking on View Results.

image.png

3.1.5 Export from the Jobs View

There are two options for exporting from the Jobs view:

  • Export all: Export all jobs on the active jobs page as separate Excel files. The download will start automatically once “Export all” is clicked with one Excel file per processed document.
  • Export selected: Export selected jobs as Excel files. The “Export selected” button appears once you have selected at least one job. The download of all selected jobs will start automatically once “Export selected” is clicked, with one Excel file per processed document.

image.png

3.2 Results View

The “View Results” button appears in the Jobs view as soon as the document is successfully processed. Clicking on this button opens the Results View, where you can see a full list of recognized molecules and curate your data.

The Results View consists of the following segments:

  • Navigation, structural modes and export
  • Table
  • PDF viewer
  • DO Draw (see below)

image.png

3.2.1 Navigation

Basic navigation shows the name of the opened document and a chevron for returning to the Jobs view.

image.png

image.png

3.2.2 Structural modes

Filters control the content shown in the table or how molecules are visualized:

  • Full molecules: full molecules are shown in the table when this mode is activated. Molecules without variable substituents or open valences are classified as “Full Molecules”
    • Note: By default, DO Patent shows only Full Molecules
  • Fragments: fragments and Markush structures are shown in the table when this mode is activated. Any molecule with variable substituents or open valences is classified as a “Fragment”
    • Note: It is rare but possible for our algorithm to misclassify full molecules as fragments. We recommend checking fragments with confidence scores above 0.92 to see whether they were misclassified.

image.png

  • Kekulization toggle: fragments containing aromatic rings can be toggled between the kekulized representation (alternating single and double bonds) and the non-kekulized, aromatic representation (circle)
    • Note: if the recognized document contains a mix of kekulized and non-kekulized structures, the kekulization operation will be applied irreversibly to the non-kekulized molecules

image.png

3.2.3 Export from the Results View

There are two ways to export data from the Results view:

  • Export all: Export all rows as an Excel file. The download will start automatically after “Export all” is clicked.
  • Export selected: Export selected rows as an Excel file. The “Export selected” button appears once you select at least one row. The download will start automatically after “Export selected” is clicked.

image.png

3.2.4 Table View

Table view allows you to curate your data and edit molecular structures.

The table view consists of the following columns:

  • ID
    • This column approximates the order in which molecules and fragments appear in the document
  • Original image
    • This column contains the image that our algorithm recognized as a molecule and extracted from the document. Direct comparison between the original image and the recognized molecule facilitates data curation.
      • Note: Clicking on the row will show the exact page in the PDF viewer from which the molecule was extracted so you can review the broader context associated with the molecule.
  • 2D-to-SMILES toggle
    • This toggle lets you switch between the 2D rendering of the recognized SMILES and the SMILES string for each recognized molecule

image.png

  • Confidence score
    • This score shows the confidence level of our AI engine for each recognized molecule. There are three confidence levels:
      • High (>0.98 confidence score) - Likely accurate structure
        • Note: The fraction of molecules with high confidence scores depends on the document type and formatting. On average, 73% of molecules in the US patents in our test set are recognized with the high confidence score.
        • Note: See the DO Patent accuracy section for more details about segmentation and recognition accuracies. Molecules with the “High” confidence tag show 97.3% accuracy per molecule and 99.9% accuracy for individual structural elements (atoms and bonds). A molecule was considered inaccurate when at least one atom or one bond was recognized incorrectly.
      • Medium (0.92-0.98 confidence score) - Needs manual review
        • Note: The fraction of molecules with medium confidence scores depends on the document type and formatting. On average, 22% of molecules in the US patents in our test set are recognized with the medium confidence score.
        • Note: Recognition accuracy is highly dependent on document formatting (see 3.5.3 Recognition accuracy - Medium confidence)
      • Low (<0.92 confidence score) - Consider discarding
  • Page
    • This column lists the page in the document from which the molecular structure was identified and recognized
    • Note: You can also click on the row to see the exact page in the PDF viewer

By default, the table is sorted by page number. However, you can also sort the table by structure ID or confidence score.
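
The three confidence levels described above can be reproduced in a script when triaging exported results. This is a minimal sketch: the function name is hypothetical, and assigning the boundary values 0.92 and 0.98 to the Medium level is an assumption, since the documentation gives the Medium range as 0.92-0.98 without stating whether the bounds are inclusive.

```python
HIGH_THRESHOLD = 0.98    # scores above this are "High"
MEDIUM_THRESHOLD = 0.92  # scores below this are "Low"


def confidence_level(score):
    """Map a confidence score in [0, 1] to High, Medium or Low."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > HIGH_THRESHOLD:
        return "High"    # likely accurate structure
    if score >= MEDIUM_THRESHOLD:
        return "Medium"  # needs manual review
    return "Low"         # consider discarding
```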

3.2.5 PDF viewer

The PDF viewer shows the source document from which recognized molecules originated. You can navigate to a specific page by entering the page number or by clicking on the left and right chevrons.

image.png

Selecting a row in the table navigates to the page in the document where the molecule in that row was found. This function is helpful if you need to review additional information related to the molecule.

image.png

3.3 Chemical structure and SMILES editing

DO Patent allows users to edit molecular structures and SMILES strings, if necessary.

3.3.1 Editing chemical structures with DO Draw

Double clicking on a 2D-rendered image of the molecule will open DO Draw, Deep Origin’s molecular editor.

image.png

image.png

The editor contains necessary tools and several shortcuts for common moieties to facilitate editing of molecular structures. Editor tools are grouped in several menus:

  • Export and bulk editing tools: Export molecule on the canvas, undo/redo operations, aromatize/dearomatize operations, clean up structure, calculate R/S designation, add/remove explicit hydrogen.
  • Selection and deletion tools: Hand tool, rectangle selection, lasso selection, fragment selection, eraser.
  • Bond types: Single bond, double bond, triple bond, single bond up, single bond down, hydrogen bond.
    • Additional bond types are also available under the hydrogen bond expansion menu - aromatic bond, dative bond, any bond, undefined single bond, undefined double bond.
  • Charges and stereochemistry: Chain tool, advanced stereochemistry, positive charge, negative charge.
  • Common rings: Benzene, cyclopentadiene and three- to eight-membered aliphatic rings.
  • Atom types: Common atoms and periodic table

image.png

The list of available shortcuts is accessible via the shortcuts button and is also shown below:

image.png

Clicking on the “Save” button will update the record. This is an irreversible change. Edited molecules receive an “edited” tag in the confidence column, which replaces the original confidence score.

image.png

Clicking on the “Cancel” button will discard any changes made to the molecule.

3.3.2 Editing SMILES strings

Switching the 2D-to-SMILES toggle from 2D-rendered molecules to SMILES displays the SMILES string for each molecule. Double clicking on a SMILES string opens a text editor within the cell.

image.png

3.4 Export format

Processed data can be exported from either the Jobs view (see section 3.1.5 Export from the Jobs View) or the Results view (see section 3.2.3 Export from the Results View). The data is exported in the .xlsx format (Excel). The file size depends on the number of chemical images in the processed PDF document and can reach 100 MB.

The output format is optimized for quick data curation and subsequent import into external databases and software solutions.

DO Patent output 2.png

The resulting .xlsx file has the following columns:

  • Structure ID: The order in which this molecule appears in the document
  • Extracted Image: The original image in the PDF document as the algorithm recognized it.
  • Predicted structure: 2D rendering of a chemical structure encoded in a SMILES string (see below).
  • Confidence: Confidence score indicating the accuracy of recognition and the need for manual data review. We recommend sorting results by the confidence score. The thresholds are:
    • >0.98 confidence score: high likelihood of accurate recognition
    • 0.92-0.98 confidence score: manual review is needed
    • <0.92 confidence score: poor recognition, consider discarding result
  • Confidence details: Specific recognition tokens forming the confidence score from the elements of the molecular structure.
  • SMILES: 1D representation of the molecule predicted by the algorithm. This is a standard format for data import across all scientific software solutions.
  • Source: Name of the original PDF document.
  • Page: Page number of the recognized image of the molecule.
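
As an illustration of downstream curation, the sketch below collects unique high confidence SMILES strings from exported rows. It assumes the rows have already been loaded from the .xlsx file (for example with a spreadsheet library) into dictionaries keyed by the column names listed above, and that the Confidence column holds numeric values; the function name is hypothetical.

```python
def unique_high_confidence_smiles(rows, threshold=0.98):
    """Return (Structure ID, SMILES, Source, Page) for each first occurrence
    of a SMILES string whose confidence score is above the threshold."""
    seen = set()
    result = []
    for row in rows:
        smiles = row["SMILES"]
        # Assumption: Confidence is numeric or a numeric string.
        if float(row["Confidence"]) > threshold and smiles not in seen:
            seen.add(smiles)
            result.append(
                (row["Structure ID"], smiles, row["Source"], row["Page"])
            )
    return result
```

Note that this deduplicates by exact string match. Different SMILES strings can encode the same molecule, so a cheminformatics toolkit would be needed to canonicalize them before a strict comparison.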

3.5 DO Patent accuracy

DO Patent consists of two systems: segmentation and recognition. The segmentation module identifies and classifies images that contain molecules. The recognition module analyzes each image and predicts the SMILES string that fits it.

This accuracy analysis was conducted by an experienced medicinal chemist manually looking at each page of a PDF document and comparing it to the segmentation and recognition results. 25 random US patents were selected for this exercise to capture diversity of formatting styles. Criteria for selection were type of patent, document size, filing company, market share of a drug and therapeutic modality.

3.5.1 Segmentation accuracy

During the segmentation accuracy analysis, extracted non-chemical images and images containing more than one chemical entity were considered “false positives”. Chemical images present in the PDF document but missing in the Results table were considered “false negatives”.

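| patent ID | entity | company | number of pages | number of structures | number of false positives | segmentation accuracy (false positives), % | number of false negatives | segmentation accuracy (false negatives), % |
|---|---|---|---|---|---|---|---|---|
| US7838499 B2 | Brenzavvy | Theracos | 74 | 330 | 3 | 99.1% | 0 | 100.0% |
| US2022/0324863 A1 | Clinical candidate for Leishmaniasis | Novartis | 135 | 742 | 12 | 98.4% | 6 | 99.2% |
| US9447106 B2 | Brukinsa | BeiGene | 225 | 815 | 8 | 99.0% | 0 | 100.0% |
| US8410103 B2 | Cabenuva | Shionogi | 94 | 411 | 7 | 98.3% | 3 | 99.3% |
| US8039627 B2 | Ingrezza | Neurocrine | 18 | 25 | 6 | 76.0% | 0 | 100.0% |
| US9592208 B2 | Gilenya | Novartis | 9 | 2 | 0 | 100.0% | 0 | 100.0% |
| US8324208 B2 | Jesduvroq | GSK | 65 | 164 | 1 | 99.4% | 1 | 99.4% |
| US8324225 B2 | Kisqali | Novartis | 131 | 706 | 10 | 98.6% | 0 | 100.0% |
| US11351149 B2 | Paxlovid | Pfizer | 169 | 554 | 19 | 96.6% | 0 | 100.0% |
| US8129385 B2 | Dovato | Shionogi | 92 | 414 | 7 | 98.3% | 0 | 100.0% |
| US7964580 B2 | Epclusa | Pharmasset | 256 | 342 | 1 | 99.7% | 1 | 99.7% |
| US7598257 B2 | Jakafi | Incyte | 190 | 971 | 3 | 99.7% | 12 | 98.8% |
| US10342780 B2 | Jaypirca | Loxo | 179 | 872 | 6 | 99.3% | 0 | 100.0% |
| US8207125 B2 | Kyprolis | Onyx | 38 | 112 | 1 | 99.1% | 0 | 100.0% |
| US9617258 B2 | Litfulo | Pfizer | 142 | 444 | 5 | 98.9% | 0 | 100.0% |
| US8937150 B2 | Mavyret | AbbVie | 323 | 1993 | 51 | 97.4% | 0 | 100.0% |
| US7390791 B2 | Odefsey | Gilead | 29 | 91 | 4 | 95.6% | 0 | 100.0% |
| US7342118 B2 | Ogsiveo | Pfizer | 47 | 62 | 3 | 95.2% | 1 | 98.4% |
| US8486941 B2 | Ojjaara | YM Biosciences | 65 | 104 | 2 | 98.1% | 0 | 100.0% |
| US8158616 B2 | Olumiant | Incyte | 79 | 223 | 2 | 99.1% | 2 | 99.1% |
| US7427638 B2 | Otezla | Amgen | 24 | 14 | 7 | 50.0% | 0 | 100.0% |
| US10406240 B2 | Pluvicto | Purdue U | 79 | 207 | 7 | 96.6% | 6 | 97.1% |
| US8101623 B2 | Truqap | AstraZeneca | 83 | 233 | 4 | 98.3% | 0 | 100.0% |
| US8754096 B2 | Ubrelvy | Merck | 34 | 115 | 0 | 100.0% | 0 | 100.0% |
| US9309245 B2 | Xacduro | Entasis | 107 | 432 | 2 | 99.5% | 0 | 100.0% |
| Total | | | 2687 | 10378 | 171 | 98.4% | 32 | 99.7% |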
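
The percentages in the segmentation table are consistent with computing accuracy as one minus the ratio of errors (false positives or false negatives) to the number of structures. The source does not state this formula explicitly, so the sketch below is an inference from the reported numbers; the function name is hypothetical.

```python
def segmentation_accuracy(n_structures, n_false_positives, n_false_negatives):
    """Return (accuracy vs. false positives, accuracy vs. false negatives)
    as percentages rounded to one decimal place."""
    if n_structures <= 0:
        raise ValueError("n_structures must be positive")
    fp_accuracy = round(100 * (1 - n_false_positives / n_structures), 1)
    fn_accuracy = round(100 * (1 - n_false_negatives / n_structures), 1)
    return fp_accuracy, fn_accuracy
```

Applied to the totals (10378 structures, 171 false positives, 32 false negatives), this reproduces the reported 98.4% and 99.7%.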

3.5.2 Recognition accuracy - High confidence (>0.98 score)

Recognition accuracy was assessed only for full molecules (molecules without open valences or variable ligands). A molecule with a single error in a bond or an atom was counted as a recognition error. Recognition accuracy for individual elements (atoms and bonds) was estimated from the atom and bond counts of the high confidence molecules and the number of recognition errors, because the vast majority of incorrectly recognized high confidence molecules carried a single misrecognized element.

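| patent ID | entity | company | number of pages | number of full molecules | number of high confidence molecules | fraction of high confidence molecules, % | number of errors of high confidence molecules | recognition accuracy, % |
|---|---|---|---|---|---|---|---|---|
| US7838499 B2 | Brenzavvy | Theracos | 74 | 292 | 233 | 79.8% | 6 | 97.4% |
| US2022/0324863 A1 | Clinical candidate for Leishmaniasis | Novartis | 135 | 526 | 484 | 92.0% | 1 | 99.8% |
| US9447106 B2 | Brukinsa | BeiGene | 225 | 732 | 690 | 94.3% | 2 | 99.7% |
| US8410103 B2 | Cabenuva | Shionogi | 94 | 260 | 157 | 60.4% | 21 | 86.6% |
| US8039627 B2 | Ingrezza | Neurocrine | 18 | 9 | 5 | 55.6% | 0 | 100.0% |
| US9592208 B2 | Gilenya | Novartis | 9 | 2 | 2 | 100.0% | 0 | 100.0% |
| US8324208 B2 | Jesduvroq | GSK | 65 | 133 | 131 | 98.5% | 2 | 98.5% |
| US8324225 B2 | Kisqali | Novartis | 131 | 604 | 590 | 97.7% | 1 | 99.8% |
| US11351149 B2 | Paxlovid | Pfizer | 169 | 497 | 360 | 72.4% | 36 | 90.0% |
| US8129385 B2 | Dovato | Shionogi | 92 | 254 | 144 | 56.7% | 14 | 90.3% |
| US7964580 B2 | Epclusa | Pharmasset | 256 | 68 | 57 | 83.8% | 7 | 87.7% |
| US7598257 B2 | Jakafi | Incyte | 190 | 221 | 161 | 72.9% | 2 | 98.8% |
| US10342780 B2 | Jaypirca | Loxo | 179 | 548 | 458 | 83.6% | 3 | 99.3% |
| US8207125 B2 | Kyprolis | Onyx | 38 | 93 | 6 | 6.5% | 2 | 66.7% |
| US9617258 B2 | Litfulo | Pfizer | 142 | 383 | 264 | 68.9% | 1 | 99.6% |
| US8937150 B2 | Mavyret | AbbVie | 323 | 546 | 227 | 41.6% | 8 | 96.5% |
| US7390791 B2 | Odefsey | Gilead | 29 | 50 | 27 | 54.0% | 1 | 96.3% |
| US7342118 B2 | Ogsiveo | Pfizer | 47 | 1 | | | | |
| US8486941 B2 | Ojjaara | YM Biosciences | 65 | 95 | 3 | 3.2% | 0 | 100.0% |
| US8158616 B2 | Olumiant | Incyte | 79 | 123 | 89 | 72.4% | 1 | 98.9% |
| US7427638 B2 | Otezla | Amgen | 24 | 7 | 3 | 42.9% | 0 | 100.0% |
| US10406240 B2 | Pluvicto | Purdue U | 79 | 86 | 23 | 26.7% | 3 | 87.0% |
| US8101623 B2 | Truqap | AstraZeneca | 83 | 196 | 192 | 98.0% | 3 | 98.4% |
| US8754096 B2 | Ubrelvy | Merck | 34 | 61 | 52 | 85.2% | 0 | 100.0% |
| US9309245 B2 | Xacduro | Entasis | 107 | 369 | 124 | 33.6% | 5 | 96.0% |
| Total | | | 2687 | 6156 | 4482 | 72.8% | 119 | 97.3% |

Individual structural elements in high confidence molecules:

| element | number of elements | number of errors | recognition accuracy, % |
|---|---|---|---|
| atoms | 231429 | 119 | 99.95% |
| bonds | 185143 | 119 | 99.94% |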
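
The reported high confidence accuracies can be reproduced from the totals above. The sketch below assumes, as the reported figures suggest, that each of the 119 erroneous molecules contributes exactly one misrecognized atom or bond; the helper name is hypothetical.

```python
def accuracy_percent(total, errors, digits=1):
    """Accuracy as a percentage: 100 * (1 - errors / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * (1 - errors / total), digits)


# Totals taken from the high confidence table.
molecule_accuracy = accuracy_percent(4482, 119)           # per molecule
atom_accuracy = accuracy_percent(231429, 119, digits=2)   # per atom
bond_accuracy = accuracy_percent(185143, 119, digits=2)   # per bond
```

The large gap between the per-molecule and per-element figures follows from the strict criterion: a single wrong atom or bond marks the whole molecule as an error.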

3.5.3 Recognition accuracy - Medium confidence (0.92-0.98 score)

The methodology for assessing the accuracy of molecules with medium confidence scores was the same as for molecules with high confidence scores. Recognition accuracy of individual elements was not calculated because the probability of a molecule carrying more than one misrecognized element was non-negligible.

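| patent ID | entity | company | number of pages | number of full molecules | number of medium confidence molecules | fraction of medium confidence molecules, % | number of errors of medium confidence molecules | recognition accuracy, % |
|---|---|---|---|---|---|---|---|---|
| US7838499 B2 | Brenzavvy | Theracos | 74 | 292 | 41 | 14.0% | 15 | 63.4% |
| US2022/0324863 A1 | Clinical candidate for Leishmaniasis | Novartis | 135 | 526 | 26 | 4.9% | 9 | 65.4% |
| US9447106 B2 | Brukinsa | BeiGene | 225 | 732 | 40 | 5.5% | 12 | 70.0% |
| US8410103 B2 | Cabenuva | Shionogi | 94 | 260 | 84 | 32.3% | 31 | 63.1% |
| US8039627 B2 | Ingrezza | Neurocrine | 18 | 9 | 4 | 44.4% | 0 | 100.0% |
| US9592208 B2 | Gilenya | Novartis | 9 | 2 | | | | |
| US8324208 B2 | Jesduvroq | GSK | 65 | 133 | 2 | 1.5% | 1 | 50.0% |
| US8324225 B2 | Kisqali | Novartis | 131 | 604 | 14 | 2.3% | 5 | 64.3% |
| US11351149 B2 | Paxlovid | Pfizer | 169 | 497 | 74 | 14.9% | 10 | 86.5% |
| US8129385 B2 | Dovato | Shionogi | 92 | 254 | 97 | 38.2% | 34 | 64.9% |
| US7964580 B2 | Epclusa | Pharmasset | 256 | 68 | 7 | 10.3% | 1 | 85.7% |
| US7598257 B2 | Jakafi | Incyte | 190 | 221 | 47 | 21.3% | 14 | 70.2% |
| US10342780 B2 | Jaypirca | Loxo | 179 | 548 | 76 | 13.9% | 33 | 56.6% |
| US8207125 B2 | Kyprolis | Onyx | 38 | 93 | 82 | 88.2% | 56 | 31.7% |
| US9617258 B2 | Litfulo | Pfizer | 142 | 383 | 109 | 28.5% | 27 | 75.2% |
| US8937150 B2 | Mavyret | AbbVie | 323 | 546 | 311 | 57.0% | 75 | 75.9% |
| US7390791 B2 | Odefsey | Gilead | 29 | 50 | 18 | 36.0% | 4 | 77.8% |
| US7342118 B2 | Ogsiveo | Pfizer | 47 | 1 | 1 | 100.0% | 0 | 100.0% |
| US8486941 B2 | Ojjaara | YM Biosciences | 65 | 95 | 18 | 18.9% | 14 | 22.2% |
| US8158616 B2 | Olumiant | Incyte | 79 | 123 | 31 | 25.2% | 6 | 80.6% |
| US7427638 B2 | Otezla | Amgen | 24 | 7 | 4 | 57.1% | 0 | 100.0% |
| US10406240 B2 | Pluvicto | Purdue U | 79 | 86 | 48 | 55.8% | 18 | 62.5% |
| US8101623 B2 | Truqap | AstraZeneca | 83 | 196 | 3 | 1.5% | 3 | 0.0% |
| US8754096 B2 | Ubrelvy | Merck | 34 | 61 | 8 | 13.1% | 2 | 75.0% |
| US9309245 B2 | Xacduro | Entasis | 107 | 369 | 216 | 58.5% | 105 | 51.4% |
| Total | | | 2687 | 6156 | 1361 | 22.1% | 475 | 65.1% |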

3.5.4 Recognition accuracy - Low confidence (<0.92 score)

The methodology for assessing the accuracy of molecules with low confidence scores was the same as for molecules with high confidence scores. Recognition accuracy of individual elements was not calculated because the probability of a molecule carrying more than one misrecognized element was non-negligible.

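| patent ID | entity | company | number of pages | number of full molecules | number of low confidence molecules | fraction of low confidence molecules, % | number of errors of low confidence molecules | recognition accuracy, % |
|---|---|---|---|---|---|---|---|---|
| US7838499 B2 | Brenzavvy | Theracos | 74 | 292 | 18 | 6.2% | 10 | 44.4% |
| US2022/0324863 A1 | Clinical candidate for Leishmaniasis | Novartis | 135 | 526 | 16 | 3.0% | 12 | 25.0% |
| US9447106 B2 | Brukinsa | BeiGene | 225 | 732 | 3 | 0.4% | 3 | 0.0% |
| US8410103 B2 | Cabenuva | Shionogi | 94 | 260 | 14 | 5.4% | 5 | 64.3% |
| US8039627 B2 | Ingrezza | Neurocrine | 18 | 9 | | | | |
| US9592208 B2 | Gilenya | Novartis | 9 | 2 | | | | |
| US8324208 B2 | Jesduvroq | GSK | 65 | 133 | | | | |
| US8324225 B2 | Kisqali | Novartis | 131 | 604 | | | | |
| US11351149 B2 | Paxlovid | Pfizer | 169 | 497 | 62 | 12.5% | 26 | 58.1% |
| US8129385 B2 | Dovato | Shionogi | 92 | 254 | 13 | 5.1% | 7 | 46.2% |
| US7964580 B2 | Epclusa | Pharmasset | 256 | 68 | 4 | 5.9% | 0 | 100.0% |
| US7598257 B2 | Jakafi | Incyte | 190 | 221 | 12 | 5.4% | 8 | 33.3% |
| US10342780 B2 | Jaypirca | Loxo | 179 | 548 | 14 | 2.6% | 3 | 78.6% |
| US8207125 B2 | Kyprolis | Onyx | 38 | 93 | 6 | 6.5% | 3 | 50.0% |
| US9617258 B2 | Litfulo | Pfizer | 142 | 383 | 10 | 2.6% | 4 | 60.0% |
| US8937150 B2 | Mavyret | AbbVie | 323 | 546 | 6 | 1.1% | 6 | 0.0% |
| US7390791 B2 | Odefsey | Gilead | 29 | 50 | 5 | 10.0% | 3 | 40.0% |
| US7342118 B2 | Ogsiveo | Pfizer | 47 | 1 | | | | |
| US8486941 B2 | Ojjaara | YM Biosciences | 65 | 95 | 74 | 77.9% | 73 | 1.4% |
| US8158616 B2 | Olumiant | Incyte | 79 | 123 | 3 | 2.4% | 1 | 66.7% |
| US7427638 B2 | Otezla | Amgen | 24 | 7 | | | | |
| US10406240 B2 | Pluvicto | Purdue U | 79 | 86 | 15 | 17.4% | 11 | 26.7% |
| US8101623 B2 | Truqap | AstraZeneca | 83 | 196 | 1 | 0.5% | 1 | 0.0% |
| US8754096 B2 | Ubrelvy | Merck | 34 | 61 | 1 | 1.6% | 1 | 0.0% |
| US9309245 B2 | Xacduro | Entasis | 107 | 369 | 29 | 7.9% | 27 | 6.9% |
| Total | | | 2687 | 6156 | 306 | 5.0% | 204 | 33.3% |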

4. Deep Origin’s User Portal

The portal interface is designed to host multiple applications (e.g., Balto - the first AI assistant for drug discovery). It is divided into two main panels (see detailed descriptions in the following sections):

  • Products and Settings panel
  • Application panel

4.1 Products and Account Settings Panel

The left panel provides access to various Deep Origin products and account settings. It is divided into two segments:

image.png

Products and Settings

  • Top Segment: Displays a list of the Deep Origin products that you have activated on the Product selection page
  • Bottom Segment: Provides links to:
    • Account: Manage your account information (first name, last name, title, company, password). Clicking “Account” takes you to the account settings page.
    • Settings: Access Deep Origin’s product selection, pricing & billing details, manage team members. (See Settings Menu for details.)
    • Documentation: Direct access to this documentation
    • Support: Send support email to our customer support team at support@deeporigin.com
    • Logout: Log out of your Deep Origin account

You can collapse or expand this panel by clicking the double arrow (<<) next to your name.


5. Pricing

Subscription and pricing model

DO Patent uses pay-per-use pricing. Account creation, the monthly subscription and the analysis of the first 50 pages each month are FREE. Pages exceeding the monthly free page limit are charged according to your Pricing Tier (see below):

  • Standard Tier: $0.10 per page
  • Academic Tier: $0.06 per page
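
The pay-per-use model above can be expressed as a small cost estimate. This is a sketch for budgeting only, not Deep Origin's billing code: the function name is hypothetical, amounts are kept in integer cents to avoid floating point rounding, and the number of free pages you have left should be taken from the Billing tab.

```python
FREE_PAGES_PER_MONTH = 50
PRICE_CENTS_PER_PAGE = {"standard": 10, "academic": 6}


def job_cost_cents(pages, tier, free_pages_remaining=FREE_PAGES_PER_MONTH):
    """Estimated charge in cents for a job; free pages are consumed first."""
    if pages < 0 or free_pages_remaining < 0:
        raise ValueError("page counts must be non-negative")
    billable_pages = max(0, pages - free_pages_remaining)
    return billable_pages * PRICE_CENTS_PER_PAGE[tier]
```

For example, a 120-page patent processed at the start of the month on the Standard Tier would be estimated at 70 billable pages, or $7.00.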

Free Page Count

You can check your remaining free pages by clicking on Settings and then on the Billing tab.

image.png

Pricing Tiers

DO Patent has two pricing tiers:

  • Standard pricing tier
  • Academic pricing tier

You will automatically get the Academic tier if you sign up with a .edu email address.

DO Patent has both monthly free pages and an associated cost per page when the free page limit is exceeded. DO Patent charges and free pages appear in the billing view as “PDF analysis”. Free pages are consumed first. Pages exceeding the free limit will be charged according to your Pricing Tier:

  • Standard Tier: $0.10 per page
  • Academic Tier: $0.06 per page

If you would like to adjust your pricing tier or discuss additional pricing options, then please contact support at support@deeporigin.com.

You can always review your aggregated page count as well as specific actions breakdown by clicking on Settings and then on the Billing tab.

image.png

Auto-approval

DO Patent has a default auto-approve threshold set at $50.

DO Patent will automatically execute any actions that will cost less than $50 and will ask for permission to proceed if the job will cost more than $50.
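
The auto-approval rule can be sketched as a simple comparison. The documentation does not say what happens when a job costs exactly $50, so treating that case as requiring approval is an assumption made here; the function name is hypothetical and amounts are in integer cents.

```python
AUTO_APPROVE_THRESHOLD_CENTS = 5000  # default threshold of $50


def needs_manual_approval(cost_cents, threshold_cents=AUTO_APPROVE_THRESHOLD_CENTS):
    """True if the job is expected to prompt for permission before running."""
    # Assumption: a cost exactly equal to the threshold requires approval.
    return cost_cents >= threshold_cents
```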

Billing cycle

The credit card on file will be charged after the end of the month for any paid tool actions performed that month.

Credit usage

You will automatically receive a $500 credit limit when you enter a payment method. This credit limit allows you to process large PDF documents in excess of the monthly free actions allowance.

The mechanics of this credit limit are similar to a credit card limit. Credit usage accumulates all your unpaid charges for the current month (billing cycle) and unpaid charges for the previous month (billing cycle). Once the bill for the previous billing cycle is paid, your credit usage will be lowered by the paid bill amount. Once you have hit your credit limit, you will not be able to perform additional paid actions without contacting our support team at support@deeporigin.com.
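
The credit usage mechanics described above can be sketched as follows. The function names are hypothetical and amounts are in integer cents; the sketch only models the stated rule that paid actions are blocked once usage has reached the limit, and does not model whether a single job may push usage past the limit.

```python
CREDIT_LIMIT_CENTS = 50000  # $500 credit limit


def credit_usage_cents(unpaid_current_cycle, unpaid_previous_cycle):
    """Credit usage: unpaid charges of the current and previous billing cycles."""
    return unpaid_current_cycle + unpaid_previous_cycle


def can_run_paid_action(usage_cents, limit_cents=CREDIT_LIMIT_CENTS):
    """False once the credit limit has been reached."""
    return usage_cents < limit_cents
```

Paying the previous cycle's bill sets its unpaid amount to zero, which lowers the usage and can make paid actions available again.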

You can always access your credit usage limit by clicking on Settings and then on the Billing tab.

image.png

Payments via Purchase Order

To pay via purchase order, please contact our customer support team at support@deeporigin.com.


6. Settings Menu

6.1 Products

Clicking “Settings” in the left navigation panel takes you to the Products tab. You can see all available Deep Origin products and add products to your product list. Subscription to additional products is FREE.

image.png

6.2 Billing tab

The Billing tab shows details about your current subscription, available tools, current charges and tool usage. The view consists of five sections:

image.png

  • Features: Shows your current pricing tier, billing cycle and lists available paid tools with free actions limit and pricing.

image.png

  • Account balance: Shows current credit limit usage, current payment method (if one is set up) and auto-approval threshold (if payment method is set up).

image.png

  • Payment history: A table displaying past invoices for paid tool usage.
  • Monthly overview: Lists all actions executed during the current month. The view is broken down into five columns:
    • Period and tool names: You can select a different period by clicking on the month. Note, if a particular tool was not used during this month, it will not show up on this list.
    • Total count: Shows total count of free and paid actions.
    • Free actions: Shows total count of used free actions and available limit of free actions in the format of XX used of YY available.
    • Paid actions: Shows total count of actions above the free actions limit.
    • Amount: Shows charges calculated by multiplying your paid actions count and the price of action.

image.png

  • Recent activities: Shows a list of the last 10 executed Premium tool actions.

image.png

6.3 Members tab

The Members tab shows the current members of your Deep Origin Organization and lets you invite new members, delete existing members, and change member roles.

  • Members list: Current members of your organization and their roles (e.g., “Owner”, “Admin”, “Pending”)

image.png

  • Invite member: Enter an email address in the pop-up window to invite a new member.

    image.png

image.png

  • Edit and delete member: Edit member’s roles and delete existing members from your Deep Origin Organization.

image.png


7. Support

For additional guidance, contact our support team at support@deeporigin.com.