Molinspiration Cheminformatics

Misearch - Molinspiration Molecular Search Engine, v2023.10

Misearch is a molecular search system that supports substructure, structure similarity and exact searches, and makes it easy to create web-enabled molecular databases.

The misearch engine is written entirely in Java and can therefore be used on any platform where Java (version 11 or later) is installed. Java is currently supported on practically all platforms (Windows, Linux, Unix). The latest version of the Java runtime environment may be downloaded for free from www.java.sun.com. (You can find out which version of Java is installed on your machine with the command java -version.) No other software is required to run misearch.
The misearch toolkit places no limit on the size of the molecular database used. Some of our customers use databases with several million molecules. A typical structure or similarity search in a database of 100'000 molecules requires 3-4 seconds on an average single-processor PC.
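
The Java version requirement above can also be checked from a script. The following is a minimal Python sketch, not part of misearch; the function names are hypothetical. It relies on the fact that java -version normally prints its banner to standard error and that releases before Java 9 report themselves as 1.x.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Releases before Java 9 use the 1.x scheme, e.g. "1.8.0_281" is Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Run `java -version`; return None if Java cannot be started or parsed."""
    try:
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
    except OSError:
        return None
    # The banner usually goes to standard error, so both streams are searched.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java was not found or its version could not be read")
    elif major >= 11:
        print("Java", major, "meets the misearch requirement")
    else:
        print("Java", major, "is older than the required version 11")
```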

The misearch engine does not use any special database software (such as Oracle or MySQL); data are stored in a simple ASCII file. The big advantage of this set-up is that it makes the creation and use of misearch databases, and molecule searches, very simple and straightforward. On the other hand, misearch cannot provide sophisticated database functionality, such as storage of all types of additional data or modification of data in an existing database. Each time the data change, the misearch database file must be created from scratch.

Using the misearch engine

Misearch functions are available from the DOS or UNIX command line.

As a first step, a molecular database must be created from SMILES data. Molinspiration can provide free conversion software to transform MDL SDfiles to SMILES.
The database is created with the command

java -jar misearch.jar -f source_file -create > database

source_file is a file describing the molecules to be stored in the database, one molecule per line: a SMILES string, separated by a tab from the molecule identifier (molecule name) and, optionally, additional data.
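
A source file in this layout can be written by hand or generated. The Python sketch below is illustrative only: the function names, molecule names and extra fields are made up, and it assumes that additional data are appended as further tab-separated fields.

```python
from pathlib import Path


def format_source_line(smiles, identifier, extra=()):
    """Return one source-file line: SMILES, tab, identifier, optional extras."""
    fields = [smiles, identifier, *extra]
    for field in fields:
        # A tab or newline inside a field would break the column layout.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def write_source_file(path, molecules):
    """Write (smiles, identifier, extra_fields) tuples, one molecule per line."""
    lines = [format_source_line(s, name, extra) for s, name, extra in molecules]
    # The document describes the data files as plain ASCII text.
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")


# Illustrative entries only.
molecules = [
    ("c1ccccc1", "benzene", ()),
    ("CCO", "ethanol", ("solvent",)),
    ("c1ccncc1", "pyridine", ("heterocycle", "base")),
]

for smiles, name, extra in molecules:
    print(format_source_line(smiles, name, extra))
```

Calling write_source_file("source_file", molecules) would produce a file that can be passed to the -f option.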

This command generates a database file, which is sent to standard output and may be redirected into a file of any name you wish by using the > character.

The created database is a simple text file, one line per molecule, which contains the standardized SMILES, the molecule structure code (a string of characters which encodes the molecular structure in a compact way), the molecule identifier, and any additional data contained in the source file.

Progress of database creation will be shown on the screen. Creation of a database with 10,000 molecules will require about 10 minutes.

New molecules may be added to the end of an existing database file with the command:

java -jar misearch.jar -f source_file >> database
(note the >> redirection operator)
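
The two commands above can also be driven from a script. This is a hedged Python sketch under the assumption that misearch.jar is in the working directory; build_create_command and write_database are hypothetical helper names. It mirrors the commands exactly as shown, so -create is passed only when a new database is created, and the file mode plays the role of the > and >> redirections.

```python
import subprocess


def build_create_command(source_file, create=True, jar="misearch.jar"):
    """Return the argument list for creating or extending a database."""
    command = ["java", "-jar", jar, "-f", source_file]
    if create:
        # The append command shown in the text omits -create.
        command.append("-create")
    return command


def write_database(source_file, database, append=False):
    """Run misearch and send its standard output to the database file."""
    # "wb" corresponds to the shell's > redirection, "ab" to >>.
    mode = "ab" if append else "wb"
    with open(database, mode) as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_create_command(source_file, create=not append),
                       stdout=out, check=True)


print(build_create_command("source_file"))
print(build_create_command("new_molecules", create=False))
```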

Once a database is created, it can be searched with the command:

java -jar misearch.jar -db database -search_type -smi 'target' [-options] > resultFile

where

database is the name of the database

-search_type is the type of search; currently 4 search types are available:
-simisearch similarity search
-sssearch substructure search
-exactsearch exact molecule search
-namesearch text search in the name field

target is the target molecule in SMILES or SMARTS format, or the target text for text searches

possible options are:

-jme hits are provided not as SMILES but as JME strings (a molecule encoded as a JME string may be displayed by the JME applet).

-slimit value sets the minimum required similarity in a similarity search (default value is 0.65).

-nhits n limits the number of hits. In substructure and name searches, the search is terminated when the hit list reaches this limit. In similarity searches, the n most similar molecules are found and sent to the output ordered by their similarity value.

-skip n skips n molecules at the beginning of the database file. It may be used to continue a search after the hit limit was reached in a previous search.

A search command may look, for example, like this:

java -jar misearch.jar -db nci -sssearch -smi 'c1cccccn1=O' > out
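
When searches are launched from a program rather than typed by hand, the command can be assembled as an argument list, which avoids shell quoting of the target. The Python sketch below is an illustration, not a misearch API: the function names are hypothetical, and the checks simply encode rules stated in this document (-text for name searches, -smarts for SMARTS substructure queries, no -slimit together with -nhits).

```python
import subprocess

SEARCH_TYPES = ("simisearch", "sssearch", "exactsearch", "namesearch")


def build_search_command(database, search_type, query, *, smarts=False,
                         jme=False, slimit=None, nhits=None, skip=None,
                         jar="misearch.jar"):
    """Return a misearch search command as an argument list."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search type: {search_type}")
    if slimit is not None and nhits is not None:
        raise ValueError("-slimit cannot be combined with -nhits")
    if smarts and search_type != "sssearch":
        raise ValueError("-smarts is described for substructure searches only")

    if search_type == "namesearch":
        query_flag = "-text"      # name searches take a text query
    elif smarts:
        query_flag = "-smarts"
    else:
        query_flag = "-smi"

    command = ["java", "-jar", jar, "-db", database, "-" + search_type,
               query_flag, query]
    if jme:
        command.append("-jme")
    for flag, value in (("-slimit", slimit), ("-nhits", nhits), ("-skip", skip)):
        if value is not None:
            command += [flag, str(value)]
    return command


def run_search(command, result_file):
    """Run a search; hits go to result_file, standard error text is returned."""
    with open(result_file, "wb") as out:
        done = subprocess.run(command, stdout=out,
                              stderr=subprocess.PIPE, text=True)
    return done.stderr


print(build_search_command("nci", "sssearch", "c1cccccn1=O"))
```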

The output of the search is sent to standard output, which may be redirected to any file. Hits are written one molecule per line, starting with the molecule SMILES (or JME code if the parameter -jme is used) and the molecule identifier / name. For similarity searches, the similarity to the target molecule is also provided. If the original entry contained any additional data, these data are added to the end of the line. The entries in a line are separated by tabs.
When the search is finished, the number of hits and the time required are sent to standard error.
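
A result file in this tab-separated layout is straightforward to read back. The following Python sketch is a hypothetical reader, not part of misearch. It assumes that, for similarity searches, the similarity value is the third field (directly after the identifier); the text above does not state the exact position, so check a real result file before relying on this.

```python
from typing import NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    structure: str                 # SMILES, or a JME string if -jme was used
    identifier: str
    similarity: Optional[float]    # only set for similarity searches
    extra: Tuple[str, ...]         # additional data copied from the source


def parse_hit_line(line, similarity_search=False):
    """Split one result line into its tab-separated parts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least structure and identifier: {line!r}")
    structure, identifier, rest = fields[0], fields[1], fields[2:]
    similarity = None
    if similarity_search:
        # Assumption: the similarity value directly follows the identifier.
        if not rest:
            raise ValueError(f"no similarity value found: {line!r}")
        similarity = float(rest[0])
        rest = rest[1:]
    return Hit(structure, identifier, similarity, tuple(rest))


def read_hits(path, similarity_search=False):
    """Read all hits from a result file, skipping empty lines."""
    with open(path, encoding="ascii") as handle:
        return [parse_hit_line(line, similarity_search)
                for line in handle if line.strip()]


# Made-up example line, for illustration only.
print(parse_hit_line("CCO\tethanol\t0.83\tsolvent\n", similarity_search=True))
```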

Possible search types

Substructure search finds molecules containing a given substructure. Substructure queries may be submitted as SMILES or SMARTS (for SMARTS, use the keyword -smarts instead of -smi). SMARTS syntax allows specification of complex substructure queries. The complete SMARTS specification is implemented, including recursive SMARTS; only stereo SMARTS queries are not supported. Be aware, however, that SMARTS searches are much slower than simple SMILES substructure searches.

Similarity search identifies the molecules most similar to the target structure. Similarity is expressed as a number between 0.0 (no similarity at all) and 1.0 (identical or very similar molecules). Molecules with similarity greater than about 0.7 may be considered reasonably similar. By default, all molecules with similarity to the target greater than 0.65 are identified and sent to the output unordered. This limit may be changed with the modifier -slimit value. When the parameter -nhits n is used, the n most similar molecules are found and sent to the output ordered by their similarity value. The -slimit parameter cannot be used together with the -nhits parameter.
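
The difference between the two output modes can be illustrated on a list of already computed similarity values. This Python sketch only imitates the selection behaviour described above; it does not compute molecular similarity, the scores are made up, and the function names are hypothetical. Whether misearch applies any lower bound in -nhits mode is not stated here, so none is applied.

```python
def select_by_limit(scored, slimit=0.65):
    """Threshold mode: keep hits above slimit, in their original order."""
    return [(name, score) for name, score in scored if score > slimit]


def select_top_n(scored, n):
    """Top-n mode: the n highest scores, ordered by decreasing similarity."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


# Made-up similarity values in database order.
scored = [("mol_a", 0.70), ("mol_b", 0.40), ("mol_c", 0.91), ("mol_d", 0.66)]

print(select_by_limit(scored))        # database order, not sorted by score
print(select_by_limit(scored, 0.8))   # a stricter limit keeps fewer hits
print(select_top_n(scored, 2))        # best two, most similar first
```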

Exact search finds molecules identical to the target SMILES. Stereochemistry is not considered, so all stereoisomers with the same basic connectivity are found.

Name search performs a text search in the name field. For a name search, use the -text searchText parameter to define the text query (instead of the -smi parameter). All names containing the text query are identified; for example, a search for 'thia' identifies both 'thiazole' and 'benzothiazole' as hits.

Interactive web-based structure search

To demonstrate performance of the misearch database engine you may check this site.

For more information about the misearch database engine, or to arrange a free evaluation, contact info(at)molinspiration.com.