Science | MergenKit methodology and principles

Mission

Computational drug discovery has long offered powerful tools, but mainly to teams that could afford both the licensing and the engineering effort needed to operate them. MergenKit exists to close that gap. The platform offers a guided no-code workspace where structure-activity and structure-property relationships are modelled, interpreted, and reported within a single workflow.

The mission is straightforward. Make rigorous predictive modelling routine. Make interpretability the default, not a separate analysis. Make reporting structures match the documentation frameworks used by regulated industries, so the scientific record stands up to scrutiny.

MergenKit is designed for researchers working in drug discovery, including computational scientists and pharmaceutical R&D teams. The platform handles the workflow plumbing so teams can focus on hypothesis generation and candidate evaluation, which is where their scientific judgement matters most.

Principles

Interpretability alongside every prediction

Interpretability is central to the platform. Each prediction is paired with an analysis that links model output to the underlying molecular features, surfacing which descriptors drove the result and how strongly each contributed. The accompanying descriptor dictionary, shown in the same view, provides the mathematical formulation and physical interpretation for every term, so researchers can evaluate the basis of a prediction directly rather than relying on reported metrics alone. This combination guards against the common failure mode where high reported accuracy hides a model that has latched onto an artefact of the training data.

Reproducibility encoded in the export

A trained model is not just weights. It is the configuration, the preprocessing steps, the descriptor selection, the scaling parameters, the validation strategy, and the random seed. MergenKit exports all of these together so the run can be replicated externally, audited, or extended on a new molecular library. This treatment is essential for iterative discovery where the same modelling decisions must be applied consistently across compound batches.
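As a rough illustration of what travelling together can mean in practice, the sketch below bundles the items listed above into one serialisable record and derives a hash from it. The `RunExport` class and all of its field names are hypothetical; they are not MergenKit's actual export schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunExport:
    """Hypothetical export bundle; field names are illustrative only."""
    model_type: str
    hyperparameters: dict
    preprocessing_steps: list
    selected_descriptors: list
    scaling: dict                 # e.g. per-descriptor mean and standard deviation
    validation_strategy: str
    random_seed: int
    weights_file: str = "model.bin"  # placeholder path to the serialised weights

    def to_json(self) -> str:
        # sort_keys gives a stable serialisation, so identical runs hash identically
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        # SHA-256 of the serialised configuration, usable as a run identifier
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


export = RunExport(
    model_type="random_forest",
    hyperparameters={"n_estimators": 500},
    preprocessing_steps=["strip_salts", "neutralise", "canonical_tautomer"],
    selected_descriptors=["logp", "tpsa"],
    scaling={"logp": {"mean": 2.1, "std": 1.3}},
    validation_strategy="scaffold_split",
    random_seed=42,
)

# Round trip: a colleague can rebuild the same record from the exported JSON.
restored = RunExport(**json.loads(export.to_json()))
```

Comparing `restored.fingerprint()` with `export.fingerprint()` is one simple way to confirm that two batches were processed under the same modelling decisions.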

Applicability domain as a standard output

A QSAR or QSPR model is reliable only for molecules that resemble the training distribution. MergenKit attaches an applicability domain analysis to every prediction, identifying whether the input molecule lies within the chemical space represented during model training. Researchers receive a clear signal that separates in-domain predictions, which the training data can support, from out-of-domain predictions that require confirmatory work. This is the foundation of scientifically defensible computational chemistry.
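The source does not say which applicability domain method MergenKit uses. One widely used approach is a nearest-neighbour similarity check, sketched here under the assumption that each molecule is already represented as the set of on-bits of a binary fingerprint. The function names, `k`, and the `threshold` value are illustrative choices, not platform settings.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(query, training, k=3, threshold=0.3):
    """Return (flag, score): score is the mean similarity to the k nearest
    training molecules; the query is flagged in-domain if score >= threshold."""
    if not training:
        raise ValueError("training set is empty")
    similarities = sorted((tanimoto(query, t) for t in training), reverse=True)
    nearest = similarities[:k]
    score = sum(nearest) / len(nearest)
    return score >= threshold, score


# Toy fingerprints; the bit indices are made up for illustration.
training_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 3, 4, 6}),
]
close_flag, close_score = in_domain(frozenset({1, 2, 3, 4}), training_set)
far_flag, far_score = in_domain(frozenset({10, 11, 12}), training_set)
```

In this toy case the first query shares most of its bits with the training set and is flagged in-domain, while the second shares none and is flagged out-of-domain. A suitable threshold depends on the fingerprint and dataset and would need to be tuned.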

Reporting structures aligned with regulated assessments

MergenKit produces scientific records consistent with documentation frameworks used in regulated assessments. The platform structures model documentation in QMRF form and prediction documentation in QPRF form. These records align with the requirements of frameworks like ICH M7 and REACH. MergenKit itself does not act as a regulatory authority and does not perform independent regulatory assessments. The scientific reporting layer remains separate from regulatory decision-making, which is the responsibility of the user organisation. This boundary is deliberate and important.

Validation

Model validation in MergenKit is not a single number reported at the end of training. It is a multi-layered process built into the workflow.

Scaffold-aware partitioning

Standard random splitting underestimates the difficulty of generalising to truly novel chemistry. MergenKit partitions data by scaffold, so molecules that share a core scaffold do not appear on both sides of the train and test split. The resulting performance estimates reflect the model's behaviour on chemistry it has not seen before, which is the question that matters for prospective use.
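A minimal sketch of a scaffold-grouped split follows. It assumes a scaffold key has already been computed for each molecule (for example a Bemis-Murcko scaffold SMILES from a cheminformatics toolkit) and uses a simple largest-group-first assignment; the function name and the assignment rule are assumptions, not a description of MergenKit's partitioning.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold spans both train and test.

    scaffolds: one precomputed scaffold key per molecule (assumed input).
    Returns (train_indices, test_indices).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups first; ties broken by key for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    max_train = len(scaffolds) - round(test_fraction * len(scaffolds))
    train, test = [], []
    for _, members in ordered:
        # A whole group goes to train if it fits, otherwise to test.
        if len(train) + len(members) <= max_train:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Ten molecules across three scaffolds (benzene, pyridine, cyclohexane).
scaffold_keys = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffold_keys, test_fraction=0.2)
```

Because whole groups are assigned at once, the realised test fraction only approximates the requested one, and the rarer scaffolds tend to land in the test set, which makes the evaluation deliberately harder than a random split.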

Cross-validation and independent test sets

Performance is evaluated through both cross-validation and independent test sets. Cross-validation captures the variability of model performance across data folds; the independent test set provides a held-out check against optimistic estimates from the training loop. The two together give researchers a defensible characterisation of model quality.
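The relationship between the two checks can be sketched with the standard library alone: set the independent test set aside first, then cross-validate only on what remains. The helper name `kfold_indices` and the sizes used are illustrative assumptions, and in a scaffold-aware workflow the hold-out would be chosen by scaffold rather than by position as it is here.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the folds are reproducible
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hold out the independent test set before any cross-validation happens.
n_total, n_test = 25, 5
development = list(range(n_total - n_test))              # positions 0..19
held_out_test = list(range(n_total - n_test, n_total))   # positions 20..24

# Cross-validation sees only the development portion.
splits = list(kfold_indices(len(development), n_folds=5, seed=42))
```

The spread of the per-fold scores indicates variability, while the held-out indices are touched only once, after model selection is finished.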

Applicability domain in practice

For every prediction made on new molecules, MergenKit reports whether the input is inside or outside the applicability domain of the trained model. This is reported in QPRF documentation alongside the prediction itself, giving researchers a scientifically grounded basis to act on or set aside individual results.

Calibrated confidence for transfer learning

When data is too sparse for a reliable end-to-end model, the transfer learning module uses a base model trained on open-access chemistry data, then fine-tunes on the smaller target dataset. Predictions on under-labelled compounds are reported with calibrated confidence estimates, so researchers know the realistic uncertainty of each result rather than an overconfident point estimate.
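The source does not specify how the confidence estimates are calibrated. As one possible illustration, split conformal prediction turns the residuals from a held-out calibration set into an interval around each point prediction. The function names and the residual values below are made up for the example.

```python
import math


def conformal_half_width(calibration_residuals, alpha=0.1):
    """Half-width of a split-conformal interval with target coverage 1 - alpha."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]


def prediction_interval(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return point_prediction - half_width, point_prediction + half_width


# Residuals (observed minus predicted) on a calibration set; illustrative values.
residuals = [0.12, -0.30, 0.05, -0.41, 0.22, 0.18, -0.09, 0.35, -0.27, 0.15,
             -0.02, 0.48, -0.33, 0.07, 0.26, -0.19, 0.11, -0.38, 0.29]
half_width = conformal_half_width(residuals, alpha=0.1)
low, high = prediction_interval(6.5, half_width)
```

The coverage guarantee of this technique rests on the calibration compounds being exchangeable with the compounds being predicted, so it complements an applicability domain check rather than replacing it.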

What sets MergenKit apart

A unified pipeline in place of fragmented toolchains

Most computational drug discovery work today moves between separate tools for data preparation, descriptor generation, modelling, interpretation, and reporting. Each handoff introduces opportunities for procedural drift. MergenKit collapses the pipeline into a single guided workflow, so the same configuration drives preprocessing, modelling, interpretation, and reporting. The output of each stage is verified before the next stage begins.

Interpretability integrated into the workflow

Explainability is the platform's default operating mode, not an optional analysis run after the fact. Each prediction surfaces feature attribution, the relevant descriptor dictionary entries, and the applicability domain check in the same interface. Researchers do not move between three tools to interpret one prediction.

Reporting structures generated from the same configuration

QMRF and QPRF are the documentation structures used in regulated assessments. Producing them by hand from notebook outputs is tedious and error-prone. MergenKit generates them automatically from the same configuration that drove the modelling, so the record matches the run.

No programming background required

The workflow runs through configuration menus. A medicinal chemist or a pharmacology team lead can operate the platform without writing or maintaining code. Computational scientists keep the ability to inspect, export, and extend every artefact the workflow produces.

Explore the three analytical modules for a walkthrough of how predictive modelling, transfer learning, and multi-objective optimization share the same workflow infrastructure.

Data preparation

Most modelling failures originate before training begins. Inconsistent stereochemistry, undisclosed duplicates, mismatched tautomers, and salts left attached to the parent molecule all distort the relationship the model is asked to learn. MergenKit treats data preparation as a structured stage in the pipeline rather than a hidden preprocessing step.

Structural validation

Every uploaded molecule passes through structural validation before it can enter the modelling stage. Invalid SMILES, disconnected fragments, and structures inconsistent with the declared target variable are flagged and reported to the user in a single review interface, so the researcher sees the data quality picture before committing to a study.

Standardisation and deduplication

Standardisation aligns tautomeric forms, neutralises charge states, and normalises salt representations consistently across the dataset. Deduplication identifies records that differ only in formatting, in salt presence, or in stereochemistry annotation. This ensures the training set reflects the chemistry the researcher intended to model, rather than a noisy aggregation of records.
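One way to flag records that differ only in stereochemistry annotation, sketched under the assumption that a standard InChIKey has already been computed for each standardised, salt-stripped structure: the first 14-character block of an InChIKey encodes connectivity, so records sharing that block have the same molecular skeleton. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def find_duplicate_groups(records):
    """Group record ids by the connectivity block of their InChIKey.

    records: dicts with "id" and "inchikey" keys (assumed layout).
    Returns only the groups that contain more than one record.
    """
    groups = defaultdict(list)
    for record in records:
        skeleton = record["inchikey"].split("-")[0]  # first block: connectivity
        groups[skeleton].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


records = [
    {"id": "cmpd-1", "inchikey": "QNAYBMKLOCPYGJ-REOHZSLNSA-N"},  # L-alanine
    {"id": "cmpd-2", "inchikey": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"},  # alanine, no stereo
    {"id": "cmpd-3", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},  # aspirin
]
duplicates = find_duplicate_groups(records)
```

The sketch only identifies candidate groups. Whether stereoisomers should be merged, kept apart, or reviewed by hand depends on the endpoint being modelled.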

Descriptor selection

Molecular representations are fully configurable. Researchers choose among descriptor types and fingerprint families based on the requirements of the study, and they can compare alternatives within the same workflow. Dimensionality reduction and feature selection options are available prior to the modelling stage, so researchers can balance signal richness against the risk of overfitting that comes with very high-dimensional feature spaces on smaller datasets.

Model deployment

A model is most useful when it is applied to molecules the team has not yet measured. MergenKit treats deployment as the natural continuation of the workflow, not a separate engineering project.

Once a model is trained and validated, it can be applied directly to new molecular libraries through the same configuration interface that drove the training run. Predictions arrive paired with their applicability domain assessment and feature attribution, so the researcher sees both the prediction and the basis for trusting it. Calibrated confidence estimates accompany predictions from transfer learning runs, so under-labelled compounds are not treated as if they had the same evidence base as well-characterised series.

Trained models also export with their full configuration, preprocessing steps, and scaling parameters. This means the model can be replicated externally, audited by a regulatory affairs colleague, or extended on a new compound batch without depending on the original training environment. The platform is the workflow; the exported model is the artefact the team takes forward.

Reporting

Two documentation structures are central to MergenKit's reporting layer. Both are widely used in regulated assessments and align with the requirements of frameworks like ICH M7 and REACH.

QMRF: model documentation

The QSAR Model Reporting Format documents what a model is, how it was built, what data it was trained on, what its applicability domain is, and how it was validated. MergenKit fills this structure automatically from the same configuration used to train the model. The result is an audit-ready record of the model itself.

QPRF: prediction documentation

The QSAR Prediction Reporting Format documents an individual prediction made using a documented model. It records the input molecule, the model identifier, whether the molecule is inside the applicability domain, and the predicted value with its supporting interpretation. MergenKit attaches a QPRF entry to every reportable prediction.

Alignment, not authority

While MergenKit generates reports compatible with these standards, the platform does not act as a regulatory authority. Regulatory decisions remain the responsibility of the user organisation. This boundary protects the integrity of both the scientific workflow and the regulatory process.

References

Standards documents.

External documentation for the frameworks MergenKit's reporting layer aligns with.

QMRF: QSAR Model Reporting Format

European Commission, Joint Research Centre. Standard structure for documenting QSAR models.

QPRF: QSAR Prediction Reporting Format

European Commission, Joint Research Centre. Standard structure for documenting individual QSAR predictions.

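ICH M7: Assessment and control of DNA reactive (mutagenic) impurities in pharmaceuticals to limit potential carcinogenic risk

International Council for Harmonisation. Guideline for assessing and controlling mutagenic impurities in pharmaceuticals, including (Q)SAR-based assessment.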

REACH: Registration, Evaluation, Authorisation and Restriction of Chemicals

Regulation (EC) No 1907/2006, administered by the European Chemicals Agency. EU regulation for chemical safety assessment.
