RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Standard

RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning. / Ree, Nicolai; Göller, Andreas H.; Jensen, Jan H.

In: Digital Discovery, Vol. 1, No. 2, 2022, pp. 108-114.

Harvard

Ree, N, Göller, AH & Jensen, JH 2022, 'RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning', Digital Discovery, vol. 1, no. 2, pp. 108-114. https://doi.org/10.1039/D1DD00032B

APA

Ree, N., Göller, A. H., & Jensen, J. H. (2022). RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning. Digital Discovery, 1(2), 108-114. https://doi.org/10.1039/D1DD00032B

Vancouver

Ree N, Göller AH, Jensen JH. RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning. Digital Discovery. 2022;1(2):108-114. https://doi.org/10.1039/D1DD00032B

Author

Ree, Nicolai ; Göller, Andreas H. ; Jensen, Jan H. / RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning. In: Digital Discovery. 2022 ; Vol. 1, No. 2. pp. 108-114.

Bibtex

@article{599a3bf0e3934cf189cf69506a03f696,
title = "RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning",
abstract = "We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with a light gradient boosting machine (LightGBM). The model is trained and tested on 21 201 bromination reactions with 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng., 2020, 5, 896–902) though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physics-based RegioSQM20 method (Nicolai Ree, Andreas H. G{\"o}ller, Jan H. Jensen, J. Cheminf., 2021, 13, 10) where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML-model trained on RegioSQM20-generated data.",
author = "Ree, Nicolai and G{\"o}ller, {Andreas H.} and Jensen, {Jan H.}",
year = "2022",
doi = "10.1039/D1DD00032B",
language = "English",
volume = "1",
pages = "108--114",
journal = "Digital Discovery",
issn = "2635-098X",
publisher = "Royal Society of Chemistry",
number = "2",

}

RIS

TY - JOUR

T1 - RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning

AU - Ree, Nicolai

AU - Göller, Andreas H.

AU - Jensen, Jan H.

PY - 2022

Y1 - 2022

N2 - We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with a light gradient boosting machine (LightGBM). The model is trained and tested on 21 201 bromination reactions with 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng., 2020, 5, 896–902) though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physics-based RegioSQM20 method (Nicolai Ree, Andreas H. Göller, Jan H. Jensen, J. Cheminf., 2021, 13, 10) where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML-model trained on RegioSQM20-generated data.

AB - We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with a light gradient boosting machine (LightGBM). The model is trained and tested on 21 201 bromination reactions with 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng., 2020, 5, 896–902) though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physics-based RegioSQM20 method (Nicolai Ree, Andreas H. Göller, Jan H. Jensen, J. Cheminf., 2021, 13, 10) where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML-model trained on RegioSQM20-generated data.

U2 - 10.1039/D1DD00032B

DO - 10.1039/D1DD00032B

M3 - Journal article

VL - 1

SP - 108

EP - 114

JO - Digital Discovery

JF - Digital Discovery

SN - 2635-098X

IS - 2

ER -

ID: 338532152