dc.contributor.author |
de la Broise, JB |
|
dc.contributor.author |
Bernard, N |
|
dc.contributor.author |
Dubuc, JP |
|
dc.contributor.author |
Perlato, A |
|
dc.contributor.author |
Latard, B |
|
dc.contributor.editor |
Ganegoda, GU |
|
dc.contributor.editor |
Mahadewa, KT |
|
dc.date.accessioned |
2022-11-09T08:25:37Z |
|
dc.date.available |
2022-11-09T08:25:37Z |
|
dc.date.issued |
2021-12 |
|
dc.identifier.citation |
J. -B. de la Broise, N. Bernard, J. -P. Dubuc, A. Perlato and B. Latard, "How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case," 2021 6th International Conference on Information Technology Research (ICITR), 2021, pp. 1-6, doi: 10.1109/ICITR54349.2021.9657164. |
en_US |
dc.identifier.uri |
http://dl.lib.uom.lk/handle/123/19439 |
|
dc.description.abstract |
Transformer-based models are widely used in various text processing tasks, such as classification and named entity recognition. Representing scientific texts is a complicated task, and general English BERT models are suboptimal for it. We observe a lack of models for representing multidisciplinary academic texts and, more broadly, a lack of specialized models pretrained on specific domains for which general English BERT models are suboptimal. This paper introduces ScilitBERT, a BERT model pretrained on an inclusive cross-disciplinary academic corpus. ScilitBERT is half as deep as RoBERTa and has a much lower pretraining computation cost. ScilitBERT obtains at least 96% of RoBERTa's accuracy on two academic-domain downstream tasks. The presented cross-disciplinary academic model has been publicly released (https://github.com/JeanBaptiste-dlb/ScilitBERT). The results show that for domains that use a technolect and have a sizeable amount of raw text data, the pretraining of dedicated models should be considered and favored. |
en_US |
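The abstract notes that the ScilitBERT checkpoint has been publicly released; the following is a minimal sketch, assuming the checkpoint loads through the standard Hugging Face transformers auto classes after being obtained from the linked repository. The local path "./ScilitBERT" and the example sentence are hypothetical placeholders, not values documented in this record.

# Minimal sketch (assumption): embedding a scientific sentence with the released
# ScilitBERT checkpoint via the Hugging Face transformers auto classes.
# "./ScilitBERT" is a hypothetical local path to a cloned/downloaded checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./ScilitBERT")
model = AutoModel.from_pretrained("./ScilitBERT")

inputs = tokenizer("ScilitBERT encodes cross-disciplinary academic text.",
                   return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state has shape (batch, tokens, hidden):
# one contextual embedding per input token.
print(outputs.last_hidden_state.shape)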
dc.language.iso |
en |
en_US |
dc.publisher |
Faculty of Information Technology, University of Moratuwa. |
en_US |
dc.relation.uri |
https://ieeexplore.ieee.org/document/9657164/ |
en_US |
dc.subject |
Language models |
en_US |
dc.subject |
Clustering |
en_US |
dc.subject |
Classification |
en_US |
dc.subject |
Association rules |
en_US |
dc.subject |
Benchmarking |
en_US |
dc.subject |
Text analysis |
en_US |
dc.title |
How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case |
en_US |
dc.type |
Conference-Full-text |
en_US |
dc.identifier.faculty |
IT |
en_US |
dc.identifier.department |
Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa. |
en_US |
dc.identifier.year |
2021 |
en_US |
dc.identifier.conference |
6th International Conference on Information Technology Research 2021 |
en_US |
dc.identifier.place |
Moratuwa, Sri Lanka |
en_US |
dc.identifier.proceeding |
Proceedings of the 6th International Conference on Information Technology Research 2021 |
en_US |
dc.identifier.doi |
10.1109/ICITR54349.2021.9657164 |
en_US |