How to pretrain an efficient cross-disciplinary language model: the ScilitBERT use case

la Broise, JBD; Bernard, N; Dubuc, JP; Perlato, A; Latard, B

URI: http://dl.lib.uom.lk/handle/123/19439

DOI: 10.1109/ICITR54349.2021.9657164

Abstract:

Transformer-based models are widely used in text processing tasks such as classification and named entity recognition. Representing scientific text is difficult, and general English BERT models are suboptimal for it. We observe a lack of models for representing multidisciplinary academic text and, more broadly, a lack of specialized models pretrained on specific domains for which general English BERT models are suboptimal. This paper introduces ScilitBERT, a BERT model pretrained on an inclusive cross-disciplinary academic corpus. ScilitBERT is half as deep as RoBERTa and has a much lower pretraining computation cost. ScilitBERT obtains at least 96% of RoBERTa's accuracy on two academic-domain downstream tasks. The presented cross-disciplinary academic model has been publicly released at https://github.com/JeanBaptiste-dlb/ScilitBERT. The results show that, for domains that use a technolect and have a sizeable amount of raw text data, pretraining dedicated models should be considered and favored.
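
To give a rough sense of what "half as deep" means for model size, the sketch below counts the weights in a stack of standard BERT-style encoder layers. It is an illustration only, not the authors' code. The hidden size (768), feed-forward size (3072), and 12-layer reference depth are the usual RoBERTa-base values and are assumed here; the abstract does not state ScilitBERT's width, so the 6-layer figure simply assumes the same width at half the depth. Embedding weights are left out because they do not change with depth.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Count weights and biases in one standard BERT-style encoder layer."""
    # Query, key, value and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


def encoder_stack_params(num_layers: int, hidden: int = 768, ffn: int = 3072) -> int:
    """Count parameters in a stack of identical encoder layers (no embeddings)."""
    return num_layers * encoder_layer_params(hidden, ffn)


# Assumed depths: 12 layers for a RoBERTa-base-sized reference,
# 6 layers for a hypothetical half-depth model of the same width.
full_depth = encoder_stack_params(12)
half_depth = encoder_stack_params(6)

print(f"12-layer encoder stack: {full_depth:,} parameters")
print(f" 6-layer encoder stack: {half_depth:,} parameters")
```

Under these assumptions, halving the depth halves the encoder-stack parameters. The total pretraining cost reported in the paper also depends on corpus size, sequence length, and number of training steps, which this count does not capture.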

Citation:

J. -B. de la Broise, N. Bernard, J. -P. Dubuc, A. Perlato and B. Latard, "How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case," 2021 6th International Conference on Information Technology Research (ICITR), 2021, pp. 1-6, doi: 10.1109/ICITR54349.2021.9657164.

Files in this item

Name: ICITR2021_paper_82.pdf

Size: 347.4Kb

Format: PDF

This item appears in the following Collection(s)

ICITR - 2021 [39]
International Conference on Information Technology Research (ICITR)
