A New NLP Method Based on Data Enhancement [16th Published Paper in the Appl. Sci. Special Issue]

2023-3-25 15:53 | Category: Paper Exchange

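I organized a Special Issue in Applied Sciences (a comprehensive, interdisciplinary journal; CiteScore = 3.70, IF = 2.84). Its theme, "Advances in Big Data Analysis", is fairly broad. The Special Issue was launched mainly in response to the major impact that the rapid growth of accessible data, and of data analysis platforms and tools, has had on the natural and social sciences. We especially welcome (but are not limited to) the following four types of manuscripts: (1) fundamental theoretical analysis in data analysis, such as the predictability of a system (for example, the predictability of time series), minimum-error analysis for classification problems, and analysis of the stability and credibility of data mining results; (2) new methods for data analysis, such as new methods for mining causal relationships (which also relates to Topic 1), new methods for multimodal analysis, and new methods for privacy-preserving computation; (3) new, high-value datasets, data analysis platforms, and data analysis tools; (4) applications of big data analysis methods to the various branches of the natural and social sciences that yield insights. We particularly like applications in disciplines that have traditionally been less quantitative.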




A Joint Domain-Specific Pre-Training Method Based on Data Enhancement


State-of-the-art performance on natural language processing tasks is achieved by supervised learning, specifically by fine-tuning pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers). As models become more accurate, the pre-training corpora behind them keep growing. However, very few studies have explored how to select the pre-training corpus. This paper therefore proposes a domain pre-training method based on data enhancement. First, a pre-training task and a downstream fine-tuning task are trained jointly to alleviate the catastrophic forgetting caused by classical pre-training methods. Then, based on the hard-to-classify texts identified from the downstream task's feedback, the pre-training corpus is reconstructed by selecting texts similar to them from it. Learning from the reconstructed corpus deepens the model's understanding of these hard-to-determine text expressions and thus enhances its feature extraction ability for domain texts. Without any pre-processing of the pre-training corpus, experiments are conducted on two tasks: named entity recognition (NER) and text classification (CLS). The results show that learning the domain corpus selected by the proposed method supplements the model's understanding of domain-specific information and improves the performance of the base pre-trained model, achieving the best results among the benchmark methods compared.
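The corpus reconstruction step in the abstract (pick texts from the pre-training corpus that resemble the hard-to-classify downstream texts) can be illustrated with a small sketch. The abstract does not say which similarity measure the paper uses, so the bag-of-words cosine similarity below is an assumption chosen for simplicity, and the names `bag_of_words`, `cosine`, and `select_similar` are hypothetical helpers, not the authors' code.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(hard_texts, corpus, top_k=2):
    """Return the top_k corpus texts closest to any hard-to-classify text.

    ASSUMPTION: similarity is bag-of-words cosine; the paper may use a
    different measure (for example, model embeddings).
    """
    hard_vecs = [bag_of_words(t) for t in hard_texts]
    scored = []
    for text in corpus:
        vec = bag_of_words(text)
        # Score each corpus text by its best match among the hard texts.
        best = max((cosine(vec, h) for h in hard_vecs), default=0.0)
        scored.append((best, text))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy data, invented purely for illustration.
hard_texts = ["the patient was given aspirin for chest pain"]
corpus = [
    "aspirin is often given to a patient with chest pain",
    "the stock market closed higher today",
    "chest pain can signal a heart problem",
    "the football match ended in a draw",
]
for text in select_similar(hard_texts, corpus, top_k=2):
    print(text)
```

In this toy run the two medical sentences are selected, because they share the most tokens with the hard example. In a joint training loop, the selected texts would form the reconstructed pre-training corpus for the next round, with the hard texts refreshed from the downstream task's feedback.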


