IEPile

IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for the paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus.

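We collected and cleaned existing information extraction (IE) data, integrating 26 English IE datasets and 7 Chinese IE datasets. As shown in Figure 1, these datasets cover multiple domains, including general, medical, and financial.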

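Using the proposed "schema-based round-robin instruction construction method", we built IEPile, a large-scale, high-quality IE fine-tuning dataset containing approximately 0.32B tokens.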
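The page does not spell out how the instruction construction works, so the following is only a minimal sketch of one plausible reading: the labels of a schema are split into fixed-size batches, and each batch is paired with the same input text to form one instruction. The function names, the `batch_size` parameter, the JSON field names, and the instruction wording are all assumptions made for illustration and are not taken from the IEPile code.

```python
import json
from typing import List


def split_schema(labels: List[str], batch_size: int) -> List[List[str]]:
    """Split a list of schema labels into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]


def build_instructions(text: str, labels: List[str], batch_size: int = 4) -> List[str]:
    """Build one JSON-encoded instruction per schema batch for the same input text."""
    instructions = []
    for batch in split_schema(labels, batch_size):
        record = {
            # Hypothetical instruction wording and field names, for illustration only.
            "instruction": "Extract entities of the listed types from the input.",
            "schema": batch,
            "input": text,
        }
        instructions.append(json.dumps(record, ensure_ascii=False))
    return instructions


if __name__ == "__main__":
    labels = ["person", "organization", "location", "date", "disease", "drug"]
    for line in build_instructions("Alice joined ACME in Paris.", labels, batch_size=4):
        print(line)
```

With six labels and a batch size of 4, this sketch yields two instructions for the same text, one covering the first four labels and one covering the remaining two. Every label is queried exactly once, while no single prompt has to carry the full schema.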

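Based on IEPile, we fine-tuned the Baichuan2-13B-Chat and LLaMA2-13B-Chat models with LoRA. Experiments show that the fine-tuned Baichuan2-IEPile and LLaMA2-IEPile models achieve strong results on fully supervised training sets and deliver improvements on zero-shot information extraction tasks.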

If you use IEPile or the code, please cite the following paper:

@article{DBLP:journals/corr/abs-2402-14710,
  author     = {Honghao Gui and Hongbin Ye and Lin Yuan and Ningyu Zhang and Mengshu Sun and Lei Liang and Huajun Chen},
  title      = {IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus},
  journal    = {CoRR},
  volume     = {abs/2402.14710},
  year       = {2024},
  url        = {https://doi.org/10.48550/arXiv.2402.14710},
  doi        = {10.48550/ARXIV.2402.14710},
  eprinttype = {arXiv},
  eprint     = {2402.14710},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2402-14710.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}

Data and Resources

Additional Information

Field	Value
Source	https://huggingface.co/datasets/zjunlp/iepie
Author	Guihonghao
Maintainer	Guihonghao
Last Updated	February 26, 2024, 07:15 (UTC)
Created	February 26, 2024, 07:15 (UTC)