CoRRecT Project
Reference Corpus for the Recognition of Terms

(Corpus de Référence pour la Reconnaissance de Termes)

CoRRecT is a methodology for evaluating systems that recognize terms (SRT) in a corpus. It relies on building a reference corpus in which the occurrences of the recognized terms are tagged. This reference corpus is established with the help of specialists in the domain of the texts.

Evaluation of an SRT

An SRT is evaluated by comparing its results with the reference corpus.
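
The comparison can be pictured as a set operation over tagged occurrences. The sketch below is only an illustration, not the project's actual scoring procedure. It assumes that each occurrence is reduced to a hypothetical (notice id, term id, start, end) tuple and that precision and recall over exact matches are the measures of interest; neither assumption is stated in the description above.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One tagged term occurrence (hypothetical representation)."""
    notice_id: str
    term_id: str
    start: int
    end: int


def compare(reference: set, system: set) -> dict:
    """Score system output against the reference by exact match."""
    correct = reference & system
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return {"precision": precision, "recall": recall}


# Invented data: three reference occurrences, two proposed by the system.
reference = {
    Occurrence("n1", "t12", 0, 2),
    Occurrence("n1", "t40", 7, 8),
    Occurrence("n2", "t12", 3, 5),
}
system = {
    Occurrence("n1", "t12", 0, 2),  # matches the reference
    Occurrence("n2", "t99", 1, 1),  # not in the reference
}
scores = compare(reference, system)
```

Exact matching is the strictest choice; a real evaluation might also credit partial overlaps, which this sketch does not attempt.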

This evaluation is obtained by getting the test data and then sending the results, as described in the sections below.

In return, an evaluation report is sent back as a text file. It consists of:

Getting the test data

Sending the results

Send an email that includes a brief description of the evaluated SRT.
Attach to this email the files containing the results (these files must follow the format described under Formats).

Formats

All files are written in XML syntax.

  • Texts
    Each text consists of several short articles.
    Each short article is tagged by notice and identified by the attribute id.
    The text is tagged by texte. See the DTD.

    Example:
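
As a rough illustration of how such a text file might be read, the sketch below parses a small sample with Python's standard xml.etree.ElementTree module. Only the names texte, notice and id come from the description above; the sample content, the exact nesting and the helper name read_notices are assumptions, and the DTD remains the authority.

```python
import xml.etree.ElementTree as ET

# Invented sample: the real corpus content and nesting may differ.
SAMPLE_TEXT = """<texte>
  <notice id="n1">Crop rotation improves the soil.</notice>
  <notice id="n2">Irrigation must be regular.</notice>
</texte>"""


def read_notices(xml_source: str) -> dict:
    """Map each notice id to the plain text of that short article."""
    root = ET.fromstring(xml_source)
    return {
        notice.get("id"): "".join(notice.itertext())
        for notice in root.iter("notice")
    }


notices = read_notices(SAMPLE_TEXT)
```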

  • Terms
    Each term is tagged by terme and identified by the attribute id. Its regular form is tagged by vedette.
    Optional additional information (translation, syntactic structure, etc.) is tagged by info. See the DTD.

    Example:
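
A term list in this style could be loaded as sketched below. The names terme, id, vedette and info come from the description above; the root element termes, the sample entries and the helper name read_terms are hypothetical, so the DTD should be consulted for the actual structure.

```python
import xml.etree.ElementTree as ET

# Invented sample: the root element name "termes" is an assumption.
SAMPLE_TERMS = """<termes>
  <terme id="t1">
    <vedette>crop rotation</vedette>
    <info>noun + noun</info>
  </terme>
  <terme id="t2">
    <vedette>soil</vedette>
  </terme>
</termes>"""


def read_terms(xml_source: str) -> dict:
    """Map each term id to its regular form and optional information."""
    root = ET.fromstring(xml_source)
    terms = {}
    for terme in root.iter("terme"):
        terms[terme.get("id")] = {
            "vedette": terme.findtext("vedette", default="").strip(),
            # info is optional, so this list may be empty
            "info": [(i.text or "").strip() for i in terme.findall("info")],
        }
    return terms


terms = read_terms(SAMPLE_TERMS)
```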

  • Results of an SRT
    The results of an SRT must be written in XML (see the DTD).
    Each recognized occurrence of a term is tagged by variante. The attribute refterme identifies the term that has been recognized, the attribute statut records any additional information produced by the SRT, and the attributes debut and fin identify the beginning and the end of the occurrence in the text.
    Within the text, ancre tags mark the beginning and the end of each occurrence.

    Example:
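
The sketch below shows one way a result file could be read back into (term, status, surface form) triples. It rests on several assumptions that the description above does not confirm: that ancre elements carry an id attribute, that debut and fin refer to those ids, that variante elements sit directly inside the notice, and that the statut value "exact" is meaningful. The DTD defines the real layout.

```python
import xml.etree.ElementTree as ET

# Invented sample: attribute values and element placement are assumptions.
SAMPLE_RESULT = (
    '<notice id="n1">The <ancre id="a1"/>crop rotation<ancre id="a2"/>'
    ' improves the soil.'
    '<variante refterme="t1" statut="exact" debut="a1" fin="a2"/>'
    '</notice>'
)


def read_occurrences(xml_source: str) -> list:
    """Return (refterme, statut, surface text) for each variante."""
    notice = ET.fromstring(xml_source)
    text = notice.text or ""
    anchors = {}   # anchor id -> character offset in the plain text
    variants = []
    for child in notice:  # assumes ancre/variante are direct children
        if child.tag == "ancre":
            anchors[child.get("id")] = len(text)
        elif child.tag == "variante":
            variants.append(child)
        text += child.tail or ""
    return [
        (
            v.get("refterme"),
            v.get("statut"),
            text[anchors[v.get("debut")]:anchors[v.get("fin")]],
        )
        for v in variants
    ]


occurrences = read_occurrences(SAMPLE_RESULT)
```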


    Constraints: The text of the short article must not be modified. Only adding tags is allowed; no characters may be added or deleted.
    Nevertheless, some minor modifications are tolerated:
    - the conversion of characters into HTML entities,
    - the conversion of HTML entities into characters.
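
These constraints can be checked mechanically before sending results. The following is a minimal sketch, assuming that removing every tag and decoding entities is an adequate normalization; the function names and sample strings are hypothetical, and the regular expression would mis-handle a ">" character inside an attribute value.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def plain_text(markup: str) -> str:
    """Remove tags, then decode HTML entities into characters."""
    return html.unescape(TAG.sub("", markup))


def text_preserved(original: str, tagged: str) -> bool:
    """True if the tagged version only adds tags or converts entities."""
    return plain_text(original) == plain_text(tagged)


# Invented strings for illustration.
original = "Caf&eacute; crops need water."
tagged = 'Café <ancre id="a1"/>crops<ancre id="a2"/> need water.'
altered = 'Café <ancre id="a1"/>crop<ancre id="a2"/> need water.'

ok = text_preserved(original, tagged)    # entity decoded, tags added
bad = text_preserved(original, altered)  # one character was deleted
```

Decoding entities on both sides makes the check insensitive to the two tolerated conversions while still catching any added or deleted character.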

    Publications: