This activity is motivated by the need for automatic ways of extracting information from the huge amount of text available on the web. The approach selected to address this issue relies on algorithms that can automatically learn rules from large numbers of examples. The result of this activity is EntityPro, a system that automatically recognizes four types of named entities (persons, organizations, locations and geo-political entities) using Support Vector Machines (SVMs) as the learning algorithm, trained on examples extracted from the I-CAB corpus. EntityPro is available within the TextPro platform, which is dedicated to the linguistic analysis of texts in Italian.
The aim of this activity is the creation of resources which can be used to train and evaluate technologies that automatically extract information from texts in Italian. The work consists of manually annotating newspaper articles with temporal expressions (e.g. "ieri"), entities of different types, i.e. persons (e.g. "Ciampi"), organizations (e.g. "la FIAT"), locations (e.g. "il Po") and geo-political entities (e.g. "il Trentino Alto Adige"), and relations between entities. The project follows the approach proposed by the ACE (Automatic Content Extraction) initiative, whose annotation guidelines for English have been adapted to Italian. The result of this activity is the I-CAB corpus (Italian Content Annotation Bank), which at present is annotated with temporal expressions and entities. For persons only, mentions in the corpus are annotated in more detail by specifying a set of attributes, such as age (e.g. "diciottenne") and homeland/hometown (e.g. "Trentino"); co-reference between persons mentioned in different documents is also marked.
The activity focuses on the study of algorithms for automatically determining the semantic properties of terms with respect to the expression of subjectivity. The chosen approach is based on the use, innovative for this task, of random-walk models that are well known in the field of Information Retrieval; special attention is paid to identifying the semantic polarity of terms. The result is SentiWordNet, a lexical resource in which each WordNet synset is associated with three numerical scores describing how objective, positive, and negative the terms contained in the synset are. This activity also addresses the study of information extraction methods and techniques for opinion mining, i.e. the automatic annotation of opinions in texts. The approach chosen to address this issue within the project is based on a specific class of machine learning models, Conditional Random Fields. For the evaluation of the technologies under development, a new annotation level has been added to the I-CAB corpus: the annotation of subjective expressions captures the structure of the opinions expressed in the text by identifying the opinion itself (e.g. "miglior giocatore in campo"), the person who expresses it (e.g. "l'allenatore") and its object (e.g. "il capitano").
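As an illustration of the three-score representation described for SentiWordNet, the following minimal Python sketch stores a positive and a negative score per synset and derives the objective score from them. It assumes that the three scores lie in [0, 1] and sum to 1; the class name `SynsetScores`, the `polarity` helper, and the synset identifier and values used in the example are hypothetical and are not taken from the resource itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    """Hypothetical container for the three scores attached to one synset."""

    synset_id: str   # illustrative identifier, not a real SentiWordNet entry
    positive: float
    negative: float

    def __post_init__(self) -> None:
        # Assumption: scores are in [0, 1] and positive + negative <= 1.
        if self.positive < 0 or self.negative < 0:
            raise ValueError("scores must be non-negative")
        if self.positive + self.negative > 1:
            raise ValueError("positive + negative must not exceed 1")

    @property
    def objective(self) -> float:
        # Assumption: the objective score is whatever mass remains.
        return 1.0 - self.positive - self.negative

    def polarity(self) -> str:
        """Coarse polarity label obtained by comparing the two scores."""
        if self.positive > self.negative:
            return "positive"
        if self.negative > self.positive:
            return "negative"
        return "neutral"


if __name__ == "__main__":
    example = SynsetScores("example.a.01", positive=0.75, negative=0.125)
    print(example.objective, example.polarity())
```

Deriving the objective score instead of storing it keeps the three values consistent by construction, at the cost of fixing the sum-to-one assumption in the data structure.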
The objective of this activity is to build a large knowledge base starting from linguistic sources as well as Web resources. The project takes a semi-automatic approach, composed of a boosting phase and an automatic learning phase. In the boosting phase, a number of ontologies are constructed manually. These ontologies are then used in the second phase for the automatic population of other ontologies. The results obtained within the Ontotext project are: (i) a set of domain ontologies on people, occupations and local geography; (ii) a method for the automatic population of a person ontology starting from mentions extracted automatically from texts; (iii) the implementation of a co-reference module, i.e. an algorithm that decides, with a certain degree of confidence, whether two mentions refer to the same entity.
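To make the idea of a co-reference decision with a degree of confidence more concrete, here is a minimal Python sketch. It is not the module developed in the project: the `Mention` class, the name-overlap heuristic, the attribute weights and the threshold are all assumptions chosen for illustration, and the names and places in the usage example are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Mention:
    """Hypothetical person mention with optional attributes."""

    name: str
    occupation: Optional[str] = None
    hometown: Optional[str] = None


def same_entity_confidence(a: Mention, b: Mention) -> float:
    """Return an illustrative score in [0, 1] that a and b co-refer."""
    tokens_a = set(a.name.lower().split())
    tokens_b = set(b.name.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    # Jaccard overlap between the name tokens of the two mentions.
    score = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    # Each attribute known on both sides moves the score by an assumed 0.25.
    for attr in ("occupation", "hometown"):
        value_a, value_b = getattr(a, attr), getattr(b, attr)
        if value_a is None or value_b is None:
            continue
        if value_a.lower() == value_b.lower():
            score = min(1.0, score + 0.25)
        else:
            score = max(0.0, score - 0.25)
    return score


def corefer(a: Mention, b: Mention, threshold: float = 0.5) -> Tuple[bool, float]:
    """Decide whether two mentions co-refer and report the confidence."""
    confidence = same_entity_confidence(a, b)
    return confidence >= threshold, confidence


if __name__ == "__main__":
    first = Mention("Mario Rossi", hometown="Trento")
    second = Mention("Rossi", hometown="Trento")
    print(corefer(first, second))
```

Returning the confidence together with the decision lets a caller apply a stricter threshold when populating an ontology automatically, where a wrong merge is costly.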
Maintainer: bentivofbk.eu
Last modified: Tue Aug 28 2007