This paper presents the "Leipzig Corpus Miner", a technical infrastructure for supporting qualitative and quantitative content analysis. The infrastructure aims at integrating "close reading" procedures on individual documents with procedures of "distant reading", e.g. analysis of lexical characteristics of large document collections. To this end, information retrieval systems, lexicometric statistics and machine learning procedures are combined in a coherent framework which enables qualitative data analysts to make use of state-of-the-art Natural Language Processing techniques on very large document collections. Applicability of the framework ranges from the social sciences to media studies and market research. As an example, we introduce the usage of the framework in a political science study on post-democracy and neoliberalism.
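As a minimal illustration of the kind of lexicometric statistic such "distant reading" relies on, the sketch below ranks terms by Dunning log-likelihood keyness against a reference corpus. The function name and the toy data are invented for this example; this is not code from the Leipzig Corpus Miner itself.

```python
# Hypothetical sketch of one "distant reading" building block: log-likelihood
# keyness, which ranks terms over-represented in a target collection relative
# to a reference corpus. Data and names are illustrative only.
import math
from collections import Counter

def keyness(target_tokens, reference_tokens):
    """Dunning log-likelihood (G2) score for each term of the target."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = sum(t.values()), sum(r.values())
    scores = {}
    for term, a in t.items():
        b = r.get(term, 0)
        # Expected frequencies under the null hypothesis of equal usage.
        e1 = nt * (a + b) / (nt + nr)
        e2 = nr * (a + b) / (nt + nr)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[term] = g2
    return sorted(scores.items(), key=lambda kv: -kv[1])

target = "democracy crisis neoliberal market reform market".split()
reference = "weather football election market holiday".split()
print(keyness(target, reference)[:3])  # most distinctive terms first
```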
Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There is a substantial body of research on Arabic Sentiment Analysis (SA); however, it is generally restricted to Modern Standard Arabic (MSA) or to a few dialects of economic or political interest. In this paper we focus on SA of the Tunisian dialect. We use Machine Learning techniques to determine the polarity of comments written in Tunisian dialect. First, we evaluate the performance of SA systems with models trained on freely available MSA and multi-dialectal data sets. We then collect and annotate a Tunisian dialect corpus of 17,000 comments from Facebook. A model trained on this corpus shows a significant improvement over the best model trained on other Arabic dialects or MSA data. We believe that this first freely available corpus will be valuable to researchers working on Tunisian Sentiment Analysis and in similar areas.
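A hedged sketch of the kind of supervised polarity classifier described above follows. The tiny placeholder comments and the choice of character n-gram TF-IDF with a linear SVM are assumptions for the sketch (character n-grams are a common choice for noisy dialect text), not the authors' exact setup.

```python
# Illustrative polarity classifier for dialect comments: TF-IDF character
# n-grams plus a linear SVM. The four training comments are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

comments = ["behi barcha", "mouch behi", "mezyena barcha", "khayba"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
clf.fit(comments, labels)
print(clf.predict(["mezyena"]))  # expected: "pos" (seen only in a positive example)
```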
legivoc is the first Internet-based platform dedicated to the dissemination, editing and alignment of legal vocabularies across countries. Funded in part by the European Commission and administered by the French Ministry of Justice, legivoc offers a seamless path for governments to disseminate their legal foundations and specify semantic bridges between them. We describe the general principles behind the legivoc framework and provide some ideas about its implementation, with a particular focus on the state-of-the-art tools it includes to help crowdsource the alignment of legal corpora.

1 Introduction

legivoc (all in lower case) is an Internet-based database platform dedicated to the management of multiple legal information terminologies, with a particular focus on vocabularies and their alignments [4]. The system is designed to be used both interactively and as an automated Web service, interoperable with other document management tools or international legislation or translation systems, via a dedicated Application Programming Interface (API). The main goals of legivoc are: (1) to provide access, within a unique framework and using a general formalism, to (ultimately) all the legal vocabularies of the Member States of the European Union; (2) to foster the use of best practices regarding the encoding of these vocabularies using Internet standards such as the Simple Knowledge Organization System (SKOS) and Uniform Resource Identifiers (URIs); (3) to encourage the creation of alignment information between these vocabularies, helping provide bridges between judicial systems based on different laws and languages. The French Ministry of Justice spearheads the project, partly funded by the European Commission and the Ministries of Justice of the Czech Republic, Spain, Finland, France, Italy and Luxembourg. ARMINES and MINES ParisTech are the lead scientific advisors and implementation specialists for the legivoc project. (The site, http://legivoc.org, is open, although in an "alpha" version.)
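To make the SKOS/URI encoding practice concrete, here is a minimal sketch using the rdflib library: a legal concept published with a URI and labels, plus an alignment link to a concept in another country's vocabulary. The URIs below are invented for illustration and are not real legivoc identifiers.

```python
# Minimal SKOS encoding of a legal concept and a cross-vocabulary alignment,
# of the kind legivoc manages. URIs and labels are hypothetical.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
fr = URIRef("http://example.org/fr/vocab/garde-a-vue")  # hypothetical URI
cz = URIRef("http://example.org/cz/vocab/zadrzeni")     # hypothetical URI

g.add((fr, SKOS.prefLabel, Literal("garde à vue", lang="fr")))
g.add((fr, SKOS.prefLabel, Literal("police custody", lang="en")))
# The "semantic bridge": an alignment between two national vocabularies.
g.add((fr, SKOS.closeMatch, cz))

print(g.serialize(format="turtle"))
```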
Content personalization has been one of the major trends in recent Document Engineering research. The "one document for n users" paradigm is being replaced by the "one user, one document" model, where the content delivered to a particular user is generated by some means. This is a very promising approach for e-Government, where personalized government services, including document generation, are increasingly required by users. In this paper, we introduce a method for the generation of personalized documents called Document Product Lines (DPL). DPL allows generating content in domains with high variability and with high levels of reuse. We describe the basic principles underlying DPL and show its application to the e-Government field using the personalized tax statement as a case study.
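The following toy sketch illustrates the product-line idea behind "one user, one document": reusable fragments are assembled, and user features select the variable parts. The fragment names and feature model are invented for this sketch; the actual DPL method is richer than this.

```python
# Toy "one user, one document" assembly: a tax statement built from reusable
# fragments, with user features selecting the optional ones. Hypothetical.
FRAGMENTS = {
    "header":    "Tax statement {year} for {name}",
    "salaried":  "Employment income: {salary} EUR",
    "freelance": "Self-employment income: {turnover} EUR",
    "footer":    "Issued by the tax authority.",
}

def generate(user):
    # Feature selection: include only the fragments this profile enables.
    parts = ["header"]
    parts += [k for k in ("salaried", "freelance") if user.get(k)]
    parts.append("footer")
    return "\n".join(FRAGMENTS[p].format_map(user) for p in parts)

print(generate({"name": "A. Martin", "year": 2014,
                "salaried": True, "salary": 32000}))
```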
The huge amount of textual documents stored in many domains continues to grow at high speed; there is a need to organize it properly so that users can access it easily. Text-mining tools help to process this growing mass of data and to reveal the important information embedded in those documents. However, the field of information retrieval for the Arabic language is relatively new and limited compared to the quantity of research that has been done for other languages (e.g. English, Greek, German, Chinese). In this paper, we propose two statistical approaches to text classification by theme, dedicated to the Arabic language. The evaluation tests are conducted on an Arabic textual corpus covering five themes: Economics, Politics, Sport, Medicine and Religion. This investigation validates several text-mining tools for the Arabic language and shows that the two proposed approaches are effective for Arabic theme classification, with classification performance reaching 95%.
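A hedged sketch of a statistical theme classifier of this kind is shown below: bag-of-words counts with multinomial Naive Bayes over theme labels. The three toy documents are placeholders; the paper's two specific approaches and its corpus are not reproduced here.

```python
# Illustrative multi-class theme classification for Arabic text:
# word counts + multinomial Naive Bayes. Toy documents only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["البورصة والأسواق المالية",   # finance/markets
        "مباراة كرة القدم",           # football match
        "الانتخابات والحكومة"]        # elections and government
themes = ["Economics", "Sport", "Politics"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, themes)
# "أخبار كرة القدم" (football news) shares tokens with the sport document.
print(clf.predict(["أخبار كرة القدم"]))  # expected: "Sport"
```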
The Terminology Coordination Unit (TermCoord) of the European Parliament has an original approach among the terminology coordination services of the ten EU institutions: it has opened up EU terminology to the scientific and industrial sides of the field. This presentation focuses on the access to EU terminology that TermCoord offers through a public website and free tools, through the Interinstitutional Terminology Portal EurTerm and its expert collaborators, and through traineeship opportunities, study visits, open seminars, the publication of studies, university courses, terminology projects based on the interactive IATE template, participation in major international terminology networks, and more. It also covers the constant improvement of the content and functionality of the EU terminology database IATE, including free access to its content and cross-referencing features, as well as TermCoord's cooperation with several universities to explore adding an ontological dimension to its structure of more than 100 domains.
Opinion and trend mining on microblogs like Twitter has recently attracted research interest in several fields, including Information Retrieval and Machine Learning. This paper develops an active learning approach for automatically annotating French-language tweets that deal with the image (i.e., representation, web reputation) of entities such as politicians, celebrities, companies or brands. Our main contribution is the methodology followed to build and provide an original annotated French data set expressing opinion on two French politicians over time. Since the performance of natural language processing tasks is limited by the amount and quality of the data available to them, one promising alternative for some tasks is the propagation of pseudo-expert annotations. The paper focuses on key issues of active learning while building a large annotated data set: noise introduced by human annotators, abundance of data, and the distribution of labels across data and entities.
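The core active learning loop can be sketched as uncertainty sampling: train on the current labelled pool, then route the tweets the model is least confident about to human annotators. The placeholder tweets and the use of logistic regression are assumptions for the sketch; the paper's pipeline and label scheme differ.

```python
# Minimal uncertainty-sampling step: pick the unlabelled tweet the current
# model is least sure about and send it to annotators. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["excellent discours ce soir", "quel fiasco ce débat"]
labels = [1, 0]  # 1 = positive image, 0 = negative image
unlabelled = ["interview intéressante", "réforme catastrophique", "beau discours"]

vec = TfidfVectorizer()
X = vec.fit_transform(labelled + unlabelled)
clf = LogisticRegression().fit(X[:2], labels)

proba = clf.predict_proba(X[2:])
# Least-confident sampling: the smallest top-class probability is the most
# uncertain tweet, hence the best candidate for human annotation.
query = int(proba.max(axis=1).argmin())
print(unlabelled[query])
```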
We address the task of identifying people appearing in TV shows. The target persons are all people whose identity is spoken or written on screen, such as journalists and well-known people: politicians, athletes, celebrities, etc. In our approach, overlaid names displayed on the images are used to identify the persons, without any use of biometric models for speakers and faces. Two identification methods are evaluated as part of the REPERE French evaluation campaign. The first one relies on the co-occurrence times between overlaid person names and speaker/face clusters, and on rule-based decisions which assign a name to each monomodal cluster. The second method uses a Conditional Random Field (CRF) which combines different types of co-occurrence statistics and pairwise constraints to jointly identify speakers and faces.
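A simplified sketch of the first, rule-based method follows: each monomodal cluster (here, a speaker cluster) receives the overlaid name it co-occurs with longest. The timestamps and names are invented; the real system also handles face clusters and the CRF variant.

```python
# Rule-based naming by co-occurrence time: assign to a speaker cluster the
# overlaid name whose on-screen segments overlap it longest. Toy data only.
def overlap(a, b):
    """Duration of the intersection of two (start, end) intervals, seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# Segments during which each overlaid name is displayed on screen.
name_segments = {"Jane Doe": [(10, 25)], "John Smith": [(40, 60)]}
# Segments attributed to one speaker cluster by diarization.
cluster_segments = [(12, 22), (55, 58)]

scores = {
    name: sum(overlap(c, n) for c in cluster_segments for n in segs)
    for name, segs in name_segments.items()
}
best = max(scores, key=scores.get)
print(best, scores)  # "Jane Doe" co-occurs for 10 s, "John Smith" for 3 s
```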
In recent years, European governments and funders, universities and academic societies have increasingly discovered the digital humanities as a new and exciting field that promises new discoveries in humanities research. The funded projects are, however, often ad hoc experiments that stand in isolation from other national and international work. What is lacking is an infrastructure to leverage these pioneering projects into systematic investigations, with methods and technical environments that can be taken up by others. The editors of this special issue are both directors of the Digital Research Infrastructure for the Arts and Humanities (DARIAH), which aims to set up a virtual bridge between the many digital arts and humanities projects across Europe. DARIAH is being developed with the understanding that infrastructure need not imply a seemingly generic and neutral set of technologies, but should be based upon communities.