Open Access BASE2015

Challenges in the linguistic exploitation of specialized republishable web corpora

Abstract

Short paper talk at RESAW 2015 conference (Aarhus, Denmark). International audience.

I would like to present work on text corpora in German, gathered on the Web and processed in order to be made available to linguists and a broader user community via a web interface. The corpora are specialized in the sense that each addresses only one particular text genre or source. Web crawling techniques are used to download the documents, which are then stored roughly the way web archives store them. More precisely, I would like to talk about two cases where texts are expected to be republishable: a "standard" case, political speeches, and a "borderline" case, German blogs under CC license.

The work is performed in the context of a digital dictionary of German. The primary user base consists of lexicographers, who need valuable or at least exploitable evidence, in the form of precise quotes or definition elements.

The actual gathering and processing of the corpora is described elsewhere (anonymized references). In this talk I would like to focus on a series of challenges that have to be solved in order to make data from web archives accessible to researchers and to study web text corpora: metadata extraction, quality assurance, licensing, and "scientificity".

1. Proper metadata extraction is needed in order to make downstream applications possible. It has to be performed meticulously, since experience shows that even small or rare mistakes, in date encoding for instance, may cause the application to be disregarded or discarded by researchers in the humanities: linguistic trends cannot be identified properly if the content is not ordered in time. The easily available metadata of the speeches contrasts with the varied content types, encodings, and markup patterns of the blogs. Compromises have to be made without sacrificing recall, since republishable texts are rather rare.

2. Regarding the content, quality assurance is paramount, since a high quality is expected by users, all the more since they may ...
