International audience ; The emergence of Spatial Humanities has prompted interdisciplinary work on digitized texts, especially since the significance of place names exceeds the usually admitted frame of deictic and indexical functions. From this perspective, I present a visualization of toponym co-occurrences in the literary journal Die Fackel ("The Torch"), published by the satirist and language critic Karl Kraus in Vienna from 1899 until 1936. The distant reading experiments consist in drawing lines on maps in order to uncover patterns which are not easily retraceable during close reading. I discuss their status in the context of a digital humanities study. This is not an authoritative cartography of the work but rather an indirect depiction of the viewpoint of Kraus and his contemporaries. Drawing on Kraus' vitriolic recording of political life, toponyms in Die Fackel tell a story about the ongoing reconfiguration of Europe.
International audience ; The present German political speeches corpus follows from an initial release which has been used in various research contexts. This article documents an updated and extended version: as 2017 marks the end of a legislative period, the corpus now includes the four highest-ranking functions at federal state level. Besides providing a citable reference for this resource, the main contributions are (1) an extensive description of the corpus to be released and (2) the description of an interface to navigate through the texts, designed for researchers beyond the corpus and computational linguistics communities as well as for the general public. The corpus can be considered to be from the 21st century since most speeches were written after 2001 and also because it includes a visualization interface providing synoptic overviews ordered chronologically, by speaker, or by keyword, as well as corresponding access to the texts.
International audience ; The emergence of Spatial Humanities has prompted interdisciplinary work on digitized texts. In this paper, I present a visualization of toponym co-occurrences in the satirical literary magazine "Die Fackel" ("The Torch"), published by the satirist and language critic Karl Kraus in Vienna from 1899 until 1936. I set out on a distant reading experiment which consists in drawing lines on the map in order to uncover patterns which are not easily retraceable during close reading. The first map displays unfiltered lines of thought, whereas the second one is based on a qualitatively refined analysis. I also discuss their status in the context of a digital humanities study. The maps are to be released as an additional feature of the existing digital edition. Software and gazetteer are available under open-source licenses. Drawing on Kraus' vitriolic recording of political life, toponyms in Die Fackel tell a story about the ongoing reconfiguration of Europe.
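The idea of turning toponym co-occurrences into lines on a map can be illustrated with a minimal sketch. It is not the software mentioned in the abstract: the tiny gazetteer, the sample segments, and the function names `cooccurrences` and `as_lines` are hypothetical, and co-occurrence is assumed here to mean "two place names in the same text segment".

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-gazetteer (assumption): place name -> (latitude, longitude).
GAZETTEER = {
    "Wien": (48.21, 16.37),
    "Berlin": (52.52, 13.40),
    "Prag": (50.08, 14.44),
}

def cooccurrences(segments, gazetteer):
    """Count unordered pairs of gazetteer toponyms found in the same segment."""
    counts = Counter()
    for segment in segments:
        # Crude tokenization: strip commas and periods, split on whitespace.
        tokens = set(segment.replace(",", " ").replace(".", " ").split())
        found = sorted(t for t in tokens if t in gazetteer)
        counts.update(combinations(found, 2))
    return counts

def as_lines(counts, gazetteer):
    """Turn pair counts into (start_coords, end_coords, weight) map lines."""
    return [(gazetteer[a], gazetteer[b], n) for (a, b), n in counts.items()]

# Invented sample segments, for illustration only.
segments = [
    "Von Wien nach Berlin.",
    "Berlin, Prag und Wien.",
]
pair_counts = cooccurrences(segments, GAZETTEER)
lines = as_lines(pair_counts, GAZETTEER)
```

Each resulting tuple holds two coordinate pairs and a weight, which a mapping library could draw as a line whose thickness reflects how often the two places are mentioned together.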
International audience ; This study focuses on isolated error detection in a retro-digitized newspaper corpus published from 1946 to 1990 in the former German Democratic Republic. As there are OCR errors throughout the corpus but no clean reference for this variant of German, automatic OCR correction requires overcoming data sparseness and non-standard spelling, including compounds and inflected forms. The contributions of this paper are (1) a method to bootstrap detection of potential misspellings, (2) an assessment of several types of training data, and (3) an evaluation of several off-the-shelf candidate selection techniques. The chosen solution, based on statistical affix analysis, reaches an accuracy 10 points higher than existing morphological analysis systems on error detection, while a combination of fuzzy and approximate string search performs best for error correction. The criteria are met since it is possible to correct erroneous tokens without introducing too much noise.
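The two-step logic of flagging a suspect token and then selecting a correction candidate can be sketched as follows. This is only an illustration under stated assumptions: it swaps in a plain lexicon lookup for the statistical affix analysis and the similarity ratio of Python's standard `difflib` module for the fuzzy and approximate string search evaluated in the paper. The lexicon, the cutoff of 0.8, and the function names are hypothetical.

```python
import difflib

# Hypothetical lexicon of known-good word forms (assumption for illustration).
LEXICON = ["Regierung", "Zeitung", "Arbeiter", "Republik", "Genossenschaft"]

def is_suspect(token, lexicon):
    """Flag a token as a potential OCR error if it is not in the lexicon."""
    return token not in lexicon

def candidates(token, lexicon, n=3, cutoff=0.8):
    """Return up to n lexicon entries similar to the token, best match first."""
    return difflib.get_close_matches(token, lexicon, n=n, cutoff=cutoff)

def correct(token, lexicon):
    """Replace a suspect token by its best candidate; otherwise keep it."""
    if not is_suspect(token, lexicon):
        return token
    found = candidates(token, lexicon)
    # Leaving the token untouched when nothing is close enough limits noise.
    return found[0] if found else token

fixed = [correct(t, LEXICON) for t in ["Regiernng", "Zeitung", "xyzxyz"]]
```

The cutoff controls the trade-off named in the abstract: a higher value corrects fewer tokens but introduces less noise.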
Short paper talk at RESAW 2015 conference (Aarhus, Denmark). ; International audience ; I would like to present work on text corpora in German, gathered on the Web and processed in order to be made available to linguists and a broader user community via a web interface. The corpora are specialized in the sense that they only address a particular text genre or source at a time. Web crawling techniques are used to download the documents, which are then stored roughly in the way web archives do. More precisely, I would like to talk about two cases where texts are expected to be republishable: a "standard" case, political speeches, and a "borderline" case, German blogs under CC license. The work is performed in the context of a digital dictionary of German. The primary user base consists of lexicographers, who need valuable or at least exploitable evidence, in the form of precise quotes or definition elements. The actual gathering and processing of the corpora is described elsewhere (anonymized references). In this talk I would like to focus on a series of challenges that are to be solved in order to make data from web archives accessible to researchers and to study web text corpora: metadata extraction, quality assurance, licensing, and "scientificity". 1. Proper metadata extraction is needed in order to make further downstream applications possible. It has to be performed meticulously, since experience shows that even small or rare mistakes, in date encoding for instance, may cause the application to be disregarded or discarded by researchers in the humanities, since linguistic trends cannot be identified properly if the content is not ordered in time. Easily available metadata in the case of speeches contrasts with the different content types, encodings, and markup patterns found in the blogs. Compromises have to be made without sacrificing recall, since republishable texts are rather rare. 2. Regarding the content, quality assurance is paramount, since a high quality is expected by users, all the more since they may ...
International audience ; The resource presented here consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources. It provides raw data, metadata, and tokenized text with part-of-speech tagging and lemmas in XML TEI format for researchers who are able to use it, and a simple visualization interface for those who want to get a glimpse of what is in the corpus before downloading it or thinking about using more complete tools. The visualization output is in valid CSS/XHTML format and takes advantage of recent standards. The purpose is to give a sort of Zeitgeist, an insight into the topics developed by a government official and into the evolution in the use of general concepts. This resource is freely available: http://purl.org/corpus/german-speeches
I would like to contribute, with the means at my disposal, to bringing the technical environment and philosophical questions closer together. The text comprises clarifications of concepts (technology, tools and instruments, technoscience), an overview of the basic themes (the state of affairs at the historical, political, and scientific levels), and finally examples that connect the considerations presented with current facts and debates.
International audience ; Following the assumption that the tech blog sphere represents an avant-garde of technologically and socially interested experts, we describe an experimental setting to observe its input on the public discussion of matters situated at the intersection of technology and society. Our interdisciplinary approach consists in joining forces on a common base of texts and tools. This cooperation stems from work on the impact of digital media on democratic processes and institutions (GHI/RRCHNM) and corpus and computational linguistics (BBAW). The major aims of the effort described here are twofold: (1) compiling a text base (for German and English) from a curated list of blogs dedicated to technological topics for lexicographical and linguistic research, as well as (2) conducting exemplary studies using the compiled corpus, focusing on specific research questions regarding public discourse in Germany and the United States on questions of internet policy.
This paper analyzes the internet policy discourse regarding the German Network Enforcement Act (NetzDG) in different media settings. We examine the conversation about this highly controversial anti-hate speech law on IT blogs and websites and in German daily newspapers. We compare the positions brought forward in these different media environments concerning one of the most important topics within the discussion about the NetzDG, specifically the question of whether or not the law will result in censorship, limiting users' freedom of expression. We employ pretrained transformer-based language models to detect and quantify recurring arguments in the debate.