Search results
OdiEnCorp 2.0
In: http://hdl.handle.net/11234/1-3211
Data
----
We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel corpora such as OdiEnCorp 1.0 and PMIndia, and from books which contain both English and Odia text, such as grammar and bilingual literature books. We also included parallel text from multiple public websites such as Odia Wikipedia, Odia digital library, and Odisha Government websites. The parallel corpus covers many domains: the Bible, other literature, Wiki data relating to many topics, Government policies, and general conversation. We processed the raw data collected from the books and websites, performed sentence alignment (a mix of manual and automatic alignments), and released the corpus in a form suitable for various NLP tasks.

Corpus Format
-------------
OdiEnCorp 2.0 is stored in simple tab-delimited plain text files, each with three tab-delimited columns:
- a coarse indication of the domain
- the English sentence
- the corresponding Odia sentence

The corpus is shuffled at the level of sentence pairs.

The coarse domains are:
books        prose text
dict         dictionaries and phrasebooks
govt         partially formal text
odiencorp10  OdiEnCorp 1.0 (mix of domains)
pmindia      PMIndia (the original corpus)
wikipedia    sentences and phrases from Wikipedia

Data Statistics
---------------
The statistics of the current release are given below. Note that the statistics differ from those reported in the paper due to deduplication at the level of sentence pairs. The deduplication was performed within each of the dev set, test set and training set, taking the coarse domain indication into account. It is still possible that the same sentence pair appears more than once within the same set (dev/test/train) if it came from different domains, and it is also possible that a sentence pair appears in several sets (dev/test/train).
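The three-column layout and the domain-aware deduplication described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' processing script: the sample rows are invented (they are not taken from the corpus), and the function names read_pairs and dedup are hypothetical.

```python
import io

def read_pairs(stream):
    """Yield (domain, english, odia) tuples from tab-delimited lines."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three columns
        yield tuple(parts)

def dedup(pairs, use_domain=True):
    """Drop repeated sentence pairs, keeping first occurrences in order.

    With use_domain=True the domain is part of the key, so the same
    sentence pair may survive once per domain.
    """
    seen = set()
    kept = []
    for domain, en, od in pairs:
        key = (domain, en, od) if use_domain else (en, od)
        if key not in seen:
            seen.add(key)
            kept.append((domain, en, od))
    return kept

# Invented sample rows in the three-column format (not real corpus lines).
sample = (
    "dict\thello\tନମସ୍କାର\n"
    "dict\thello\tନମସ୍କାର\n"
    "books\thello\tନମସ୍କାର\n"
    "wikipedia\twater\tପାଣି\n"
)

pairs = list(read_pairs(io.StringIO(sample)))
by_domain = dedup(pairs, use_domain=True)
ignoring_domain = dedup(pairs, use_domain=False)
print(len(pairs), len(by_domain), len(ignoring_domain))  # prints: 4 3 2
```

In the sample, the pair that occurs under both dict and books survives domain-aware deduplication twice, which mirrors the caveat that a sentence pair can still repeat within a set when it comes from different domains.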
Parallel Corpus Statistics
--------------------------

                           Dev                       Test                      Train
                Sents     # EN     # OD    Sents     # EN     # OD    Sents     # EN     # OD
books            3523    42011    36723     3895    52808    45383     3129    40461    35300
dict             3342    14580    13838     3437    14807    14110     5900    21591    20246
govt                -        -        -        -        -        -      761    15227    13132
odiencorp10       947    21905    19509     1259    28473    24350    26963   704114   602005
pmindia          3836    70282    61099     3836    68695    59876    30687   551657   486636
wikipedia        1896     9388     9385     1917    21381    20951     1930     7087     7122
Total           13544   158166   140554    14344   186164   164670    69370  1340137  1164441

"Sents" are the counts of the sentence pairs in the given set (dev/test/train) and domain (books/dict/...).
"# EN" and "# OD" are approximate counts of words (simply space-delimited, without tokenization) in English and Odia.

The total number of sentence pairs (lines) is 13544+14344+69370=97258. Ignoring the set and domain and deduplicating again, this number drops to 94857.

Citation
--------
If you use this corpus, please cite the following paper:

@inproceedings{parida2020odiencorp,
  title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
  author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
  booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
  pages={14--19},
  year={2020}
}
BASE
Ottoman-Southeast Asian Relations: sources from the Ottoman Archives
In: Handbook of Oriental Studies. Section 1, The Near and Middle East, volume 133
Ottoman-Southeast Asian Relations: Sources from the Ottoman Archives is the product of a meticulous study by İsmail Hakkı Kadı, A.C.S. Peacock and other contributors of historical documents from the Ottoman archives. The work contains documents in Ottoman-Turkish, Malay, Arabic, French, English, Tausug, Burmese and Thai, each introduced by an expert in the language and history of the related country. It includes documents hitherto unknown to historians as well as others that had been unearthed before but remained confined to the small number of scholars who had access to the Ottoman archives. The sources published in this study show that the Ottoman Empire was an active participant in the Southeast Asian experience with Western colonialism. The fact that the extensive literature on this experience has made limited use of Ottoman source materials indicates the importance of this publication for future research in the field. Contributors are: Giancarlo Casale, Annabel Teh Gallop, Rıfat Günalan, Patricia Herbert, Jana Igunma, Midori Kawashima, Abraham Sakili and Michael Talbot.
Maltese-English parallel corpus MaCoCu-mt-en 1.0
In: http://hdl.handle.net/11356/1525
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor).

Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicated paragraphs, as well as documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and BicleanerAI (https://github.com/bitextor/bicleaner-ai) and Bifixer (https://github.com/bitextor/bifixer) were used for fixing, cleaning, and deduplicating the final version of the corpus.

While the TXT format consists solely of pairs of source and target segments (one or several sentences), each segment pair in the TMX format is accompanied by the following metadata:
- source and target document URL;
- quality score as provided by the tool BicleanerAI;
- translation direction identification: the source segment in each segment pair was identified by using a probabilistic model;
- personal information identification ("biroamer-entities"): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
- language variants: the language variant of English (British or American) was identified for every segment pair on document and domain level.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material.
(4) Please write to the contact person for this resource whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains.
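As an illustration of the per-segment TMX metadata listed in the corpus description above, the following Python sketch reads translation units with the standard library and filters them by a score. It relies only on generic TMX structure (tu, prop, tuv, seg). The sample document is invented, and the property name "quality-score" and the 0.5 threshold are hypothetical placeholders; the property names actually used in the released TMX file should be checked before adapting this.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; "quality-score" is a hypothetical property name.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="mt" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <prop type="quality-score">0.91</prop>
      <tuv xml:lang="mt"><seg>Bonġu</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
    </tu>
    <tu>
      <prop type="quality-score">0.12</prop>
      <tuv xml:lang="mt"><seg>Grazzi</seg></tuv>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_tmx_units(xml_text):
    """Return one dict per <tu>, with its properties and segments by language."""
    root = ET.fromstring(xml_text)
    units = []
    for tu in root.iter("tu"):
        # Collect every unit-level <prop> without assuming specific names.
        props = {p.get("type"): (p.text or "") for p in tu.findall("prop")}
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            segments[tuv.get(XML_LANG)] = text
        units.append({"props": props, "segments": segments})
    return units

units = read_tmx_units(SAMPLE_TMX)

# Keep only units whose (hypothetical) score reaches an arbitrary threshold.
kept = [u for u in units if float(u["props"]["quality-score"]) >= 0.5]
print(len(units), len(kept))  # prints: 2 1
```

For a full-size corpus file, a streaming parser such as xml.etree.ElementTree.iterparse would likely be a better fit than loading the whole document into memory.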
BASE