Data Similarity in Classification and Fictitious Training Data Generation
In: Operations Research Proceedings 2008, pp. 395-400
In: (2023) 109 Iowa Law Review, Forthcoming
SSRN
In: ACM journal on computing and sustainable societies, Volume 2, Issue 1, pp. 1-18
ISSN: 2834-5533
A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to a select number of highly resourced languages, causing detection systems either to under-perform or not to exist in limited data contexts. This is mostly caused by a lack of training data, which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment of the original examples but transfer the hate targets. We apply our approach to generate training data for hate speech classification tasks in Hindi and Vietnamese. Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain. This method can be adopted to bootstrap hate speech detection models from scratch in limited data contexts. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacity to detect, understand, and respond to hate speech.
Disclaimer:
This work contains terms that are offensive and hateful. These, however, cannot be avoided due to the nature of the work.
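As a rough illustration of the target-transfer idea described in the abstract above, here is a minimal Python sketch; the `translate` helper, the placeholder example, and the lexicon of locally relevant targets are assumptions introduced purely for illustration and are not the authors' actual pipeline (which comprises three distinct synthesis methods).

```python
# A minimal sketch of target-transfer augmentation, assuming a hypothetical
# machine-translation helper `translate(text, target_lang)` and a small,
# manually curated target lexicon; this illustrates the general idea
# (keep the hateful sentiment, swap the named target), not the authors' method.
import random

# Hypothetical source example with an explicit target placeholder (not real data).
SOURCE_EXAMPLES = [
    "I can't stand <TARGET>, they ruin everything.",
]

# Hypothetical lexicon of locally relevant hate targets for the target locale.
TARGET_GROUPS_HI = ["<group_a>", "<group_b>"]

def augment(example: str, target_groups: list, translate) -> str:
    """Swap in a locale-specific target, then translate into the target language."""
    swapped = example.replace("<TARGET>", random.choice(target_groups))
    return translate(swapped, target_lang="hi")

# Usage, with any translation backend supplied as `translate`:
# synthetic = [augment(e, TARGET_GROUPS_HI, translate) for e in SOURCE_EXAMPLES]
```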
In: Teaching sociology: TS, Volume 18, Issue 1, p. 123
ISSN: 1939-862X
In: Advances in Data Analysis and Classification, 2020
SSRN
In: The review of socionetwork strategies, Volume 16, Issue 2, pp. 479-492
ISSN: 1867-3236
Sensor networks have drawn much attention because of their promising applications in environmental monitoring, seismology, and military surveillance. Despite increasing interest, sensor network research is still in its initial phase. Few real systems have been deployed, and little data is available to test proposed protocol and data management designs. Most sensor network research to date uses randomly generated data inputs to simulate systems. Some researchers have proposed using environmental monitoring data obtained from remote sensing or in-situ instrumentation. In many cases, neither of these approaches is suitable, because the data are either collected on a regular grid topology or are too coarse-grained. This paper proposes to use synthetic data generation techniques to generate irregular-topology data from data sets measured on a grid. To tackle this problem, we investigate the use of the available sparsely sampled data sets, model the spatio-temporal correlation in these data sets, and generate irregular-topology data based on empirical models of the experimental data. Our goal is to more realistically evaluate sensor network system designs before large-scale field deployment. In obtaining these synthetic data sets, we draw heavily on techniques developed in geo-statistics and other spatial interpolation methods, but appropriately modify them for the application at hand. Our evaluation results on a radar data set of weather observations show that the spatial correlation of the original and synthetic data is similar. Moreover, visual comparison shows that the synthetic data retains interesting properties (e.g., edges) of the original data. Our case study on the DIMENSIONS system demonstrates how synthetic data helps to evaluate the system over an irregular topology, and points out the need to improve the algorithm.
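The grid-to-irregular-topology step described above can be approximated with plain spatial interpolation; the sketch below uses SciPy's `griddata` on a synthetic placeholder field as a stand-in for the geostatistical models the authors actually employ, so it is a simplified illustration under those assumptions rather than their method.

```python
# Interpolate a regularly gridded field onto irregular (simulated) sensor
# locations; a stand-in for the geostatistical modelling used in the paper.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Placeholder gridded field, e.g. one weather-radar snapshot on a 50x50 grid.
nx, ny = 50, 50
xs, ys = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
field = np.sin(xs / 8.0) + np.cos(ys / 11.0)

# Irregular sensor topology: 200 random deployment points inside the grid.
sensors = rng.uniform(low=0.0, high=[nx - 1, ny - 1], size=(200, 2))

# Readings at the irregular locations, obtained by linear interpolation.
grid_points = np.column_stack([xs.ravel(), ys.ravel()])
readings = griddata(grid_points, field.ravel(), sensors, method="linear")
```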
BASE
In the UK, genomic health data is being generated in three major contexts: in the healthcare system (based on clinical indication), in large-scale research programmes, and for purchasers of direct-to-consumer genetic tests. The recently delivered hybrid clinical/research programme, the 100,000 Genomes Project, set the scene for a new Genomic Medicine Service, through which the National Health Service aims to deliver consistent and equitable care informed by genomics, while providing data to inform academic and industry research and development. In parallel, a large-scale research study, Our Future Health, has UK government and industry investment and aims to recruit 5 million volunteers to support research intended to improve early detection, risk stratification, and early intervention for chronic diseases. To explore how current models of genomic health data generation intersect, and to understand the clinical, ethical, legal, policy and social issues arising from this intersection, we conducted a series of five multidisciplinary panel discussions attended by 28 invited stakeholders. Meetings were recorded and transcribed. We present a summary of the issues identified: genomic test attributes; reasons for generating genomic health data; individuals' motivation to seek genomic data; health service impacts; role of genetic counseling; equity; data uses and security; consent; governance and regulation. We conclude with some suggestions for policy consideration.
BASE
In: Software: Practice & Experience, Volume 42, Issue 11, pp. 1331-1362
Automatic test data generation is a very popular domain in the field of search-based software engineering. Traditionally, the main goal has been to maximize coverage. However, other objectives can be defined, such as the oracle cost, which is the cost of executing the entire test suite and of checking the system behavior. Indeed, in very large software systems, the cost of testing the system can be an issue, so it makes sense to consider two conflicting objectives: maximizing the coverage and minimizing the oracle cost. This is what we did in this paper. We mainly compared two approaches to the multi-objective test data generation problem: a direct multi-objective approach and a combination of a mono-objective algorithm with multi-objective test case selection. Concretely, in this work, we used four state-of-the-art multi-objective algorithms and two mono-objective evolutionary algorithms followed by a multi-objective test case selection based on Pareto efficiency. The experimental analysis compares these techniques on two different benchmarks. The first is composed of 800 Java programs created with a program generator. The second is composed of 13 real programs extracted from the literature. In the direct multi-objective approach, the results indicate that the oracle cost can be properly optimized; however, achieving full branch coverage of the system poses a great challenge. Regarding the mono-objective algorithms, although they need a second phase of test case selection to reduce the oracle cost, they are very effective in maximizing branch coverage. Funding: Spanish Ministry of Science and Innovation and FEDER under contract TIN2008-06491-C04-01 (the M project); Andalusian Government under contract P07-TIC-03044 (DIRICOM project).
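The second-phase "multi-objective test case selection based on Pareto efficiency" mentioned above boils down to keeping only non-dominated (coverage, oracle cost) trade-offs; a minimal sketch follows, with illustrative field names and toy values that are not taken from the paper.

```python
# Pareto-efficient selection over two objectives: maximize coverage,
# minimize oracle cost. Field names and example values are illustrative.
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    coverage: float     # fraction of branches covered (maximize)
    oracle_cost: float  # cost of running and checking the test (minimize)

def dominates(a: TestCase, b: TestCase) -> bool:
    """a dominates b if it is no worse on both objectives and better on at least one."""
    return (a.coverage >= b.coverage and a.oracle_cost <= b.oracle_cost
            and (a.coverage > b.coverage or a.oracle_cost < b.oracle_cost))

def pareto_front(cases):
    return [c for c in cases if not any(dominates(o, c) for o in cases)]

# Example: t2 is dominated by t1 (same coverage, higher cost) and is dropped.
suite = [TestCase("t1", 0.80, 5.0), TestCase("t2", 0.80, 9.0), TestCase("t3", 0.95, 12.0)]
print([c.name for c in pareto_front(suite)])  # ['t1', 't3']
```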
BASE
This paper proposes a hybrid model (HyM) for a heating, ventilation and air conditioning (HVAC) system installed in a passenger train. The HyM fuses data from two sources: data taken from the real system and synthetic data generated using a physics-based model of the HVAC. The physical model of the HVAC was developed to include the sensors located in the real system as well as new virtual sensors reproducing the behaviour of the system while a failure mode (FM) is simulated. Statistical features are calculated from the selected signals. These features are labelled according to the related FMs and are merged with the features calculated from the data from the real system. This data fusion allows us to classify the condition indicators of the system according to the FMs. The merged features are used to train a neural network (NN), which achieves remarkable accuracy. Accuracy is a key concern for future research on the detection and diagnosis of multiple faults and the estimation of the remaining useful life (RUL) through prognosis. The outcome is beneficial for the proper functioning of the system and the safety of the passengers. Funder: Basque Government (KK-2020/0004); ISBN for host publication: 978-92-990084-6-1
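A minimal sketch of the fusion-and-classification idea described above is given below; the feature set, signal shapes, labels and the scikit-learn MLP are placeholders chosen for illustration, not the configuration reported in the paper.

```python
# Merge features computed from real and physics-model (synthetic) signals,
# then train a small neural network over failure-mode labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

def stat_features(signal):
    """Simple per-signal condition indicators: mean, std, min, max, RMS."""
    return np.array([signal.mean(), signal.std(), signal.min(), signal.max(),
                     np.sqrt(np.mean(signal ** 2))])

rng = np.random.default_rng(1)
# Placeholder raw signals (n_samples, signal_length) and failure-mode labels.
real_sig = rng.normal(size=(100, 256))
real_y = rng.integers(0, 3, size=100)
synth_sig = rng.normal(size=(300, 256))
synth_y = rng.integers(0, 3, size=300)

# Feature extraction and data fusion.
X = np.vstack([np.apply_along_axis(stat_features, 1, real_sig),
               np.apply_along_axis(stat_features, 1, synth_sig)])
y = np.concatenate([real_y, synth_y])

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500).fit(X, y)
```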
BASE
In: Asian journal of research in social sciences and humanities: AJRSH, Volume 6, Issue 12, p. 277
ISSN: 2249-7315
In: Journal of privacy and confidentiality, Volume 11, Issue 3
ISSN: 2575-8527
This paper describes PrivBayes, a differentially private method for generating synthetic datasets that was used in the 2018 Differential Privacy Synthetic Data Challenge organized by NIST.
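PrivBayes itself learns a Bayesian network over the attributes and adds noise to its conditional distributions; as a much simpler illustration of the shared "perturb counts under differential privacy, then sample" idea, the sketch below synthesizes a single categorical column from a Laplace-noised marginal. It is explicitly not the PrivBayes algorithm.

```python
# Laplace-noised one-way marginal, then sampling: a toy epsilon-DP synthesizer
# for one categorical column (histogram sensitivity 1 under add/remove of a record).
import numpy as np

def noisy_marginal(values, categories, epsilon, rng):
    """Return a Laplace-noised, clipped, renormalised category distribution."""
    counts = np.array([(values == c).sum() for c in categories], dtype=float)
    counts += rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    counts = np.clip(counts, 0.0, None)
    return counts / counts.sum()

def synthesize(column, n, epsilon, seed=0):
    rng = np.random.default_rng(seed)
    categories = sorted(set(column.tolist()))
    p = noisy_marginal(column, categories, epsilon, rng)
    return rng.choice(categories, size=n, p=p)

# Example: synthesize 10 records from a tiny column with epsilon = 1.0.
real = np.array(["A", "A", "B", "C", "B", "A"])
print(synthesize(real, n=10, epsilon=1.0))
```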
In: Urban Planning, Volume 1, Issue 2, pp. 88-100
In this paper we outline the methodological development of current research into urban community formations based on combinations of qualitative (volunteered) and quantitative (spatial analytical and geo-statistical) data. We outline a research design that addresses problems of data quality relating to credibility in volunteered geographic information (VGI) intended for Web-enabled participatory planning. Here we have drawn on a dual notion of credibility in VGI data, and propose a methodological workflow to address its criteria. We propose a 'super-positional' model of urban community formations, and report on the combination of quantitative and participatory methods employed to underpin its integration. The objective of this methodological phase of the study is to enhance confidence in the quality of data for Web-enabled participatory planning. Our participatory method has been supported by rigorous quantification of area characteristics, including participant communities' demographic and socio-economic contexts. This participatory method provided participants with a ready and accessible format for observing and mark-making, which allowed the investigators to rapidly iterate a system design based on participants' responses to the workshop tasks. Participatory workshops have involved secondary school-age children in socio-economically contrasting areas of Liverpool (Merseyside, UK), which offers a test-bed for comparing community formations across contrasting contexts, while bringing an under-represented section of the population into a planning domain; their experience may stem from public and non-motorised transport modalities. Data have been gathered through one-day participatory workshops featuring questionnaire surveys, local site analysis, perception mapping and brief textual descriptions. This innovative approach will support Web-based participation among stakeholding planners, who may benefit from well-structured, community-volunteered, geo-located definitions of local spaces.