Code
LoNER
This repository contains the code for LoNER (Long Named Entity Recognition), our solution for identifying propaganda techniques in text.
The code for building the ML models is available on GitHub. The work is described in the publication.
Style-based credibility classifiers
Two classifiers were developed for detecting low-credibility online content (such as fake news) based on writing style:
- using a deep neural network,
- using stylometric features.
The code for both classifiers is available on GitHub, while their detailed description can be found in the article.
Plainifier
Plainifier is a solution for multi-word lexical simplification, using the TerseBERT language model to recursively generate replacement candidates that fit in a given context. These candidates are ranked according to the following criteria:
- Probability, i.e. likelihood according to the language model,
- Similarity, i.e. how much the generated fragment resembles the meaning of the original one, measured by cosine distance of token embeddings,
- Familiarity, i.e. how commonly used the included words are, according to frequency in a large corpus.
The Plainifier code is available on GitHub. More information can be found in the publication.
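The three ranking criteria can be illustrated with a minimal sketch. It is not the Plainifier implementation: the `Candidate` class, the toy scores and vectors, and the way the criteria are combined (summing the per-criterion ranks) are all assumptions made for illustration, and a single vector per fragment stands in for the token embeddings.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical container for one replacement candidate."""
    text: str
    probability: float      # toy language-model likelihood
    embedding: List[float]  # toy vector standing in for token embeddings
    frequency: float        # toy corpus frequency of the words used


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_positions(values: List[float]) -> dict:
    """Map each index to its rank (0 = best) when sorting values descending."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return {idx: pos for pos, idx in enumerate(order)}


def rank_candidates(original_embedding: List[float],
                    candidates: List[Candidate]) -> List[Candidate]:
    prob = rank_positions([c.probability for c in candidates])
    sim = rank_positions(
        [cosine_similarity(original_embedding, c.embedding) for c in candidates]
    )
    fam = rank_positions([c.frequency for c in candidates])
    # Assumption: the criteria are combined by summing the three ranks;
    # the actual combination used by Plainifier is described in the publication.
    total = [prob[i] + sim[i] + fam[i] for i in range(len(candidates))]
    order = sorted(range(len(candidates)), key=lambda i: total[i])
    return [candidates[i] for i in order]


# Toy example: simplifying the fragment "a great number of".
original = [1.0, 0.0]
candidates = [
    Candidate("a multitude of", 0.3, [1.0, 0.0], 0.1),
    Candidate("many", 0.5, [0.9, 0.1], 0.9),
    Candidate("few", 0.2, [0.0, 1.0], 0.8),
]
ranked = rank_candidates(original, candidates)
print([c.text for c in ranked])
```

In this toy run "many" comes first: it is the most probable and most familiar candidate, and nearly as similar to the original as "a multitude of".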
BotHunter
BotHunter is a Twitter bot detection solution for the Bots & Gender Profiling shared task, organised at the PAN workshop at the CLEF 2019 conference. The code can be downloaded from GitHub. More information is available in the publication.
Finding Reliable Sources
This is the source code accompanying our study on the task of Finding Reliable Sources (FRS). Given a short textual claim (e.g. Smoking tobacco is good for your health.), the challenge of FRS is to recommend a set of reliable sources (e.g. scholarly papers, publications from established institutions) that could support or refute the claim. The code of the solution can be downloaded from GitHub. More information is available in the article, published at IJCNN 2022.
Corpora and datasets
News Style Corpus
The corpus contains 103,219 documents from 18 credible and 205 non-credible sources selected based on the work of PolitiFact and Pew Research Center. The data was gathered to investigate credibility assessment based on writing style and is available for download from GitHub. More information is available in the article.
NOTE: A new and improved version (v2.0) of this corpus was developed to create the Credibilator browser extension and is available in its repository.
MWLS1
MWLS1 (Multi-Word Lexical Simplification dataset 1) is a dataset for lexical simplification (making text easier to understand by replacing some words), in which both the replaced and replacing fragments can consist of multiple words (up to 3). The data is available for download from GitHub. More information can be found in the article.
NLP Geography
This repository contains code and data for a study into the geography of NLP research, answering questions such as: Where do the researchers carry out the work? Where do they present it? What is the environmental cost of travelling to NLP conferences? Do the events attract diverse participation? The anonymised dataset is available from GitHub. The results were presented at the ACL 2022 conference in Dublin and more information can be found in the article.
FRS Evaluation Datasets
This Zenodo repository includes evaluation datasets that can be used to train and test solutions for the problem of Finding Reliable Sources (FRS). Datasets are organised as large collections of individual records, each consisting of (1) a claim expressed as a short text fragment and (2) a list of identifiers (DOI, ISBN, arXiv ID or URL) of reliable sources associated with it. More information is available in the article, published at IJCNN 2022.
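The logical structure of a record can be sketched as follows. This is an illustration only: the on-disk format of the datasets is not described here, so the `FRSRecord` class, the `identifier_kind` helper with its pattern heuristics, and the placeholder identifiers are all assumptions and are not taken from the repository.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class FRSRecord:
    """Hypothetical in-memory form of one record: a claim plus its sources."""
    claim: str
    sources: List[str] = field(default_factory=list)


def identifier_kind(identifier: str) -> str:
    """Guess the identifier type with simple, illustrative patterns."""
    s = identifier.strip()
    if re.match(r"^https?://", s, re.IGNORECASE):
        return "URL"
    if re.match(r"^10\.\d{4,9}/\S+$", s):
        return "DOI"
    if re.match(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", s, re.IGNORECASE):
        return "arXiv ID"
    digits = re.sub(r"[\s-]", "", s)
    if re.match(r"^(\d{9}[\dXx]|\d{13})$", digits):
        return "ISBN"
    return "unknown"


# Placeholder identifiers, not real sources from the datasets.
record = FRSRecord(
    claim="Smoking tobacco is good for your health.",
    sources=["10.1000/example-doi", "https://example.org/report"],
)
kinds = [identifier_kind(s) for s in record.sources]
print(kinds)
```

Keeping the identifiers as plain strings and deriving their type on demand is only one possible design; the published datasets may record the type explicitly.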
Wikipedia Complete Citation Corpus
WCCC is a corpus of citations, references and sources mined from the English Wikipedia, shared on Zenodo. WCCC, containing 4.8 million documents with 50.8 million citations of 24.3 million sources, was created as a knowledge base used in a machine-learning model for recommending reliable sources to support (or refute) a given textual claim, but can be used for many other purposes. More information is available in the article, published at IJCNN 2022.
Models
TerseBERT
TerseBERT is a pretrained language model created by fine-tuning BERT. TerseBERT is able to predict not only which word is most likely in a given context (like a regular language model), but also whether any word is necessary at all. It was created as a component of a text simplification solution described in the article Multi-Word Lexical Simplification. More information about downloading the model and generating it from scratch can be found on GitHub.
Tools
Credibilator
Credibilator is a browser extension that shows the user an automatic credibility assessment of the currently viewed webpage. In the accompanying article we show that users interacting with the extension were more accurate in spotting fake news. You can download the extension from Chrome Web Store and browse the source code on GitHub.