Resources

Code

Style-based credibility classifiers

Two classifiers were developed for detecting low-credibility online content (such as fake news) based on writing style:

  • using a deep neural network,
  • using stylometric features.

The code for both classifiers is available on GitHub, and a detailed description can be found in the article.
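To give a rough idea of what "stylometric features" means, the sketch below computes a few generic style measurements from raw text. It is only an illustration: the function name stylometric_features and the chosen features are assumptions made here, not the feature set used by the classifiers described in the article.

```python
import re


def stylometric_features(text):
    """Compute a handful of generic, content-independent style features.

    The feature set is hypothetical and chosen only for illustration.
    """
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenisation: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "exclamation_ratio": text.count("!") / (len(text) or 1),
        "uppercase_word_ratio": sum(
            1 for w in words if len(w) > 1 and w.isupper()
        ) / n_words,
    }
```

A feature dictionary like this could be passed to any standard classifier; which features actually carry signal for credibility is an empirical question addressed in the article, not by this sketch.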

Plainifier

Plainifier is a solution for multi-word lexical simplification, using the TerseBERT language model to recursively generate replacement candidates that fit in a given context. These candidates are ranked according to the following criteria:

  • Probability, i.e. likelihood according to the language model,
  • Similarity, i.e. how closely the meaning of the generated fragment resembles that of the original one, measured by the cosine distance between token embeddings,
  • Familiarity, i.e. how commonly used the included words are, according to frequency in a large corpus.

The Plainifier code is available on GitHub. More information can be found in the publication.
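The three ranking criteria can be illustrated with a small sketch. How Plainifier actually combines them is not stated here, so the rank-sum aggregation below is an assumption, as are the Candidate class, its fields, and the choice of the rarest word's frequency as the familiarity score.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    log_prob: float          # hypothetical language-model log-likelihood
    embedding: List[float]   # hypothetical embedding of the fragment
    word_freqs: List[float]  # hypothetical relative corpus frequency per word


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def familiarity(word_freqs):
    # Assumption: a fragment is only as familiar as its rarest word.
    return min(word_freqs)


def rank_candidates(candidates, original_embedding):
    """Order candidates by the sum of their ranks on the three criteria."""

    def ranks(score):
        ordered = sorted(candidates, key=score, reverse=True)
        return {c.text: position for position, c in enumerate(ordered)}

    by_prob = ranks(lambda c: c.log_prob)
    by_sim = ranks(lambda c: cosine_similarity(c.embedding, original_embedding))
    by_fam = ranks(lambda c: familiarity(c.word_freqs))
    # Lower rank sum means better on average across the criteria.
    return sorted(
        candidates,
        key=lambda c: by_prob[c.text] + by_sim[c.text] + by_fam[c.text],
    )
```

Summing ranks rather than raw scores avoids having to put log-probabilities, cosine values and frequencies on a common scale; a weighted combination would be an equally plausible design.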

BotHunter

BotHunter is a Twitter bot detection solution built for the Bots & Gender Profiling shared task, organised at the PAN workshop at the CLEF 2019 conference. The code can be downloaded from GitHub. More information is available in the publication.

Corpora and datasets

News Style Corpus

The corpus contains 103,219 documents from 18 credible and 205 non-credible sources, selected based on the work of PolitiFact and Pew Research Center. The data was gathered to investigate credibility assessment based on writing style and is available for download from GitHub. More information is available in the article.

MWLS1

MWLS1 (Multi-Word Lexical Simplification dataset 1) is a dataset for lexical simplification (making text easier to understand by replacing some words), in which both the replaced and replacing fragments can consist of multiple words (up to 3). The data is available for download from GitHub. More information can be found in the article.
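The multi-word setting can be made concrete with a short sketch that enumerates every fragment of up to three words in a sentence and swaps one of them for a replacement. The function names fragments and substitute are hypothetical and do not reflect the dataset's file format or any accompanying code.

```python
def fragments(tokens, max_len=3):
    """Yield (start, end, words) for every contiguous span of 1..max_len tokens."""
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            yield start, end, tokens[start:end]


def substitute(tokens, start, end, replacement):
    """Return a new token list with tokens[start:end] replaced."""
    return tokens[:start] + list(replacement) + tokens[end:]
```

For example, substitute("we make use of it".split(), 1, 4, ["use"]) replaces the three-word fragment "make use of" with the single word "use".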

Models

TerseBERT

TerseBERT is a pretrained language model created by fine-tuning BERT. Like a regular language model, TerseBERT can predict which word is most likely in a given context, but it can also predict whether any word is needed there at all. It was created as a component of the text simplification solution described in the article Multi-Word Lexical Simplification. More information about downloading the model and generating it from scratch can be found on GitHub.
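The idea of predicting that no word is needed can be sketched with a toy example. How TerseBERT represents the "no word" outcome is not described here, so the [EMPTY] marker, the hand-written probability tables and the function names below are all assumptions for illustration; no real model is loaded.

```python
from typing import Dict, List, Optional

MASK = "[MASK]"
EMPTY = "[EMPTY]"  # hypothetical marker meaning "no word is needed here"


def predict_gap(distribution: Dict[str, float]) -> Optional[str]:
    """Return the most likely word, or None if the empty marker wins."""
    best = max(distribution, key=distribution.get)
    return None if best == EMPTY else best


def fill_gaps(tokens: List[str], distributions: List[Dict[str, float]]) -> List[str]:
    """Fill each masked position, dropping it when no word is predicted.

    `distributions` holds one (hypothetical) probability table per mask,
    in left-to-right order.
    """
    remaining = iter(distributions)
    output = []
    for token in tokens:
        if token != MASK:
            output.append(token)
            continue
        word = predict_gap(next(remaining))
        if word is not None:
            output.append(word)
    return output
```

With a model of this kind, a generator can probe a gap of several positions and let the model decide how many of them to fill, which is what makes replacements of varying length possible.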