Dataset Details

Dataset Description

  • Curated by: Dipankar Srirag, Aditya Joshi, Jordan Painter, and Diptesh Kanojia
  • Funded by: Google Research Scholar grant
  • Shared by: UNSW-NLP Group
  • Language: English
  • Varieties: Australian English, Indian English, British English

Dataset Sources

Uses

Direct Use

BESSTIE is designed for:

  • Sentiment classification across three English varieties: Australian (en-AU), Indian (en-IN), and British (en-UK).
  • Sarcasm classification, particularly focusing on the challenges posed by dialectal and cultural variation in these language varieties.
  • Evaluating LLM performance across inner-circle (en-AU, en-UK) and outer-circle (en-IN) English varieties.
  • Cross-variety and cross-domain generalisation studies for fairness and bias in NLP models.
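
Evaluating performance across varieties, as described above, typically means reporting metrics per variety rather than pooled. A minimal sketch using hypothetical predictions and gold labels (the variety codes match the card; the data and per-variety accuracy metric are illustrative only):

```python
from collections import defaultdict

# Hypothetical (variety, gold label, predicted label) triples.
results = [
    ("en-AU", 1, 1), ("en-AU", 0, 0),
    ("en-IN", 1, 0), ("en-IN", 0, 0),
    ("en-UK", 1, 1), ("en-UK", 0, 1),
]

correct = defaultdict(int)
total = defaultdict(int)
for variety, gold, pred in results:
    total[variety] += 1
    correct[variety] += int(gold == pred)

# Per-variety accuracy exposes gaps that a pooled score would hide.
accuracy = {v: correct[v] / total[v] for v in total}
print(accuracy)  # {'en-AU': 1.0, 'en-IN': 0.5, 'en-UK': 0.5}
```

Reporting per variety makes inner- vs outer-circle performance gaps directly visible, which is the point of cross-variety evaluation.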

Out-of-Scope Use

  • Not intended for real-time sentiment monitoring in production without adaptation.
  • Should not be used for individual profiling or surveillance, especially given the personal nature of some data.
  • Not suitable for non-English or non-dialect-specific sentiment/sarcasm studies, as it only targets three English varieties.
  • Not ideal for studying neutral sentiment detection, as neutral labels were discarded during annotation.

Dataset Structure

  • Languages: English (Australian, Indian, British varieties)
  • Data Sources: Google Places reviews (formal), Reddit comments (informal)
  • Tasks: Sentiment classification (binary), Sarcasm classification (binary)
  • Splits: Train, Validation, Test
  • Label Distribution: Stratified for class balance
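
The stratified splits described above can be sketched as a per-label split that preserves class balance. This is an illustrative pure-Python version, not the actual split procedure or sizes used for BESSTIE (the toy data, fractions, and seed are assumptions):

```python
import random

def stratified_split(samples, labels, test_frac, seed=0):
    """Split into (rest, held_out), preserving label proportions.
    A simple sketch; the real pipeline's tooling is not specified here."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    rest, held_out = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        k = int(len(group) * test_frac)  # held-out share of this class
        held_out += [(s, y) for s in group[:k]]
        rest += [(s, y) for s in group[k:]]
    return rest, held_out

texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # balanced binary labels

rest, test = stratified_split(texts, labels, test_frac=0.2)
train, val = stratified_split([s for s, _ in rest], [y for _, y in rest],
                              test_frac=0.125)
print(len(train), len(val), len(test))  # 70 10 20
```

Because the split is done per label, each of the three splits keeps the same positive/negative ratio as the full dataset.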

Dataset Creation

Curation Rationale

BESSTIE fills a gap in current NLP benchmarks by providing a labelled dataset for sentiment and sarcasm classification across varieties of English, which are often overlooked or underrepresented in LLM evaluation.

Source Data

Data Collection and Processing

  • GOOGLE reviews: Collected via Google Places API using location-based filtering in specific cities. Reviews with ratings of 2 or 4 stars were selected.
  • REDDIT comments: Collected using subreddit-based filtering for relevant regional discussions.
  • Language and variety detection: fastText was used for English detection; a DistilBERT model fine-tuned on ICE corpora was used for variety prediction.
  • Filtering: Posts about tourist attractions and non-English texts were discarded, and the retained data was anonymized.

Who are the source data producers?

  • Users of Google Maps and Reddit, writing reviews and comments.
  • Demographic identities are inferred from context (location/subreddit), not explicitly recorded.
  • The data comes from public sources and was collected in accordance with platform terms of use.

Annotations

Annotation process

  • Annotators were native speakers of each variety.
  • Each sample was annotated for both sentiment and sarcasm using detailed guidelines.
  • Sentiment labels: Positive (1), Negative (0), Discard (2, for uninformative samples).
  • Sarcasm labels: Sarcastic (1), Non-sarcastic (0), or Discard.
  • Annotators were paid 22 USD/hour.
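
The label scheme above implies a simple post-processing step: items marked Discard are removed, leaving a binary dataset. A minimal sketch with made-up annotations (the label codes match the card; the example texts are hypothetical):

```python
POSITIVE, NEGATIVE, DISCARD = 1, 0, 2  # codes from the annotation guidelines

annotations = [
    ("Great service!", POSITIVE),
    ("Terrible queue", NEGATIVE),
    ("asdf", DISCARD),  # uninformative, removed before release
]

# Keep only informative samples; the released labels are binary.
dataset = [(text, label) for text, label in annotations if label != DISCARD]
print(dataset)  # [('Great service!', 1), ('Terrible queue', 0)]
```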

Who are the annotators?

  • Three annotators, including two co-authors, who are native speakers of the respective language varieties.

Personal and Sensitive Information

  • User-identifying information (e.g., usernames) was discarded.
  • The dataset does not include personal identifiers or health/financial data.
  • Sarcasm and sentiment content may contain strong opinions but not sensitive information per se.

Bias, Risks, and Limitations

  • Bias: Dataset may reflect regional or platform-specific biases (e.g., Reddit skew).
  • Linguistic variation: en-IN contains more dialectal features and code-mixing, leading to lower model performance.
  • Annotation subjectivity: Sentiment and sarcasm are inherently subjective.
  • Limited representation: Focus on three national varieties may exclude broader English dialect diversity.

Recommendations

Users should consider:

  • Evaluating model generalizability across varieties and domains.
  • Using caution when applying the dataset to high-stakes applications.
  • Applying fairness-aware training methods to mitigate observed biases, especially toward outer-circle English like en-IN.

Citation

BibTeX:

@misc{srirag2025besstie,
      title={BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English}, 
      author={Dipankar Srirag and Aditya Joshi and Jordan Painter and Diptesh Kanojia},
      year={2025},
      eprint={2412.04726},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

APA:

Srirag, D., Joshi, A., Painter, J., & Kanojia, D. (2025). BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English. arXiv preprint arXiv:2412.04726.

Glossary

  • Inner-circle varieties: English varieties spoken natively (e.g., en-AU, en-UK).
  • Outer-circle varieties: English used as a second language (e.g., en-IN).
  • P(eng): Probability that a sample is in English (from fastText).
  • P(variety): Probability that a sample matches the intended English variety (from DistilBERT-based predictor).

More Information

For dataset access, code, and models, see the [official repository link] (to be added when published).

Dataset Card Authors

Dipankar Srirag, Aditya Joshi, Jordan Painter, Diptesh Kanojia

Dataset Card Contact