Dataset Details
Dataset Description
- Curated by: Dipankar Srirag, Aditya Joshi, Jordan Painter, and Diptesh Kanojia
- Funded by: Google Research Scholar grant
- Shared by: UNSW-NLP Group
- Language: English
- Varieties: Australian English, Indian English, British English
Dataset Sources
Uses
Direct Use
BESSTIE is designed for:
- Sentiment classification across three English varieties: Australian (en-AU), Indian (en-IN), and British (en-UK).
- Sarcasm classification, particularly focusing on the challenges posed by dialectal and cultural variation in these language varieties.
- Evaluating LLM performance across inner-circle (en-AU, en-UK) and outer-circle (en-IN) English varieties.
- Cross-variety and cross-domain generalisation studies for fairness and bias in NLP models (see the evaluation sketch after this list).
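As a minimal illustration of the cross-variety evaluation use case, the sketch below scores one set of predictions separately on each variety's test split and reports macro-F1. The dataset ID, configuration names, column names, and the predict helper are assumptions for illustration only; consult the official repository for the published identifiers.

```python
from datasets import load_dataset
from sklearn.metrics import f1_score


def predict(texts):
    """Placeholder for any sentiment classifier (fine-tuned model, prompted LLM, ...)."""
    return [1] * len(texts)  # dummy predictions; swap in real model outputs


for variety in ["en-AU", "en-IN", "en-UK"]:
    # Hypothetical dataset ID and configuration naming scheme.
    test = load_dataset("unswnlporg/BESSTIE", name=f"{variety}_google")["test"]
    preds = predict(test["text"])
    print(variety, "macro-F1:", f1_score(test["sentiment"], preds, average="macro"))
```

Comparing the per-variety scores in this way makes performance gaps between inner-circle and outer-circle varieties directly visible.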
Out-of-Scope Use
- Not intended for real-time sentiment monitoring in production without adaptation.
- Should not be used for individual profiling or surveillance, especially given the personal nature of some data.
- Not suitable for non-English or non-dialect-specific sentiment/sarcasm studies, as it only targets three English varieties.
- Not ideal for studying neutral sentiment detection, as neutral labels were discarded during annotation.
Dataset Structure
- Languages: English (Australian, Indian, British varieties)
- Data Sources: Google Places reviews (formal), Reddit comments (informal)
- Tasks: Sentiment classification (binary), Sarcasm classification (binary)
- Splits: Train, Validation, Test
- Label Distribution: Stratified for class balance (a loading sketch follows this list)
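A minimal loading sketch, assuming the dataset is published on the Hugging Face Hub; the dataset ID, configuration name, and column names shown are placeholders rather than confirmed identifiers.

```python
from datasets import load_dataset

# Hypothetical dataset ID and configuration name (one subset per variety/domain pair).
ds = load_dataset("unswnlporg/BESSTIE", name="en-AU_google")

train, validation, test = ds["train"], ds["validation"], ds["test"]

# Binary labels: 1 = positive / sarcastic, 0 = negative / non-sarcastic.
print(train.features)  # inspect the actual column names and label schema
print(train[0])        # e.g. {"text": "...", "sentiment": 1, "sarcasm": 0}
```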
Dataset Creation
Curation Rationale
BESSTIE fills a gap in current NLP benchmarks by providing a labelled dataset for sentiment and sarcasm across varieties of English, which are often overlooked or underrepresented in LLM evaluation.
Source Data
Data Collection and Processing
- GOOGLE reviews: Collected via Google Places API using location-based filtering in specific cities. Reviews with ratings of 2 or 4 stars were selected.
- REDDIT comments: Collected using subreddit-based filtering for relevant regional discussions.
- Language and variety detection: fastText was used for English detection; a DistilBERT model fine-tuned on the ICE corpora was used for variety prediction (see the filtering sketch after this list).
- Filtering: Tourist-attraction posts and non-English texts were discarded, and the remaining data was anonymised.
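The sketch below shows, under stated assumptions, how the language- and variety-filtering step described above could be reproduced: an off-the-shelf fastText language-identification model supplies P(eng), and a variety classifier supplies P(variety). The checkpoint path, label names, and thresholds are illustrative placeholders; the authors' exact models and cut-offs may differ.

```python
import fasttext
from transformers import pipeline

# Off-the-shelf fastText language-identification model (lid.176.bin) for P(eng).
lang_id = fasttext.load_model("lid.176.bin")
# Hypothetical DistilBERT checkpoint fine-tuned for variety prediction (P(variety)).
variety_clf = pipeline("text-classification", model="path/to/distilbert-ice-varieties")


def keep_sample(text: str, target_variety: str,
                eng_threshold: float = 0.9, variety_threshold: float = 0.5) -> bool:
    """Keep a sample only if it is confidently English and matches the target variety."""
    labels, probs = lang_id.predict(text.replace("\n", " "))
    p_eng = float(probs[0]) if labels[0] == "__label__en" else 0.0  # P(eng)
    if p_eng < eng_threshold:
        return False
    pred = variety_clf(text, truncation=True)[0]                    # P(variety)
    return pred["label"] == target_variety and pred["score"] >= variety_threshold
```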
Who are the source data producers?
- Users of Google Maps and Reddit, writing reviews and comments.
- Demographic identities are inferred from context (location/subreddit), not explicitly recorded.
- The data comes from public sources and was collected in accordance with platform terms of use.
Annotations
Annotation process
- Annotators were native speakers for each variety.
- Annotated both sentiment and sarcasm using detailed guidelines.
- Labels: Positive (1), Negative (0), Discard (2 for uninformative); see the label-handling sketch after this list.
- Sarcasm labeled as sarcastic (1), non-sarcastic (0), or discarded.
- Annotators were paid 22 USD/hour.
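A tiny sketch of how the annotation scheme maps to the released binary labels; the column name and the assumption that discarded items (label 2) are simply dropped are illustrative, since the published splits may already exclude them.

```python
# Raw annotation scheme described above: 1 = positive/sarcastic,
# 0 = negative/non-sarcastic, 2 = discard (uninformative).
DISCARD = 2

annotations = [
    {"text": "Loved the service!", "sentiment": 1},   # hypothetical examples
    {"text": "asdfgh", "sentiment": DISCARD},
]

# Drop discarded items to recover the binary label space.
binary = [ex for ex in annotations if ex["sentiment"] != DISCARD]
```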
Who are the annotators?
- Three annotators, including two co-authors, who are native speakers of the respective language varieties.
Personal and Sensitive Information
- User-identifying information (e.g., usernames) was discarded.
- The dataset does not include personal identifiers or health/financial data.
- Sarcasm and sentiment content may contain strong opinions but not sensitive information per se.
Bias, Risks, and Limitations
- Bias: Dataset may reflect regional or platform-specific biases (e.g., Reddit skew).
- Linguistic variation: en-IN contains more dialectal features and code-mixing, leading to lower model performance.
- Annotation subjectivity: Sentiment and sarcasm are inherently subjective.
- Limited representation: Focus on three national varieties may exclude broader English dialect diversity.
Recommendations
Users should consider:
- Evaluating model generalisability across varieties and domains.
- Using caution when applying the dataset to high-stakes applications.
- Applying fairness-aware training methods to mitigate observed biases, especially toward outer-circle English like en-IN.
Citation
BibTeX:
@misc{srirag2025besstie,
title={BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English},
author={Dipankar Srirag and Aditya Joshi and Jordan Painter and Diptesh Kanojia},
year={2025},
eprint={2412.04726},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
APA:
Srirag, D., Joshi, A., Painter, J., & Kanojia, D. (2025). BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English. arXiv preprint arXiv:2412.04726.
Glossary
- Inner-circle varieties: English varieties spoken natively (e.g., en-AU, en-UK).
- Outer-circle varieties: English used as a second language (e.g., en-IN).
- P(eng): Probability that a sample is in English (from fastText).
- P(variety): Probability that a sample matches the intended English variety (from a DistilBERT-based predictor).
More Information
For dataset access, code, and models, see the [official repository link] (to be added when published).
Dataset Card Authors
Dipankar Srirag, Aditya Joshi, Jordan Painter, Diptesh Kanojia