CoNGA: A Corpus of Native Grammar Attrition

Dataset

Description

The Corpus of Native Grammar Attrition (CoNGA) project is an AHRC-funded collaboration between the University of Southampton and the University of York. The project examines potential changes in the native grammars of migrants who have settled in parts of the world where other languages or dialects are spoken. This phenomenon is known as grammatical attrition and can result from extensive contact between two languages/dialects and/or reduced contact with a speaker’s native language/dialect after migration.

The Corpus of Native Grammar Attrition is the first open-access oral corpus of potentially attrited native speech and is freely available through this website.

The corpus consists of recordings of potentially attriting speakers in three different language pairs: 30 L1 Spanish speakers living in the UK (L2 English); 30 L1 German speakers living in the Netherlands (L2 Dutch); and 30 L1a Southern British English speakers living in Belfast, Northern Ireland (L1b Belfast English). The Spanish and German speakers can be considered bilingual, whereas the English speakers from the South of England now living in Belfast may be bidialectal.

Participants were recorded taking part in a standard sociolinguistic interview (e.g. Labov 1984) in their first language, with a trained fieldworker. Interviews lasted between 30 minutes and 1 hour and were completed as part of a suite of tasks designed to investigate potential attrition of specific grammatical phenomena.

Monolingual/monodialectal speakers living in Germany, Spain or the South of England were also recorded to act as controls for each of the language pairs, and their recordings are included in the corpus.

The data for the corpus was collected between July 2021 and June 2022. Due to the Covid-19 pandemic, data was collected remotely over Microsoft Teams.

At least 30 minutes of each interview was transcribed following CHAT transcription conventions (MacWhinney 2000). Identifying information (e.g. names and locations) was removed from the transcript and the audio recording. The transcripts were subsequently part-of-speech tagged using the relevant MOR tagger (MacWhinney 2000) and hand-corrected for accuracy.
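
As a rough illustration of how transcripts in this format might be read, the following Python sketch parses a small CHAT-style sample and counts part-of-speech tags on the %mor tier. The sample text, the speaker codes (PAR, INV), the simplified tags and the function names parse_chat and pos_counts are all invented for illustration; they are not taken from the CoNGA files, whose actual tags and headers may be richer. The sketch relies only on general CHAT layout conventions: header lines begin with @, main tiers with *, dependent tiers with %, and continuation lines with a tab.

```python
from collections import defaultdict

# Invented sample in CHAT layout (NOT taken from CoNGA); tags are simplified.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tPAR Participant, INV Investigator\n"
    "*INV:\twhere did you grow up ?\n"
    "%mor:\tadv|where v|do pro|you v|grow adv|up ?\n"
    "*PAR:\tI grew up in a small town\n"
    "\tnear the coast .\n"
    "%mor:\tpro|I v|grow adv|up prep|in det|a adj|small n|town\n"
    "\tprep|near det|the n|coast .\n"
    "@End\n"
)


def parse_chat(text):
    """Return one dict per utterance with 'speaker', 'text' and 'mor' keys."""
    utterances = []
    current = None  # list receiving continuation lines, if any
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("\t"):
            # Continuation of the previous main or %mor tier.
            if current is not None:
                current.append(line.strip())
            continue
        if line.startswith("@"):
            # Header line: ignored in this sketch.
            current = None
            continue
        label, _, content = line.partition(":")
        content = content.strip()
        if line.startswith("*"):
            utt = {"speaker": label[1:], "text": [content], "mor": []}
            utterances.append(utt)
            current = utt["text"]
        elif line.startswith("%mor") and utterances:
            utterances[-1]["mor"].append(content)
            current = utterances[-1]["mor"]
        else:
            # Other dependent tiers are skipped.
            current = None
    for utt in utterances:
        utt["text"] = " ".join(utt["text"])
        utt["mor"] = " ".join(utt["mor"])
    return utterances


def pos_counts(utterances):
    """Count the tag before '|' in each %mor token; punctuation is skipped."""
    counts = defaultdict(int)
    for utt in utterances:
        for token in utt["mor"].split():
            if "|" in token:
                counts[token.split("|", 1)[0]] += 1
    return dict(counts)


utterances = parse_chat(SAMPLE)
tag_counts = pos_counts(utterances)
```

For real files, the same functions could be applied to the contents of a downloaded transcript read with open(path, encoding="utf-8"), although a dedicated CHAT parser would handle the full format more robustly.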

In total, approximately 75 hours of audio recordings were transcribed, creating a corpus of approximately 750,000 words. The recordings, transcripts and part-of-speech tagged files are all freely available for download.

External deposit with CoNGA, University of Southampton.
Date made available: 2023
Publisher: University of Southampton
Geographical coverage: United Kingdom
