Last week Mats Rooth and I were awarded a Digging into Data grant by the NSF and SSHRC for the project “Harvesting Speech Datasets for Linguistic Research on the Web”, one of eight projects that were funded. We are working on this project together with a Cornell graduate student, Jonathan Howell.
Here’s a short description: “Our project develops tools for harvesting data from large-scale speech repositories online in order to gain a better understanding of regularities in the acoustic signal that we all use every day when processing speech. Naturally occurring speech is much more variable than the data we obtain in the lab, so we have to check whether our lab results are valid even in the noisy real world. But certain linguistic constructions that are interesting for theory-testing are often too rare to be found in sufficient numbers in existing speech corpora. We want to increase the scale of the speech data so researchers can complement research in the lab by harvesting relevant spontaneous data from online sources.”
See also the article in the Cornell Chronicle here.