The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. One sub-corpus consists mainly of academic discussions such as faculty council meetings and committee meetings related to testing. The second sub-corpus contains transcripts of White House press conferences, which are almost exclusively question-and-answer sessions.
The transcripts making up the spoken American corpus were selected because they are relatively unedited. However, since they were not produced by linguists, the transcripts lack some features one might wish for, such as backchannels, pause length, and overlap. The corpus is annotated with the CLAWS 7 part-of-speech tagset.
This CD-ROM includes both the tagged and untagged versions of the corpus. You can get a good idea of the content of the untagged version by carrying out a search at www.monoconc.com.