R. L. Starr

Predicting NPs in Vernacular Written Cantonese


One of the fun things about Cantonese is that it has a number of possible NP constructions (bare noun, classifier-noun, etc.) which may be used for both definite and indefinite reference. Like "there was a guy" vs. "there was this guy" in English, the choice of NP form in Cantonese correlates with pragmatic factors rather than being fixed by strict semantic function. Which discourse factors influence the choice of form? Accessibility theory holds that the form of an NP will reflect the ease with which its referent can be retrieved. I decided to investigate how NP forms may best be predicted, both for new items first being introduced into the discourse and for given items, taking into account factors that have been implicated in accessibility theory, including topicality, syntactic function, frequency, distance from previous reference, and previous NP form.

To investigate the role of discourse factors in predicting NP form, I needed to put together a corpus of vernacular Cantonese. I wanted to look at data in a narrative genre so I could easily track entities as they were repeatedly brought up in the discourse. Since I couldn't locate any existing corpus that was sufficiently vernacular and narrative-ish, I created a new corpus called Mail2Love. The Mail2Love corpus is made up of posts to the Mail2Love message board, a love advice forum hosted on the Yes! magazine website yes.com.hk (Yes! is a Hong Kong lifestyle magazine). The posts I collected from Mail2Love worked very nicely because they were consistent narratives of problems with boyfriends or potential boyfriends, all written in very colloquial HK Cantonese. I hand-tagged every NP in the corpus, including the zeroes, for a bunch of information (distance from previous reference, syntactic role, presence of a modifying clause, etc.). If you'd like to take a look at the corpus, let me know!
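
To give a rough idea of what one tagged NP might look like as a record, here is a minimal Python sketch. It is an illustration only, not the actual Mail2Love annotation format: the class name `Mention`, the function `link_mentions`, every field name, and the choice to count distance in clauses are all assumptions made for the example. The corpus itself was tagged by hand; the function just shows how distance from previous reference and previous NP form could be derived from a sequence of mentions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mention:
    """One tagged NP (hypothetical schema, not the real corpus format)."""
    entity: str                  # label for the referent being tracked
    form: str                    # e.g. "bare", "classifier-noun", "zero"
    clause: int                  # position in the post (assumed unit: clauses)
    syntactic_role: str          # e.g. "subject", "object"
    has_modifying_clause: bool = False
    distance: Optional[int] = None        # None means a new (first-mention) item
    previous_form: Optional[str] = None   # None means a new (first-mention) item


def link_mentions(mentions: List[Mention]) -> List[Mention]:
    """Fill in distance and previous_form for every given (non-first) mention."""
    last_seen: Dict[str, Mention] = {}
    # Walk through the mentions in discourse order, remembering the most
    # recent mention of each entity.
    for m in sorted(mentions, key=lambda x: x.clause):
        prev = last_seen.get(m.entity)
        if prev is not None:
            m.distance = m.clause - prev.clause
            m.previous_form = prev.form
        last_seen[m.entity] = m
    return mentions
```

With records like these, new items are simply the mentions whose `distance` is still `None`, and given items are all the rest, which matches the new/given split described above.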

So far, I've found that topicality is a very good predictor of NP form, but not always in the direction we might expect. For new items, more topical items are more likely to be introduced with a longer NP form. This is not unexpected given previous results in Mandarin (Sun 1988, Li 2000) and the Gernsbacher & Schroyer (1989) study of indefinite 'this'. For given items, however, accessibility theory predicts that more topical items will be associated with less linguistic material; in fact, the opposite is observed in the Mail2Love data. The same pattern is observed for previous NP form: a longer previous form correlates positively with a longer current NP form, rather than the reverse.