[PS-3.52] Predicting crowdsourced idiomatic flexibility judgments from corpus-based statistics

Senaldi, M. S. ¹,², Lebani, G. E. ² & Lenci, A. ²

1 Scuola Normale Superiore, Pisa, Italy
2 University of Pisa, Pisa, Italy

Corpus-based and human-elicited data can capture the formal and semantic idiosyncrasy of idiomatic expressions from complementary perspectives, but do they agree, and to what extent? To find out, we first extracted 54 Italian idioms from the La Repubblica corpus, using distributional semantics to measure their compositionality and Shannon entropy to quantify the morphosyntactic flexibility of their verbs and arguments. Participants in a CrowdFlower questionnaire then rated, on a 1-7 acceptability scale, sentences containing the same 54 idioms in different syntactic variants (base form, adverb insertion, adjectival modification, left dislocation and wh-movement). Hierarchical regression was used to predict the crowdsourced ratings from our corpus-based indices (including frequency), and Principal Component Analysis was carried out on the predictors to avoid multicollinearity. The explained variance increased significantly both when the argument-related entropic PCs were entered into the models after the frequency and verb-related entropic PCs (adjusted R-squared change = 0.368, p < 0.001) and when the order of entry was reversed (adjusted R-squared change = 0.148, p < 0.001). The best-fitting model was thus a linear combination of frequency and verb-related entropic PCs and argument-related entropic PCs (adjusted R-squared = 0.547, F(3, 50) = 22.32, p < 0.001).
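
As a rough illustration of two quantities mentioned above, the sketch below computes Shannon entropy over the forms in which an idiom constituent is attested, and checks that the reported adjusted R-squared is consistent with the reported F statistic. The abstract does not say which morphosyntactic features the entropy was computed over, so the feature labels and counts here are hypothetical, as are the function names; this is not the authors' pipeline.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of the observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    # p * log2(1/p) summed over all attested categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def r2_from_f(f_value, df_model, df_resid):
    """Recover R-squared from an overall F test: F = (R2/df_model) / ((1-R2)/df_resid)."""
    return f_value * df_model / (f_value * df_model + df_resid)


def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical corpus attestations of an idiom's noun argument (ASSUMED data):
# a frozen idiom always occurs in one form, a flexible one varies.
rigid = ["sg"] * 8
flexible = ["sg", "pl"] * 4

print(shannon_entropy(rigid))      # lowest possible entropy: a single form
print(shannon_entropy(flexible))   # two equiprobable forms

# Consistency check on the reported model: 54 idioms, 3 predictors, F(3, 50) = 22.32
r2 = r2_from_f(22.32, 3, 50)
print(round(adjusted_r2(r2, n=54, k=3), 3))
```

Higher entropy means the constituent is spread more evenly over more forms, which is the sense in which it can serve as a flexibility index. The last line rounds to the adjusted R-squared reported in the abstract.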