AI’s sensitivity to selectional restriction violations

Selectional restrictions, the semantic constraints that a verb places on its arguments, are a striking demonstration of human intelligence. There is no explicit rule for every verb, nor a single clear pattern of use, and the constraints even vary across languages and cultures. Yet humans master them and learn them naturally. For example, humans know that “eat” typically selects edible objects, while in Chinese, “eat” is also sometimes used to mean having “mastered” or “obtained” something. Since humans identify and use these verbs naturally, can an LLM do the same? Do LLMs exhibit sensitivity to selectional restriction violations? Does this sensitivity vary by model size or architecture?

The following experiment investigated whether LLMs exhibit similar semantic expectations. Because LLMs are trained on massive text corpora, their learned distributions may reflect semantic roles and selectional constraints. If LLMs are sensitive to these constraints, expected completions should receive low surprisal and violating completions high surprisal; if they are not, the difference should be indistinct.
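For reference, surprisal here is the negative log probability of a word given its context, measured in bits. A minimal illustration (the probabilities below are made up purely for demonstration):

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 of the model's probability for a word."""
    return -math.log2(prob)

# Hypothetical probabilities for illustration only: an expected completion
# should get a high probability (low surprisal), while a selectional
# violation should get a low probability (high surprisal).
print(surprisal(0.2))    # ~2.32 bits
print(surprisal(0.001))  # ~9.97 bits
```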

Three LLMs were used in the experiment: GPT-2 small (124M parameters), a baseline autoregressive LM; GPT-2 medium (355M parameters), with larger contextual capacity; and Pythia-70M (70M parameters), trained on a different corpus. All models were loaded using HuggingFace’s transformers library. Twelve minimal pairs (24 stimuli in total) were created, each consisting of a sentence prefix, an expected completion, and an anomalous completion. The stimuli were constructed to contrast valid selectional relations with violations involving verb-object semantic mismatch, agent-instrument implausibility, animacy violations, and physical impossibility.
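A minimal sketch of how this setup might look with HuggingFace Transformers; the model identifiers correspond to the three models above, but the example stimuli are hypothetical stand-ins for the actual items:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAMES = ["gpt2", "gpt2-medium", "EleutherAI/pythia-70m"]

# Load each tokenizer/model pair once and keep the models in evaluation mode.
models = {}
for name in MODEL_NAMES:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()
    models[name] = (tokenizer, model)

# Each minimal pair shares a prefix and differs only in the completion.
# These two items are illustrative, not the experiment's actual stimuli.
stimuli = [
    {"prefix": "The child ate the", "expected": "apple", "anomalous": "theory"},
    {"prefix": "The gardener watered the", "expected": "flowers", "anomalous": "honesty"},
    # ... 10 more pairs for a total of 12 minimal pairs / 24 stimuli
]
```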

During the experiment, for each model and stimulus pair, the next-word probability of the target completion given the prefix was computed and converted to surprisal, so that more unexpected completions yield higher surprisal. The scoring function handles single-token completions by appending the candidate token to the input sequence and reading off the model’s probability for it.
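A sketch of how the per-completion surprisal might be computed, assuming single-token completions and the `models` and `stimuli` structures from the previous snippet:

```python
import math
import torch

def completion_surprisal(tokenizer, model, prefix: str, completion: str) -> float:
    """Surprisal (in bits) of a single-token completion following a prefix."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    # Leading space so the completion is tokenized as a word-initial token
    # under GPT-2/Pythia BPE vocabularies; single-token completions assumed.
    target_id = tokenizer(" " + completion, add_special_tokens=False).input_ids[0]

    with torch.no_grad():
        logits = model(prefix_ids).logits          # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, -1], dim=-1)
    return -log_probs[target_id].item() / math.log(2)

# Example: the anomalous completion should come out with higher surprisal.
tok, mdl = models["gpt2"]
s_expected = completion_surprisal(tok, mdl, "The child ate the", "apple")
s_anomalous = completion_surprisal(tok, mdl, "The child ate the", "theory")
```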

As a result, anomalous completions that violated selectional restrictions received significantly higher surprisal than the expected completions across all models, demonstrating the LLMs’ sensitivity to such violations. The graphs from the experiment show a clear surprisal gap between the expected and anomalous conditions within each model. Comparing models, GPT-2 medium shows the largest separation between expected and anomalous surprisal values; GPT-2 small shows a moderate difference; and Pythia-70M shows the lowest surprisal values overall but still a distinguishable difference. This suggests that larger models exhibit stronger semantic discrimination, indicating greater capacity for abstract semantic generalization.
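The per-model comparison could be produced by averaging surprisal per condition and looking at the gap, for example (building on the functions sketched above; no actual results are shown here):

```python
# Aggregate mean surprisal per condition for each model, then compare the gap.
for name, (tok, mdl) in models.items():
    expected = [completion_surprisal(tok, mdl, s["prefix"], s["expected"]) for s in stimuli]
    anomalous = [completion_surprisal(tok, mdl, s["prefix"], s["anomalous"]) for s in stimuli]
    gap = sum(anomalous) / len(anomalous) - sum(expected) / len(expected)
    print(f"{name}: mean surprisal gap (anomalous - expected) = {gap:.2f} bits")
```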

Looking further at the outcome of the experiment, the LLMs’ sensitivity to semantic violations reflects distributional learning from language exposure. However, the experiment has several limitations: it uses a small set of stimuli, evaluates only next-token prediction over short English phrases, and semantic sensitivity may vary with the training corpus.

In conclusion, LLMs showed higher surprisal for selectional-restriction violations, demonstrating that semantic expectations can be learned from distributional data. The experiment also suggests that model size improves semantic sensitivity, with larger models separating the conditions more clearly. Although noticeable differences from human performance likely remain, LLMs capture meaningful semantic regularities at the level of behavioral output.
