Uncovering the semantics of concepts using GPT-4 and other recent large language models
- dc.contributor.author Le Mens, Gaël
- dc.contributor.author Kovács, Balázs
- dc.contributor.author Hannan, Michael T.
- dc.contributor.author Pros, Guillem
- dc.contributor.other Universitat Pompeu Fabra. Departament d'Economia i Empresa
- dc.date.accessioned 2024-11-14T10:10:00Z
- dc.date.available 2024-11-14T10:10:00Z
- dc.date.issued 2023-06-02
- dc.date.modified 2024-11-14T10:08:55Z
- dc.description.abstract Recently, the world's attention has been captivated by Large Language Models (LLMs) thanks to OpenAI's ChatGPT, which rapidly proliferated as an app powered by GPT-3 and now its successor, GPT-4. If these LLMs produce human-like text, the semantic spaces they construct likely align with those used by humans for interpreting and generating language. This suggests that social scientists could use these LLMs to construct measures of semantic similarity that match human judgment. In this article, we provide an empirical test of this intuition. We use GPT-4 to construct a new measure of typicality: the similarity of a text document to a concept or category. We evaluate its performance against other model-based typicality measures in terms of their correspondence with human typicality ratings. We conduct this comparative analysis in two domains: the typicality of books in literary genres (using an existing dataset of book descriptions) and the typicality of tweets authored by US Congress members in the Democratic and Republican parties (using a novel dataset). The GPT-4 typicality measure not only meets or exceeds the current state-of-the-art but accomplishes this without any model training. This is a breakthrough because the previous state-of-the-art measure required fine-tuning a model (a BERT text classifier) on hundreds of thousands of text documents to achieve its performance. Our comparative analysis emphasizes the need for systematic empirical validation of measures based on LLMs: several measures based on other recent LLMs achieve at best a moderate correspondence with human judgments.
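The abstract notes that the GPT-4 typicality measure requires no model training, only querying the model. The sketch below is an illustration only: this record does not reproduce the paper's prompt or scoring protocol, so the prompt wording, the 0-100 rating scale, and the gpt4_typicality helper are assumptions rather than the authors' implementation. It shows one plausible way to elicit a typicality rating for a document with respect to a category via the OpenAI Python client.

```python
# Illustrative sketch only; not the paper's actual prompt or protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt4_typicality(text: str, category: str) -> float:
    """Ask GPT-4 how typical `text` is of `category` on a hypothetical 0-100 scale."""
    prompt = (
        f"On a scale from 0 (not at all typical) to 100 (extremely typical), "
        f"how typical is the following text of the category '{category}'? "
        f"Answer with a single number.\n\nText: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce randomness for a measurement task
        messages=[{"role": "user", "content": prompt}],
    )
    # Naive parsing: assumes the reply is a bare number.
    return float(response.choices[0].message.content.strip())

# Example: typicality of a book description in the 'mystery' genre
# score = gpt4_typicality("A detective untangles a web of secrets...", "mystery")
```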
- dc.format.mimetype application/pdf
- dc.identifier https://econ-papers.upf.edu/ca/paper.php?id=1864
- dc.identifier.citation Proceedings of the National Academy of Sciences (PNAS), 120(49), e2309350120, pp. 1-7. https://doi.org/10.1073/pnas.2309350120
- dc.identifier.uri http://hdl.handle.net/10230/68663
- dc.language.iso eng
- dc.relation.ispartofseries Economics and Business Working Papers Series; 1864
- dc.rights Access to the contents of this document is subject to acceptance of the terms of use established by the following Creative Commons license
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/es/
- dc.subject.keyword categories
- dc.subject.keyword concepts
- dc.subject.keyword deep learning
- dc.subject.keyword typicality
- dc.subject.keyword gpt
- dc.subject.keyword chatgpt
- dc.subject.keyword bert
- dc.subject.keyword similarity
- dc.title Uncovering the semantics of concepts using GPT-4 and other recent large language models
- dc.type info:eu-repo/semantics/workingPaper