Large language models (LLMs) like GPT-4 can identify a person’s age, location, gender and income with up to 85 per cent accuracy simply by analysing their social media posts.
Robin Staab and Marc Vero at ETH Zurich in Switzerland had nine LLMs examine a database of Reddit posts and retrieve identifying information from the way users wrote.
Staab and Vero randomly selected 1500 Reddit user profiles, then narrowed them down to 520 users for whom they could confidently identify attributes such as place of birth, income bracket, gender and location, either from their profiles or from their posts.
When given the posting history of these users, some LLMs were able to identify many of these attributes with a high degree of accuracy. GPT-4 achieved the highest overall accuracy at 85 per cent, while Llama-2-7b, a relatively low-power LLM, was the least accurate model at 51 per cent.
“This tells us that we give out a lot of our personal information on the Internet without thinking about it,” says Staab. “Many people would not assume that one can directly infer their age or location from the way they write, but LLMs are quite capable.”
Sometimes personal information was explicitly stated in the messages. For example, some users post their income on forums offering financial advice. But the AIs also picked up on more subtle cues, like location-specific slang, and could estimate a salary range based on a user’s occupation and location.
Some features were easier for the AIs to discern than others. GPT-4 was 97.8 per cent accurate at guessing gender, but only 62.5 per cent accurate for income.
“We are just beginning to understand how privacy can be affected by the use of LLMs,” says Alan Woodward at the University of Surrey, UK.