
Time and Perceptions of Trustworthiness—the Row over a Novel Study

So here you are, head down, truffling along cheerfully towards your morning flat white at the local, lost in thought, wondering what kind of poem Catullus might have written about you, had fortune arranged it so, when some geezer calls out, “cheer up, love, it may never happen.” So infuriating. We make fast and frugal snap judgements about each other all the time and they are often wrong. Much pain in human life is caused by our being over-confident about what she/he meant, intended, thought, or felt. We don’t have direct access to each other’s minds. What we have is language—a frosted or sometimes stained-glass window on to others’ minds—and behaviour. Behaviour includes facial expressions. But their interpretations are error prone. A paper interpreting facial expressions has sparked a recent rumpus.

A September 2020 paper in the prestigious journal Nature Communications has been savaged on Twitter. Small potatoes to those who don’t use the platform, but the authors received tens of thousands of hateful, jeering, or abusive comments that attacked their work, intentions, and characters. The last author, Nicolas Baumard, deleted his Twitter account because of the nastiness. The journal posted “this paper is subject to criticisms that are being considered by the editors.” This sounds ominous especially since we have seen much recent evidence of institutions caving in to criticisms in a way that seems dishonest. I suspect that institutional statements often reflect a desire to quell complainants rather than reflecting the private views of individual decision-makers.

So this must be an extraordinarily stressful and disturbing time for the authors who had reported their innovative approach to a social science question about whether people trust each other more now than they used to, and whether any rise in trust is linked with a rise in prosperity.

Here is my brief analysis of what the authors wanted to know, how they tried to find out, and what their results were.

The research question: The authors are interested in how trust evolves among people. They asked, can we find out whether people trust each other more now than they once did?

The data: Not having a time-machine, which would have made the whole enterprise a lot easier, they lit on the idea of using historical portraits which could be assessed for perceived trustworthiness. They chose portraits of Europeans because the images were readily accessible. They used an initial sample of 1,962 portraits painted between 1505 and 2016, and a replication sample of 4,106 painted between 1360 and 1918.

The key variable: Not trustworthiness, but contemporary assessments of perceived trustworthiness. Trustworthiness describes our behaviour (it is closer to being a stable trait than a fleeting state). The researchers did not have access to whether or not old Mrs Ivor Fondness-Ferpugs from the 16th century village of Woolpack-in-the-Wold could be relied upon to pay her butcher. Instead, their key dependent variable was “level of trustworthiness as perceived by an algorithm.”

The novel inclusion was to use an algorithm derived from machine learning that had trained on a set of artificial faces or avatars, to generate a network of points that could be imposed on a face. Artificial faces were manipulated to reflect human-rated “trustworthiness.” The machine learning model worked well in explaining much of the key variance between the artificial avatar faces, but when the algorithm was next tested on human contemporary face databases it worked poorly; it was in low agreement with the human raters (r=.22). At this point the researchers could have revised the design of the experiment and asked humans to rate each of the historical portraits for their perceived trustworthiness, instead of continuing with the machine learning tool.
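
To put that figure in perspective, and assuming r here is an ordinary Pearson correlation (the text gives only the value), squaring it gives the share of variance that the algorithm’s scores have in common with the human ratings:

```latex
r = 0.22 \quad\Longrightarrow\quad r^{2} = 0.22^{2} = 0.0484 \approx 5\%
```

On that reading, roughly 95 percent of the variation in the human ratings of real faces is left unaccounted for by the algorithm.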

The algorithm that had been in low agreement with human raters of real faces was then applied to the historical portraits. What was found? Among the key findings, perceptions of trustworthiness increased over time but the effects were very small. Very small effects are worth reporting if they are true. It’s not a good criticism to say of an effect only that it is too small—ask the people who worked on LIGO. What matters is whether or not the small effect is robust. But a small effect that is recovered from a shaky starting point (the low correlation between the human ratings and the algorithm on the training set of human faces) looks undistinguished.

The authors then compare the slope of the change in perceived trustworthiness with various measures such as prosperity as indexed by Gross National Product. They find a small relationship between the slope of increasing prosperity and the slope of increasing perceived trustworthiness over time. They report two null findings: in two further analyses, using other gallery images, the changes in perceived trustworthiness were not linked with the presence of democratic institutions, nor with political democratization. Obviously, people expert in machine learning will bring their richer expertise to bear on this work. There are a couple of things to say about the paper that are more qualitative.

The first concerns the reaction to the paper. I had intended to write something about the Twitter response but it’s such a disheartening task; like watching a bunch of drunk thugs pile in on an innocent passer-by who just walked out of the local library. And Nicolas Baumard and his co-authors are completely innocent of the wrong-doing ascribed to them by the Twitter pile-on which I won’t quote here, it’s too contemptible.

Many tweets accused the authors of racism, yet the evidence is against this criticism. This is a study only of European faces; a sensible choice which needs no defence given the availability of the data. The algorithm which was trained on a set of artificial Caucasian faces was not designed to assess perceived trustworthiness in other populations. It has not been used to assess even perceived trustworthiness in non-European populations, let alone actual trustworthiness which, I repeat, was not the dependent variable.

The authors have been shamed as latter-day phrenologists. Phrenology, invented by Franz Joseph Gall (1758–1828), inspired Cesare Lombroso (1835–1909), who thought that shapes of skulls would reveal criminal tendencies. Lombroso had many wrong-headed ideas and he’s become a byword for a pseudo-scientist, but he wasn’t. He just didn’t know what we know now; scientists in 2150 will laugh at our ignorance too. Lombroso, who claimed to have coined the term criminology, was a penal reformer who proselytised in favour of work programmes to help felons re-integrate. His own ideas about phrenology evolved over the course of his research. Using Lombroso as a stick with which to beat Baumard is misconceived; it’s an argument by false analogy. Gall and Lombroso could have been right. They weren’t. Science moves forward by asking questions, trying out new methods, new models, and testing their utility in understanding phenomena.

When we submit papers to journals, we abide by their prescribed word-counts. Journals want concise reports, not Bleak House. An accidental consequence of this is that scientific papers are often very compressed and too telegraphic for readers to comprehend easily. As Steven Pinker sort of said, a curse upon writers who forget how much they know what they mean! The authors knew what they meant when they wrote about trustworthiness. They did not, perhaps, foresee that readers would stumble over the dependent variable like a badly rucked doormat. Clear writing is too often treated like a garnish. It ain’t. It is a main course. To stress the point: the “it” that was being assessed in the historical portraits was not trustworthiness, but perceived trustworthiness. Such perceptions are ineluctably saturated with the culture of 20th and 21st century raters. It is important to remember this because raters looking at portraits created 500 years ago are looking through a glass darkly.

This bears on the art history aspect of the study. Portraits are prestige objects. What the patron, the sitter, and the painter intend to accomplish has varied tremendously over the last 500 years. What would a contemporaneous rater from the 16th century have said about the perceived trustworthiness of a face in a 19th century portrait? You can just imagine Carpaccio’s next-door neighbour looking askance at Singer Sargent’s Madame X noting “Hussy, I don’t trust her an inch” on the rating slip. The meaning of facial expressions, postures, and gestures changes over time. That’s why it’s often a cringe to watch historical dramas. Even if the dialogue is passable, it’s hard to avoid modern mannerisms. Movies from Hollywood in the 1940s reveal favoured face shapes that differ from today’s and facial expressions that often look dated. We may be more different from our own antecedents from 500 years ago than we are from our contemporaries on the other side of the world in rating-relevant ways. The passage of time needs to be reckoned with.

I was once a model. It didn’t last long. Long enough, though, to get through the unabridged audio book of Bleak House—the one with Dickens’ saccharine heroine, Esther Summerson. She’s virtuous and all, but after the first 137 hours of her submissive, self-sacrificial virtue you really want her to pimp that bodice and bonnet for an evening out with a lowlife and some gin. I sat for my daughter who, as an MFA sculpture major, had to turn a wet slab into my head.

I learned from this sitting that holding a neutral face is easy as pie for the first few minutes. Then it starts to feel as if every inch of your skin is attached by fine wires to a neutron star somewhere near your feet. It’s like being buried in heavy air. We did it every day for a fortnight, till my hollow cheeks and narrow eyes had been captured in the clay.

That was just a so-called neutral face; kindergarten stuff. Smiling is considerably harder to hold; it sets into a leer. Artists don’t like to paint smiles because they rarely look good. Smiles might look more trustworthy in a selfie today because they can be captured quickly when a sensor switches pixels on and off. A smile in an historical portrait is likely to have been somewhat invented by the artist. Extracting ratings of perceived trustworthiness from such data seems shaky.

So have we become more trustworthy? It’s possible. Using assessments of historical portraits is an ambitious and imaginative attempt to find out. I’m not convinced the study did track historical changes in trustworthiness, but the authors report their results fully and are straightforward about the null results among their nested experiments; they deserve respect and credit for good faith work. Science stalls when we don’t try new approaches. There could be a machine learning tool that throws a net of points over a face which predicts something about a person. The tool in this paper doesn’t look as if it worked well, but it’s a new effort, and was worth a try.

As readers, we are supposed to engage our critical apparatus. But grappling forcefully with adventurous work is different from saying such work should not be done, or that the authors have malign motives. How would one know? Twitter-style impulsive invective is a chilling east wind indeed when it blows through our online discussions. That we cannot rely on courtesy on social and other media platforms suggests we have much work to do in the domain of cultivating real behavioural trustworthiness. A society worth having rests on our willingness to co-operate, to be able to depend a little on the kindness and civility of strangers. There are civil, reasonable comments of support or criticism on Twitter too, but the ratio needs to change. Criticism can be robust, trenchant, hard to swallow, but being mean is qualitatively a different behaviour, surely one that we’d all like to inhibit in ourselves and to see less of in the world at large.

A tweet unconnected with this paper from the computer scientist, startup guru, essayist, and co-founder of Y Combinator, Paul Graham, could serve as a rallying cry for this tolerant, curious viewpoint: “Something I taught my sons… one of the very worst things you can do is to disparage the efforts of someone who’s trying to do something new, and hasn’t gotten very far yet. Conversely, one of the best things you can do is to encourage such people.”


Rosalind Arden is a research fellow at the Centre for Philosophy of Natural and Social Science at the London School of Economics. You can follow her on Twitter @Rosalind_Arden_.


  1. Ya think?

  2. I would argue that anyone becoming offended on someone else’s behalf in relation to this issue is automatically creating greater cause for offence than the researchers in this case. What’s the old saying: with friends like these… The assumption here is that the people being excluded (those from ethnic minority backgrounds) are automatically shrinking violets, incapable of understanding the limitations of a historical study of portraits, and require the services of white saviours to come to their rescue for a perceived or implicit slight.

    I would have thought that was far more offensive than the original offence, but then again, the woke aren’t known for their appreciation of humour, satire and irony, given that many have the habit of taking comedians’ jokes out of context, and representing them in tweets as literal statements.

  3. I’m not familiar with their algorithm or the Twitter storm. However, it seems to me that the results of this research have a low probability of being valid. I understand the point about perception of trustworthiness versus actual trustworthiness. That said, the algorithm and any other rating, human or otherwise, has to be based on what drives the perception today. The authors cannot reliably know what drove the perception over time. Surely, it also changes. Consider visible dental work, or piercings or scars or facial hair or…

  4. I have not read the paper concerned, but if the author’s report of it is accurate it has almost no chance of revealing anything useful. There is the gap between actual perceived trustworthiness and reported perceived trustworthiness, the gap between perceived and actual trustworthiness, a machine-learning-derived algorithm that is poorly correlated with reported trustworthiness, the application of the algorithm to a data set it was not trained on, the use on portraits with a myriad of distortions from technique, style, fashion, etc., and the use on a self-selecting subset (those with portraits, etc.). Then the results from this are compared to other parameters which changed historically to see if there is a correlation! Any parameter that changes more or less monotonically over the time period is bound to correlate with anything else that changes more or less consistently over the same time period. Trying to infer an actual connection is speculative beyond any sensible bounds.

    So, to me, hugely excessive overreach.

    The criticism for racism is completely unfair and if sustained would preclude vast areas of research, anything for which the available data is restricted to particular groups. The paper should have had a handful of comments admiring the ambition but criticising the results and methodology. The authors should be defended against unfair mobbing.
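
The point in comment 4 about trending series can be illustrated with a small simulation. This is only a sketch on invented numbers, not the paper’s data or analysis: the two series, their slopes, the noise level, and the helper names are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(n, slope, noise_sd, rng):
    """A straight-line trend plus independent Gaussian noise (hypothetical data)."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(xs):
    """Period-to-period changes, which remove a linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Two series generated independently of each other; both merely drift upward.
series_a = trending_series(200, slope=1.0, noise_sd=5.0, rng=rng)
series_b = trending_series(200, slope=0.3, noise_sd=5.0, rng=rng)

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

In a run like this the raw levels correlate strongly even though the two series were generated independently, while their period-to-period changes are close to uncorrelated. Whether and how the paper’s authors dealt with this issue is not something the sketch can show.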

  5. I’ve not read the article either, but in studies where characteristics of people now (and the ones perceived as of interest now) are compared with those in history, there is one big mistake to avoid: we are now free to show what we are and to express what we feel, but that’s something rather new, and very western. In history (and still for many people around the globe), you behave, look, and speak as you are supposed to, or as is expected of you in your family or community; otherwise, big problems.

  6. My own machine algorithm detected that the author was British from the opening line. Why can British people write playfully and with humour on a serious subject? In North America the writing is so dreary.

  7. Hahahaha, JS, you are so right here, but why would it be, indeed? Power yes or no?

  8. but being mean is qualitatively a different behaviour, surely one that we’d all like to inhibit in ourselves and to see less of in the world at large

    No Twitter would improve the world at large. Rewarding impulsive invective (by permitting its self-flattering broadcast to “the world”) may one day be seen as the low point it is for technology in society.

    A self-absorbed rant on a streetcorner might reach at most a few dozens of people; but Twitter gives ranters both the chance to be heard by many thousands and the chance to form a self-important and self-righteous ingroup of mobbers.

    “Social” media, like “anti-racism”, is opposite in meaning, for there’s nothing social about Twitter mobbing (just as there’s nothing “antiracist” in tagging a group “racist” based on skin color).

  9. “Small potatoes to those who don’t use the platform, but the authors received tens of thousands of hateful, jeering, or abusive comments that attacked their work, intentions, and characters.”

    Clearly, the Covid lockdown has left too many people with too much time (and hate) on their hands. Let’s just remind ourselves that 80% of Americans (for example) do NOT use Twitter. Those tens of thousands are just a sick minority of a minority.

  10. Here is one on trustworthiness and racism, an anecdote from my experience in Kenya in the 1990s.

    On a visit to my usual bar (the only one in the village) I ordered a crate of beer for my weekend with friends, to be put in the back of my car. When I wanted to pay the barman, I found out I had left my wallet at home. “Hakuna matata” (no problem), he said, “then you pay me tomorrow.”
    The scene was watched by others (as it usually is with a single white among many blacks). Another guy now asked the barman whether he could also take home just 3 bottles without payment, which he would then pay for the next day. “Hapana…” (no way), the barman said immediately, “I can’t start with that.” “But that means you are a racist,” the man yelled (take note here: from one black to another, in favour of a whitey, which by the way is quite usual in African nations, for understandable reasons). “Indeed, you are right,” was his prompt answer, “I am racist, and with very good reason. I know for 100% sure this man will pay me soon” (so he said; 99% would have been better, I thought later), “whereas I don’t trust you to pay me ever, even if I ask you for it daily.” I got a little bit afraid of becoming the subject of a row, but there was no reason for that; the whole bar laughed heartily at this reckless little man. How did he dare to ask, with no chance of a positive answer?
    (I’ve maybe told this anecdote before on Quillette; sorry for that.)

  11. If only more people understood this!

  12. Your criticisms may be valid as applied to a narrow assessment of the specific outcome, but they strike me as missing the point both of the article author and the researchers.

    The researchers pursued a creative investigation into the use of machine learning applied to human faces: what will we find if we apply this ML to portraits across centuries? Let’s use ‘perceived trustworthiness’ as an assessment criterion. Whatever it might produce in regard to actual changes in trustworthiness of people isn’t really the point.

    For example, that they observed a relationship of change - be that relationship up/down/linear/exponential/monotonic/whatever, as opposed to observing random noise - is itself interesting and can give rise to other questions and investigations: is it due to artifacts in the ML (in which case we might improve the ML), is it due to changes in artistic presentation (in which case something subtle about art expression may be uncovered), etc.

    To my understanding, this is the author’s point: that this sort of creative investigation is stifled by the mobbing by the ridiculous people (in which the mob has presumed both the ‘hidden intent’ of the researchers and nefarious social outcomes of the research) and the institutional caving and cowardice in appeasement of the mob.

  13. Your points are not mutually exclusive though. I agree the mobbing is counterproductive and ridiculous. I also agree that what they “learned” about perceived trustworthiness over time is meaningless because they could only use current, subjective criteria to evaluate perceived norms from centuries ago.

  14. The notion (common these days) that phrenology has no basis in science may not be true. I posted the comments below some time ago.

    GJ, I am somewhat familiar with the digit ratio literature. Yes, it is quite real. You might be surprised to learn the recent literature on phrenology is also quite supportive of phrenology. Not skull bumps of course, but somewhat close. See “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images” and “Automated Inference on Criminality using Face Images”.

    The first result didn’t surprise me (or the authors) at all. The dominant theory of homosexuality amounts to “we were born that way” (prenatal hormone exposure). The notion that this (prenatal hormone exposure) would be detectable in adults isn’t (or shouldn’t be) surprising.

    The second result (criminality) is astounding (to me). For the record, the Chinese researchers who found the second result were more than slightly surprised as well. What they (eventually) found was that the AI was picking up on asymmetry. There is a considerable literature suggesting that asymmetry is associated with all manner of ills (figuratively and literally) and apparently criminality as well.
