Time and Perceptions of Trustworthiness—the Row over a Novel Study

Rosalind Arden

So here you are, head down, truffling along cheerfully towards your morning flat white at the local, lost in thought, wondering what kind of poem Catullus might have written about you, had fortune arranged it so, when some geezer calls out, “cheer up, love, it may never happen.” So infuriating. We make fast and frugal snap judgements about each other all the time and they are often wrong. Much pain in human life is caused by our being over-confident about what she/he meant, intended, thought, or felt. We don’t have direct access to each other’s minds. What we have is language—a frosted or sometimes stained-glass window on to others’ minds—and behaviour. Behaviour includes facial expressions. But their interpretations are error prone. A paper interpreting facial expressions has sparked a recent rumpus.

A September 2020 paper in the prestigious journal Nature Communications has been savaged on Twitter. Small potatoes to those who don’t use the platform, but the authors received tens of thousands of hateful, jeering, or abusive comments that attacked their work, intentions, and characters. The last author, Nicolas Baumard, deleted his Twitter account because of the nastiness. The journal posted “this paper is subject to criticisms that are being considered by the editors.” This sounds ominous especially since we have seen much recent evidence of institutions caving in to criticisms in a way that seems dishonest. I suspect that institutional statements often reflect a desire to quell complainants rather than reflecting the private views of individual decision-makers.

So this must be an extraordinarily stressful and disturbing time for the authors who had reported their innovative approach to a social science question about whether people trust each other more now than they used to, and whether any rise in trust is linked with a rise in prosperity.

Here is my brief analysis of what the authors wanted to know, how they tried to find out, and what their results were.

The research question: The authors are interested in how trust evolves among people. They asked, can we find out whether people trust each other more now than they once did?

The data: Not having a time-machine, which would have made the whole enterprise a lot easier, they lit on the idea of using historical portraits which could be assessed for perceived trustworthiness. They chose portraits of Europeans because the images were readily accessible. They used an initial sample of 1,962 portraits painted between 1505 and 2016, and a replication sample of 4,106 painted between 1360 and 1918.

The key variable: Not trustworthiness, but contemporary assessments of perceived trustworthiness. Trustworthiness describes our behaviour (it is closer to being a stable trait than a fleeting state). The researchers did not have access to whether or not old Mrs Ivor Fondness-Ferpugs from the 16th century village of Woolpack-in-the-Wold could be relied upon to pay her butcher. Instead, their key dependent variable was “level of trustworthiness as perceived by an algorithm.”

The novel inclusion was to use an algorithm derived from machine learning that had trained on a set of artificial faces or avatars, to generate a network of points that could be imposed on a face. Artificial faces were manipulated to reflect human-rated “trustworthiness.” The machine learning model worked well in explaining much of the key variance between the artificial avatar faces, but when the algorithm was next tested on human contemporary face databases it worked poorly; it was in low agreement with the human raters (r=.22). At this point the researchers could have revised the design of the experiment and asked humans to rate each of the historical portraits for their perceived trustworthiness, instead of continuing with the machine learning tool.
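To put r=.22 in perspective, here is a back-of-envelope sketch. Squaring a correlation gives the share of variance two measures have in common under a simple linear reading. The helper name `shared_variance` is mine, purely for illustration, and nothing here comes from the paper's own analysis code.

```python
def shared_variance(r: float) -> float:
    """Share of variance two measures have in common under a linear model (r squared)."""
    return r * r


# Agreement between the algorithm and human raters on real faces, as quoted above.
r_reported = 0.22

print(f"r = {r_reported:.2f} -> shared variance = {shared_variance(r_reported):.1%}")
```

On this reading the algorithm and the human raters share a little under 5 percent of their variance on real faces, which is why the starting point looks shaky.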

The algorithm that had been in low agreement with human raters of real faces was then applied to the historical portraits. What was found? Among the key findings, perceptions of trustworthiness increased over time but the effects were very small. Very small effects are worth reporting if they are true. It’s not a good criticism to say of an effect only that it is too small—ask the people who worked on LIGO. What matters is whether or not the small effect is robust. But a small effect that is recovered from a shaky starting point (the low correlation between the human ratings and the algorithm on the training set of human faces) looks undistinguished.

The authors then compare the slope of the change in perceived trustworthiness with various measures such as prosperity as indexed by Gross National Product. They find a small relationship between the slope of increasing prosperity and the slope of increasing perceived trustworthiness over time. They report two null findings: in two further analyses, using other gallery images, the changes in perceived trustworthiness were not linked with the presence of democratic institutions, nor with political democratization. Obviously, people expert in machine learning will bring their richer expertise to bear on this work. There are a couple of things to say about the paper that are more qualitative.

The first concerns the reaction to the paper. I had intended to write something about the Twitter response but it’s such a disheartening task; like watching a bunch of drunk thugs pile in on an innocent passer-by who just walked out of the local library. And Nicolas Baumard and his co-authors are completely innocent of the wrong-doing ascribed to them by the Twitter pile-on, which I won’t quote here; it’s too contemptible.

Many tweets accused the authors of racism, yet the evidence is against this criticism. This is a study only of European faces; a sensible choice which needs no defence given the availability of the data. The algorithm which was trained on a set of artificial Caucasian faces was not designed to assess perceived trustworthiness in other populations. It has not been used to assess even perceived trustworthiness in non-European populations, let alone actual trustworthiness which, I repeat, was not the dependent variable.

The authors have been shamed as being latter day phrenologists. Phrenology, invented by Franz Joseph Gall (1758–1828), inspired Cesare Lombroso (1835–1909) who thought that shapes of skulls would reveal criminal tendencies. Lombroso had many wrong-headed ideas and he’s become a byword for a pseudo-scientist, but he wasn’t. He just didn’t know what we know now; scientists in 2150 will laugh at our ignorance too. Lombroso, who claimed to have coined the term criminology, was a penal reformer who proselytised in favour of work programmes to help felons re-integrate. His own ideas about phrenology evolved over the course of his research. Using Lombroso as a stick with which to beat Baumard is misconceived; it’s an argument by false analogy. Gall and Lombroso could have been right. They weren’t. Science moves forward by asking questions, trying out new methods, new models, and testing their utility in understanding phenomena.

When we submit papers to journals, we abide by their prescribed word-counts. Journals want concise reports, not Bleak House. An accidental consequence of this is that scientific papers are often very compressed and too telegraphic for readers to comprehend easily. As Steven Pinker sort of said, a curse upon writers who forget how much they know what they mean! The authors knew what they meant when they wrote about trustworthiness. They did not, perhaps, foresee that readers would stumble over the dependent variable like a badly rucked doormat. Clear writing is too often treated like a garnish. It ain’t. It is a main course. To stress the point: what was being assessed in the historical portraits was not trustworthiness, but perceived trustworthiness. Such perceptions are ineluctably saturated with the culture of 20th and 21st century raters. It is important to remember this because raters looking at portraits created 500 years ago are looking through a glass darkly.

This bears on the art history aspect of the study. Portraits are prestige objects. What the patron, the sitter, and the painter intend to accomplish has varied tremendously over the last 500 years. What would a contemporaneous rater from the 16th century have said about the perceived trustworthiness of a face in a 19th century portrait? You can just imagine Carpaccio’s next-door neighbour looking askance at Singer Sargent’s Madame X noting “Hussy, I don’t trust her an inch” on the rating slip. The meaning of facial expressions, postures, and gestures changes over time. That’s why it’s often a cringe to watch historical dramas. Even if the dialogue is passable, it’s hard to avoid modern mannerisms. Movies from Hollywood in the 1940s reveal favoured face shapes that differ from today’s and facial expressions that often look dated. We may be more different from our own antecedents from 500 years ago than we are from our contemporaries on the other side of the world in rating-relevant ways. The passage of time needs to be reckoned with.

I was once a model. It didn’t last long. Long enough, though, to get through the unabridged audio book of Bleak House—the one with Dickens’ saccharine heroine, Esther Summerson. She’s virtuous and all, but after the first 137 hours of her submissive, self-sacrificial virtue you really want her to pimp that bodice and bonnet for an evening out with a lowlife and some gin. I sat for my daughter who, as an MFA sculpture major, had to turn a wet slab into my head.

I learned from this sitting that holding a neutral face is easy as pie for the first few minutes. Then it starts to feel as if every inch of your skin is attached by fine wires to a neutron star somewhere near your feet. It’s like being buried in heavy air. We did it every day for a fortnight, till my hollow cheeks and narrow eyes had been captured in the clay.

That was just a so-called neutral face; kindergarten stuff. Smiling is considerably harder to hold; it sets into a leer. Artists don’t like to paint smiles because they rarely look good. Smiles might look more trustworthy in a selfie today because they can be captured quickly when a sensor switches pixels on and off. A smile in an historical portrait is likely to have been somewhat invented by the artist. Extracting ratings of perceived trustworthiness from such data seems shaky.

So have we become more trustworthy? It’s possible. Using assessments of historical portraits is an ambitious and imaginative attempt to find out. I’m not convinced the study did track historical changes in trustworthiness, but the authors report their results fully and are straightforward about the null results among their nested experiments; they deserve respect and credit for good faith work. Science stalls when we don’t try new approaches. There could be a machine learning tool that throws a net of points over a face which predicts something about a person. The tool in this paper doesn’t look as if it worked well, but it’s a new effort, and was worth a try.

As readers, we are supposed to engage our critical apparatus. But grappling forcefully with adventurous work is different from saying such work should not be done, or that the authors have malign motives. How would one know? Twitter-style impulsive invective is a chilling east wind indeed when it blows through our online discussions. That we cannot rely on courtesy on social and other media platforms suggests we have much work to do in the domain of cultivating real behavioural trustworthiness. A society worth having rests on our willingness to co-operate, to be able to depend a little on the kindness and civility of strangers. There are civil, reasonable comments of support or criticism on Twitter too, but the ratio needs to change. Criticism can be robust, trenchant, hard to swallow, but being mean is qualitatively a different behaviour, surely one that we’d all like to inhibit in ourselves and to see less of in the world at large.

A tweet unconnected with this paper from the computer scientist, startup guru, essayist, and co-founder of Y Combinator, Paul Graham, could serve as a rallying cry for this tolerant, curious viewpoint: “Something I taught my sons… one of the very worst things you can do is to disparage the efforts of someone who’s trying to do something new, and hasn’t gotten very far yet. Conversely, one of the best things you can do is to encourage such people.”
