With “big data” come big risks

Cartoon showing people considering crossing the valley of big data

Prebabble: Sound research is backed by the scientific method; it’s measurable, repeatable and reasonably consistent with theory-based hypotheses. Data analysis is a component of scientific research but is not scientific by itself. This article provides examples of how research or summary conclusions can be misunderstood through the fault of either the reviewer or the researcher, especially when big data are involved. It is not specific to psychological research, nor is it a comprehensive review of faulty analysis or big data.

When I was a grad student (and dinosaurs trod the earth), four terminals connected to a mainframe computer were the only computational resources available to about 20 psychology grad students. “Terminal time” (beyond the sentence that was graduate school) was as precious and competitively sought after as a shaded parking spot in the summer. (I do write from the “Sunshine State” of Florida.)

Even more coveted than time at one of the terminals, data from non-academic sources were incredibly desirable and much harder to come by. Gaining access to good organizational data was the “holy grail” of industrial-organizational psychology dissertations. Whenever data were made available, one was not about to look this gift horse in the mouth without making every effort to find meaningful research within those data. Desperate but crafty grad students could wrench amazing research from rusty data.

But some data are rusted beyond repair.

One of my cell-, I mean, classmates came into possession of a very large organizational database. Ordinarily the envy of those of us without data, such was not the case here. It was well known that this individual’s data, though big, were hollow; a whole lot of “zeroes.” To my surprise and concern, this individual seemed to be merrily “making a go of it” with their impotent data. Once convinced that they were absolutely going to follow through with a degree-eligible study (that no one “understood”), sarcasm got the best of me: “Gee, Jeff (identity disguised), you’ve been at it with those data for some time. Are any hypotheses beginning to shake out of your analyses?”

“Working over” data in hope of finding a reasonable hypothesis is a breach of proper research and clearly unethical whether one knows it or not. But it happens – more today than ever before.

"Big data" has become the Sirens’ song, luring unwitting (like my grad school colleague) or unscrupulous prospectors in search of something, anything, statistically significant. But that’s not the way science works. That’s not how knowledge is advanced. That’s just “rack-n-hack” pool where nobody “calls their shots.”

It isn’t prediction if it’s already happened.

The statistical significance (or probability) of any “prediction” made in relation to an already-known outcome is always perfect (hence, a “foregone conclusion”). This is also the source of many a superstition. Suppose you win the lottery by betting on your boyfriend’s prison number. To credit your boyfriend’s “prison name” for your winnings would be a mistake (and not just because he may claim the booty). Neither his number nor your choice of it had any influence in determining the outcome, even though you did win. But if we didn’t care about “calling our shots,” we’d argue for the impossibly small odds of your winning ticket as determined by your clever means of its choice.

This error of backward reasoning is also known by the Latin phrase, post hoc, ergo propter hoc, or, “after this, therefore because of this.” It’s not valid to infer a cause from its effect. Unfortunately, the logic may be obvious, but the practice isn’t.

Sophisticated statistical methods can confuse even well-intended researchers who must decide which end of the line to put an arrow on. In addition, the temptation to “rewind the analysis” by running a confirmatory statistical model (i.e., a “calling my shot” analysis) AFTER a convenient exploratory finding (i.e., “rack-n-hack” luck) can be irresistible when one’s career is at stake, as is frequently the case in the brutal academic world of “publish or perish.” But doing this is more than unprofessional; it’s cheating and blatantly unethical. (Don’t do this.)
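To see how easily “rack-n-hack” prospecting turns pure noise into “findings,” here is a minimal Python sketch. It is purely illustrative; the sample size, cutoff, and variable names are my own assumptions, not drawn from any real study. It generates an outcome and 100 predictors that are all random noise, then counts how many predictors nonetheless clear the conventional p < .05 bar:

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external libraries)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)
n = 200  # hypothetical number of "subjects"

# An outcome that is nothing but random noise
outcome = [random.gauss(0, 1) for _ in range(n)]

# Approximate two-tailed p < .05 cutoff for a correlation with n samples
r_crit = 1.96 / sqrt(n)

# 100 predictors of pure noise: NONE has any real relationship to the outcome
false_hits = 0
for _ in range(100):
    predictor = [random.gauss(0, 1) for _ in range(n)]
    if abs(pearson(predictor, outcome)) > r_crit:
        false_hits += 1

print(f"'Significant' predictors found in pure noise: {false_hits} of 100")
```

By construction, none of these predictors is related to the outcome, yet roughly five in a hundred will look “statistically significant” by chance alone. Report only those few as if you had predicted them in advance, and you have cheated in exactly the way described above.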

Never before has the possibility of bad research making news been so great. Massive datasets are flung about like socks in a locker room. Sophisticated analyses that once required an actual understanding of the math in order to do the programming can now be done as easily as talking to a wish-granting hockey puck named “Alexa.” (“What statistical assumptions?”) Finally, publishing shoddy “research” results to millions of readers is as easy as snapping a picture of your cat.

All of the aforementioned faux pas (or worse) concern data “on the table.” The greatest risk when drawing conclusions from statistical analyses, no matter how “big” the data are, is posed by the data that AREN’T on the table.

A study may legitimately find a statistically significant difference in children’s grades based on time spent watching TV vs. playing outdoors. The study may conclude, “When it comes to academic performance, children who play outside significantly outperform those who watch TV.” While this is a true conclusion, the causality of the finding is uncertain.

To further complicate things, cognitive biases work their way into the hornet’s nest of correlation vs. causation. In an effort to simplify the burden on our overworked brains, correlation and causation tend to get thrown together in our “cognitive laundry bin.” Put bluntly, to our busy brains, correlation is causation.
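The TV-vs-outdoors example can be simulated to show why a correlation may be perfectly real while the causation is not. In this Python sketch (an illustration under made-up numbers, not real data), a hypothetical third variable, “parental involvement,” drives both outdoor play and grades, while outdoor play has zero direct effect on grades:

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external libraries)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(7)
n = 1000  # hypothetical number of children

# The confounder: an invented "parental involvement" score
involvement = [random.gauss(0, 1) for _ in range(n)]

# Outdoor play and grades BOTH depend on involvement, plus independent noise.
# Crucially, outdoor play has NO direct effect on grades in this simulation.
outdoor_play = [z + random.gauss(0, 1) for z in involvement]
grades = [z + random.gauss(0, 1) for z in involvement]

r = pearson(outdoor_play, grades)
print(f"Correlation between outdoor play and grades: r = {r:.2f}")
```

The printed correlation comes out substantially positive (around 0.5 under these assumed numbers) even though, by construction, changing a child’s outdoor play would do nothing to their grades. The unmeasured confounder, a variable that isn’t on the table, manufactures the association.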

Although it’s easy to mentally “jump track” from correlation to causation, the opposite move, from causation to correlation, is not risky in the same way (causes are, after all, correlated with their effects).

Cigarette makers were “Kool” (can I get in trouble for this?) with labeling that claimed an ‘association’ between smoking and a litany of health problems. They were not-so-Kool with terminology using the word “causes.”

Causal statements trigger a more substantial and lasting mental impression than statements of association. “A causes B” is declarative and signals “finality,” whereas “A is associated with B” is descriptive and signals “probability.” Depending on how a statement of association is positioned, it can very easily evoke an interpretation of causation.

Sometimes obfuscation is the author’s goal; other times it’s an accident or mere coincidence. Either way, the result is misleading (at best) when our eyes for big data are bigger than our stomachs for solid research.

Psychways is owned and produced by Talentlift, LLC.

Psychology by Machine? Not for a While.

Psychology button on computer where "Enter" key should be

Technology can fly planes, drive cars, and, heck, virtually perform remote surgery (pun not intended). Some believe that literally all jobs, even those that involve deeply personal competencies pertaining to psychology, will eventually be performed by technology. For them, if a “machine” isn’t already doing it, just wait. (Note: This is an extreme view.)

Technology is changing the world faster than ever. If Moore’s law holds, its impact will only grow faster over time.

Will technology take my job?

Probably so, and I don’t deny that likelihood for some aspects of psychology as well. But don’t quit yet! If you’ve been around a few years, like I have, it’s likely that technology has already “taken” all or much of the job you had 10 years ago. You’ve simply changed to stay in front of the technological evolution.

What does science say?

A recent study looked at the rise of technology in relation to the probability of it overtaking more than 700 jobs catalogued in O*NET, a public database of jobs and the various knowledge, skills and abilities required for their performance. The researchers (Frey and Osborne, 2013) reasoned that the probability of technology overtaking a given job is closely related to the time it will take for this to occur. As such, they created a rank-ordered list of the probability that each of these 700 jobs will be overtaken by technology within 20 years.

The study is now a few years old but seems to have already made some accurate predictions. For example, you’ve probably received a “robocall,” a task that was once performed by a person.

The crux of the study is in the researchers’ identification of three key job characteristics they refer to as “bottlenecks to computerization.” The degree to which a job encompasses one or more of these “bottlenecks” predicts the probability (and time) required for technology to be able to perform that job. These three bottlenecks include: 1) Fine Perception and Manipulation, 2) Creative Intelligence and 3) Social Intelligence.

Two of these three “bottlenecks” clearly relate to psychology: creative intelligence and social intelligence. But there’s more…

These three bottlenecks were further broken down into seven more discrete tasks. Of these seven tasks, four (a majority) fall under social intelligence, and psychology is integral to social intelligence.

The practical implication is that if your job requires you to “read” people or influence them, particularly in emotional ways, you’re likely safe from seeing a robot at your desk one morning anytime soon.

Specifically, the study predicts that social workers, therapists and teachers should have relatively long careers as far as “automation threat” is concerned. Psychologist is also in the top 20 of the 700 jobs ranked according to the difficulty of automation.

Although this research is new, the issue isn’t. Psychological assessment has long been a topic of technological debate: Can a personality assessment alone more accurately predict behavior than an expert in psychological assessment?
