With “big data” come big risks

Prebabble: Sound research is backed by the scientific method; it’s measurable, repeatable and reasonably consistent with theory-based hypotheses. Data analysis is a component of scientific research but is not scientific by itself. This article provides examples of how research or summary conclusions can be misunderstood through the fault of either the reviewer or the researcher – especially when big data are involved. It is not specific to psychological research, nor is it a comprehensive review of faulty analysis or big data.

When I was a grad student (and dinosaurs trod the earth), four terminals connected to a mainframe computer were the only computational resources available to about 20 psychology grad students. “Terminal time” (beyond the sentence that was graduate school) was as precious and competitively sought after as a shaded parking spot in the summer. (I do write from the “Sunshine State” of Florida.)

Even more coveted than time at one of the terminals, data from non-academic sources were incredibly desirable and much harder to come by. Gaining access to good organizational data was the “holy grail” of industrial-organizational psychology dissertations. Whenever data were made available, one was not about to look this gift horse in the mouth; one made every effort to find meaningful research within those data. Desperate but crafty grad students could wrench amazing research from rusty data.

But some data are rusted beyond repair.

One of my cell-, I mean class-, mates came into the possession of a very large organizational database. Ordinarily that would have made them the envy of those of us without data, but such was not the case here. It was well known that this individual’s data, though big, were hollow: a whole lot of “zeroes.” To my surprise and concern, this individual seemed to be merrily “making a go of it” with their impotent data. Once I was convinced that they were absolutely going to follow through with a degree-eligible study (that no one “understood”), sarcasm got the best of me: “Gee, Jeff (identity disguised), you’ve been at it with those data for some time. Are any hypotheses beginning to shake out of your analyses?”

“Working over” data in hope of finding a reasonable hypothesis is a breach of proper research and clearly unethical, whether one knows it or not. But it happens – more today than ever before.

“Big data” has become the Sirens’ song, luring unwitting (like my grad school colleague) or unscrupulous prospectors in search of something – anything – statistically significant. But that’s not the way science works. That’s not how knowledge is advanced. That’s just “rack-n-hack” pool where nobody “calls their shots.”

It isn’t prediction if it’s already happened.

The probability (or statistical significance) of any “prediction” made about an already known outcome is always perfect (hence, a “foregone” conclusion). This is also the source of many a superstition. Suppose you win the lottery by betting on your boyfriend’s prison number. To credit your boyfriend’s “prison name” for your winnings would be a mistake (and not just because he may claim the booty). Neither his number nor your choice of it had any influence in determining the outcome – even though you did win. But if we didn’t care about “calling our shots,” we’d argue for the impossibly small odds of your winning ticket as determined by your clever means of its choice.

This error of backward reasoning is also known by the Latin phrase, post hoc, ergo propter hoc, or, “after this, therefore because of this.” It isn’t valid to name a cause after the fact simply because the effect happened to follow it. Unfortunately, the logic may be obvious, but the practice isn’t.

Sophisticated statistical methods can confuse even well-intentioned researchers who must decide which end of the line to put an arrow on. In addition, the temptation to “rewind the analysis” by running a confirmatory statistical model (i.e., a “calling my shot” analysis) AFTER a convenient exploratory finding (i.e., “rack-n-hack” luck) can be irresistible when one’s career is at stake, as is frequently the case in the brutal academic world of “publish or perish.” But doing this is more than unprofessional; it’s cheating and blatantly unethical. (Don’t do this.)
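
For a feel of how much “rack-n-hack” luck is lying around in a big pile of data, here is a minimal Python sketch. Everything in it is made up for illustration: the outcome and all 1,000 “predictors” are pure random noise, the function names are my own, and the p-value comes from the Fisher z approximation rather than an exact test. By construction nothing here predicts anything, yet roughly 1 in 20 predictors should still clear the usual p < 0.05 bar.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z approximation (not exact)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def dredge(n_rows=200, n_predictors=1000, alpha=0.05, seed=1):
    """Count noise predictors that look 'significant' against a noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_rows)]
        if approx_p_value(pearson_r(predictor, outcome), n_rows) < alpha:
            hits += 1
    return hits


hits = dredge()
print(f"{hits} of 1000 pure-noise predictors look 'significant' at p < 0.05")
```

Pick the best-looking “hit,” write a hypothesis around it after the fact, and you have a publishable result built on nothing. The hypothesis has to come before the dredging, not after.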

Never before has the possibility of bad research making news been so great. Massive datasets are flung about like socks in a locker room. Sophisticated analyses that once required an actual understanding of the math in order to do the programming can now be done as easily as talking to a wish-granting hockey puck named “Alexa.” (“What statistical assumptions?”) Finally, publishing shoddy “research” results to millions of readers is as easy as snapping a picture of your cat.

All of the aforementioned faux pas (or worse) concern data “on the table.” The most insidious risk when drawing conclusions from statistical analyses – no matter how “big” the data are – is posed by the data that AREN’T on the table.

A study may legitimately find a statistically significant effect on children’s grades based on time spent watching TV vs. playing outdoors. The study may conclude, “When it comes to academic performance, children who play outside significantly outperform those who watch TV.” While the finding itself may be true, its causality is uncertain. Something that isn’t on the table (parental involvement, say) could be driving both the outdoor play and the grades.
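
A small Python sketch shows how data that aren’t on the table can manufacture a finding like this. The numbers are simulated and every name is hypothetical: a hidden factor (call it whatever you like) nudges both outdoor play and grades, while outdoor play is given no effect on grades at all. The raw correlation should come out clearly positive anyway; once the hidden factor is accounted for with the standard first-order partial correlation formula, it should shrink to about zero.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(7)
n_children = 5000

# Hypothetical hidden factor: the data that "aren't on the table."
hidden = [rng.gauss(0, 1) for _ in range(n_children)]

# Both variables depend on the hidden factor; grades never look at play.
outdoor_play = [h + rng.gauss(0, 1) for h in hidden]
grades = [h + rng.gauss(0, 1) for h in hidden]

raw_r = pearson_r(outdoor_play, grades)
adjusted_r = partial_r(outdoor_play, grades, hidden)

print(f"raw correlation, play vs. grades:   {raw_r:.2f}")
print(f"after adjusting for hidden factor:  {adjusted_r:.2f}")
```

Of course, the adjustment is only possible here because the simulation lets us see the hidden factor. In a real dataset, nobody hands you the column you forgot to collect.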

To further complicate things, cognitive biases work their way into the hornet’s nest of correlation vs. causation. In an effort to ease the burden on our overworked brains, we tend to throw correlation and causation together in our “cognitive laundry bin.” Put bluntly, as far as our brains are concerned, correlation is causation.

Although it’s easy to mentally “jump track” from correlation to causation, the opposite move, i.e., from causation to correlation, does not come so naturally.

Cigarette makers were “Kool” (can I get in trouble for this?) with labeling that claimed an “association” between smoking and a litany of health problems. They were not-so-Kool with terminology using the word “causes.”

Causal statements trigger a more substantial and lasting mental impression than statements of association. “A causes B” is declarative and signals “finality,” whereas “A is associated with B” is descriptive and signals “probability.” Depending on how a statement of association is positioned, it can very easily evoke an interpretation of causation.

Sometimes obfuscation is the author’s goal; other times it’s an accident or merely coincidental. Either way, the result is misleading (at best) when our eyes for big data are bigger than our stomachs for solid research.
