Big Data, Small Tool

I’m as susceptible as anyone to the allure of big data.  We all want to know how things work, and data visualization gives us a sense of knowing, without all the work of actually learning.

As consultants, we spend a lot of time extracting useful insights from data.  The data is often inconsistent or incomplete, and rarely is it structured to permit the kinds of analysis we apply to it.  Naturally we love it when we’re presented with an abundance of good data.  We’re even more excited when it comes with its own analytic tools.

When a colleague showed me Google Books Ngram Viewer yesterday, I got a tingly feeling all over.  It was like being given a key to a vault full of secrets.  In it I would find answers to questions I hadn’t thought to ask yet, which is, after all, the essential promise of big data.

Ngram lets you see and compare the frequency with which words and phrases appear in the language and rise or fall in importance.  It instantly charts the results across a span of centuries, sometimes capturing the entire life span of a word.  The data behind these charts is the billions of words in the millions of books that Google has acquired.  What fun!

As a matter of principle, I ignored all forms of help or direction Google offered.  I’m sure Ngram has features and capabilities I’ve missed, so I can’t offer anything more than my novice impression.  I started by playing with place names and historic figures.  The first lesson I learned was that the printed word, which was the only mass medium prior to radio and television, is very limited as a means of documenting some more recent aspects of social change.  So, when I juxtaposed “Snoop Dogg” with “Martin Luther King,” I saw that Snoop hardly registers at all in the language.

Of course, electronic media have picked up where publishing left off.  Snoop’s name may not be cited less often than Reverend King’s; it’s simply cited in different places.  Judging by frequency of mention in media other than print over the past decade, it might appear that King is the dwarf and Snoop is the giant.  Google provides an example on the Ngram site which clearly shows that Frankenstein is better represented in print than is Einstein.

If I’m right, this explains why Mackenzie King appears as a giant in print, despite the immense stature and historical significance of more recent Canadian Prime Ministers.  “Real time” and retrospective writing about King’s longevity, peculiarity, and policies seems to skyrocket as historians digest the consequences of his war-time leadership.  But by the time Lester Pearson takes office, television is starting to take over political reportage and documentary programs.  Despite his global significance as a Nobel Peace Prize winner and his prominence in broadcast media at home, Pearson’s Ngram profile is tiny when compared with Mackenzie King’s.


Though I’m no historian, I can see that all of King’s successors are less prominent in print than he was.  Despite saturation reporting on historic moments like the War Measures Act, patriation of the Constitution, and the Free Trade Agreement, modern Prime Ministers aren’t cited as frequently in print as he was.

The difference is so pronounced that an explanation is required, but all I’ve got are hypotheses.  King, who was appointed Deputy Minister of Labour in 1900, and who left office in 1948, left a long trail through his record in government for historians to follow.  He was also a prolific diarist who left 50,000 pages behind.   This inordinately rich body of work may have spawned a cottage industry for historians who studied and wrote about him, and about each other’s views of him, for a generation after his death in 1950.

Another of my theories is that Ngram doesn’t take into account how widely its texts have been distributed and read.  Ngram doesn’t count the total number of references; it calculates the ratio of references to all other words in the collection.  It also treats each published work as if only one copy exists, whereas the impact on readers would vary according to the number of copies being circulated among readers.  It’s possible that the total volume of published works during the time of Trudeau and Mulroney was vastly greater, in terms of total words and in terms of copies in print.  If these differences could be factored into a comparison of these leaders, a very different picture might emerge.
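For anyone who wants to see the arithmetic behind that theory, here is a minimal Python sketch of the difference between a raw count and a share of all words.  It is only my guess at the kind of ratio involved, not Google's actual method, and every number in it is invented purely for illustration; the function name and the two sample years are my own assumptions.

```python
def relative_frequency(mentions: int, total_words: int) -> float:
    """Return the share of all words in a year's collection taken up by a name."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return mentions / total_words


# Invented figures, for illustration only: they are NOT real Ngram data.
corpus = {
    1945: {"mentions": 2_000, "total_words": 1_000_000_000},
    1985: {"mentions": 6_000, "total_words": 6_000_000_000},
}

# Compute the share for each hypothetical year.
shares = {year: relative_frequency(**row) for year, row in corpus.items()}

for year, share in shares.items():
    print(year, corpus[year]["mentions"], f"{share:.2e}")
```

In this made-up case the later year has three times as many raw mentions, yet its share of all printed words is smaller because the collection itself grew sixfold.  A chart of shares would show a decline even though more was being written about the name.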

Nevertheless, the frequency of words used to express what’s on the minds of writers, and by extension, their publishers, and by further extension, the reading public, gives us some useful clues about where our attention has been focused as a society.  So, despite having learned more about what the tool couldn’t do, I stuck with it, still certain it would improve my understanding of something or other.

If Ngram couldn’t reflect the influences of broadcast and digital media on the print industry, I thought it could at least shed some light on relationships between words that are commonly linked in everyday parlance.  My mind idiotically jumped to an old Sinatra song.  The lyrics go, “love and marriage, love and marriage, they go together like a horse and carriage… this I tell you, brother, you can’t have one without the other…”

“Is that so?” I wondered, “and how would Sinatra know?”  According to Ngram, horses have always been featured more prominently than carriages in the written word; however, both drop steadily after 1900, about the same time as the word “automobile” makes its appearance in the language.

This result was so clear, I thought I’d try it on other sunset industries and emergent technologies.  But when I tried “buggy whip” and “Google Glass,” all I learned was that nobody ever really wrote or thought much about buggy whips except business advisors trying to sound clever about obsolescence.

Back to the song, I next tried “love,” and, “marriage,” sure I’d hit a home run.  Disappointingly, there is no apparent relationship, even when I tightened the time frame to post-1940.  I tried adding the word, “divorce,” and still found relatively little variance in the frequency of mention.  I then pulled up some actual divorce rate statistics and saw that they vary in relation to socially disruptive events, such as world wars or economic disasters, more than they appear to be influenced by writings about love.  Duh!

Thus I’d found another way Ngram couldn’t help me.  Apparent relationships between words don’t mean much outside the context of social, political, and economic change.  I’d started out thinking this tool would answer questions I hadn’t thought to ask yet, but all it had given me so far was more questions.  Still I was having fun.

Clearly I was asking too much of this application.  I’d never get anything from it unless I simplified the question, so this is what I looked at next:  Women make up half the population, and it’s become a media commonplace to say that many prefer chocolate to sex.  If this is true, there should be a lot of “chocerotica” or “chocornography” in the burgeoning archives of chick-lit, shouldn’t there?

Well, it’s not so, according to Ngram.  Sex has been top of mind for ages, judging by word occurrences, whereas chocolate has been of far less interest than some women would have us think.

Or how about this?   I’ve heard more than once that a sneeze is physiologically comparable to an orgasm.  If that’s true, wouldn’t they get roughly equal treatment in the literary world?  Given how highly we rate orgasms on the scale of human experience, both should be top of mind among writers (except among those that prefer chocolate).


Instead what we discover is that the word “orgasm” appeared less frequently in print than the word “sneeze” until the mid-’30s, and didn’t begin its great ascent until the end of WWII.  Its relative popularity in print seems to have subsided somewhat at about the same point in history when North American doctors began diagnosing Kaposi’s sarcoma in young men in the late ’70s, signalling the start of the HIV/AIDS epidemic, though this may be pure coincidence.  In any case, in print as in life, the sneeze never caught on in quite the same way as the orgasm, as the Ngram proves.

Men are accused of thinking about sex and, by implication, orgasms, every minute of the day, so it would also be logical to expect that the word “orgasm” would occur more frequently than printed descriptions of many other everyday experiences.  Yet it’s surprising how low both “orgasm” and “sneeze” rank when compared to other intensely gratifying experiences and practices such as, say, “prayer.”


Really?  Do we really write this much more about prayer than we do about orgasm?  Thank God!  There’s hope for humankind yet.

Let me conclude with an example of Ngram at its finest.  I started trying words that are actually opposed to each other, not words that are incidentally paired like love and marriage, horse and carriage, sex and chocolate, orgasm and sneeze, or names like Snoop Dogg and Dr. Martin Luther King.  I wanted to see words entering the lexicon and fading away as one directly impacts another.

I chose the words, “racism,” and “nigger,”  because one is obviously used by people who are unlikely to use the other with comparable frequency.  Even an academic or policy document is likely to express a point of view favouring the use of one over the other.  Look what happens when they’re compared head to head:


I find it fascinating that “nigger” appears in the printed word over a century before “racism” arises with any frequency.  The moment “racism” starts appearing in publication, “nigger” goes into decline from its historic high point around 1940.

Again, it’s tempting to speculate why “nigger” keeps its place in the language.  It could be that black activists appropriated the politically potent term around that time, keeping it alive until rap artists cemented it in the language as an honorific.  I really don’t know, but I’m curious.

Still, I was startled and impressed to see how consciousness of race, and presumably discrimination and inequality, leaps from the printed page in 1940, continually gaining in power until the present day.  In ordinary, everyday language, use of “racism” has eclipsed the detestable N-word, which now serves mainly as artifact, valued most by comedians and celebrity gangstas, who keep us appropriately uncomfortable about the many more persistent manifestations of racism still at work around us.

Anyway, hours after I started working with this labour-saving, analytic tool, I finally feel I’ve learned something from Ngram.  I’m tempted to drill deeper into the language of race, among some of the other topics I touched along the way.  I’m sure there are many other cases where Ngram can tell us when one word was vanquished or replaced by another, providing clues about what’s really on our minds when all our talk is about sex, love, and chocolate.

But I haven’t got time for that now, and neither do you.  Back to work, everyone.

[Update: oh wait, this just came in: “14 Google Tools You Didn’t Know Existed”]
