The data paradox

December 12, 2005 | Uncategorized

In a public-policy context, data availability and reliability present a paradox:

 

Two_doctors_grim

 

“I can’t believe he’s starting off with that hoary pun.”

 

In affordable housing, as in many areas of public policy, effective program design depends on having quality information available to policy-makers, and that leads us to the first data paradox:

 

Two_doctors_walking 

“What do you look like without the Groucho Marx glasses?”

 

1.         Data is valuable only when publicly available, meaning free

 

As my old college classmate Richard Stallman first preached several decades ago, “information wants to be free.”  And while I do not extend that reasoning as far as Richard does (few do), there is no question that:

 

Information is more broadly valuable when it is freely available.

 

This is especially true of housing.  Housing is physical property: identifiable spaces that are valuable only when occupied by identified people who pay quantifiable sums to buy, rent, maintain, improve, or operate them.  So everything that we care to know about housing demand, housing supply, and housing finance is or can be readily quantified.  Knowing the data at the level of gritty detail (city, neighborhood, block) is the difference between flying blind and navigating from a GPS map.

 

Related to this reality is the demonstrable truth that when information circulates more rapidly, aggregate market activity expands. 

 

Arrayed against the force of light (transparency) is the selfish interest in information advantage.  When data is known only to a few, that information advantage is huge, especially in financial market arenas.  Ever since Victorian banker James Rothschild’s famous complaint that with the telegraph, “anyone can get the news,” making their profit “before the Bourse closes the same day,” market traders have sought to hoard information for competitive advantage.

 

James_de_rothschild 

“How can I maintain advantage when everybody knows what I know?”

 

Data compilations, preferably in active manipulable form (that is, electronic databases), are essential to modern finance.  Statistical compilations, indices, comparatives, risk curves, yield curves, contact lists, contact networks: all these have great policy benefit, but if kept proprietary they enable some market movers to surf the wave while adding no value to the ecosystem.

 

Therein lies the first rub: data is societally valuable only when free (in business-speak, the collaborative space), but is privately valuable when restricted (the competitive space).

 

Which leads to the second paradox; indeed, there is a pair o’ paradoxes:

 

Two_doctors_peering 

“Yes, blogging has destroyed his entire cerebral cortex.”

 

2.         Data is very costly to assemble

 

Even as data is valuable only when free:

 

Useful data is costly to assemble.

 

What makes data useful versus useless?  Accuracy.  Without reliability, statistics are deceptive, conclusions unsupported. 

 

Unwarranted data is mere noise.

 

Two_doctors_in_surgery 

“No doubt about it, blogging rots the brain.”

 

But accuracy in data is not free; it costs something.  The best warranty of accuracy is real-time automatic collection (think temperature, appropriate as I sit here on a wintry morning, or your checkbook balance).  Further, both of these examples have several important features that make them easy data to collect:

 

  1. Intrinsic numerical quantity.  They report in numbers.  Temperature is a continuum but we have thermometers for that.
  2. Standardization.  All thermometers are calibrated in either Fahrenheit or centigrade.
  3. External verifiability.  I claim the temperature is 55; you glance at the thermometer and refute me.
  4. Easy real-time collection.  Hook up a thermometer to a recorder and you can produce a continuous record of temperature change, and with scarcely an effort you can digitize it into an electronic database.
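
For the technically inclined, the fourth point can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: `read_thermometer` is a hypothetical stand-in for a real sensor, and the table and column names are invented for the example.

```python
import sqlite3
from datetime import datetime, timezone


def read_thermometer() -> float:
    """Hypothetical stand-in for a real sensor: returns a fixed Fahrenheit reading."""
    return 31.5


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database holding the continuous temperature record."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS temperature ("
        "taken_at TEXT NOT NULL, degrees_f REAL NOT NULL)"
    )
    return conn


def record_reading(conn: sqlite3.Connection, degrees_f: float) -> None:
    """Store one observation, stamped at the moment of collection (UTC)."""
    taken_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO temperature (taken_at, degrees_f) VALUES (?, ?)",
        (taken_at, degrees_f),
    )
    conn.commit()


if __name__ == "__main__":
    log = open_log()
    record_reading(log, read_thermometer())
    for row in log.execute("SELECT taken_at, degrees_f FROM temperature"):
        print(row)
```

Each reading is a number, in a stated unit, stamped when it was taken: the features in the list above, captured with scarcely an effort.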

 

Temperature’s a good illustration of the further phenomenon that certain data must be collected contemporaneously or the observation opportunity is permanently lost.  (Global warming: isn’t it, is it, or how bad is it?  Records of temperatures before about 1800 are fraught with uncertainty.)

 

Now let’s shift to something more practical in a housing context: prices. 

 

Can we capture home prices?  Our first problem is the lack of real-time collection, since houses sell (and so yield observable data) infrequently; our second problem is standardization (no two houses are the same, and they differ across many variables, some of them of major impact).

 

Once we lose standardization, the intrinsic numerical quantity is at risk, in that we no longer know what is a ‘comparable.’  Judgment enters in — and however virtuous, expert, and careful our judge, judgment introduces variability, or in mathematical terms, potential error.

 

So our data becomes much less reliable.  (Even the Economist has trouble quantifying what’s going on with home prices.)

 

By this reasoning, rents are somewhat easier to specify: they are more frequently sampled (typically monthly), more prone to standardization (multiple similar apartments in the same property), and more readily verifiable (‘shop the property’).  So it should be no surprise that rental data is fairly readily sampled, even as there is difficulty producing meaningful fair market rent statistics across a metropolitan area (leading to flaws in FMR calculations).

 

Let’s take one more step: operating expenses.

 

At first blush, this should be no more difficult than rents: they are paid monthly, they are intrinsically quantitative (you write the checks), and in many areas they are standardized (electricity is the same regardless of whose space you’re heating or lighting).  But operating expenses are not publicly reported, and there goes one’s utility.  Each operator chooses to hoard its data, and such studies as one obtains (IREM, ULI, National Apartment Association) are all compilations of data that is self-reported and unverified.

 

The operating expense case is a particularly poignant example of the data paradox.  For over a decade my company has bought these studies, paying in the range of $125 a pop for them, even though I know full well that any given entry has a large potential error in it.  We keep buying for several reasons:

 

  • It’s better than nothing.  Self-selected or not, self-reported or not, some actual effort went into the compilation and there was a good-faith effort to standardize.  Each of the associations did some actual work on the task.
  • It may provide useful spatial comparisons.  If all the data is subject to the same external forces and vagaries, then maybe at least we can make comparisons between markets (e.g. Austin is such a percentage more expensive than San Antonio).
  • It gradually builds up a time series.  If the same kind souls report their properties year after year, we may be able to extract trends across reports.
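
The spatial-comparison point can be put in arithmetic.  The sketch below is illustration only: the dollar figures are invented, and it assumes the distortion is a single multiplicative bias shared by every market, which real self-reported data may not honor.

```python
# Hypothetical true annual operating expenses per unit (invented for illustration).
TRUE_COST = {"Austin": 4400.0, "San Antonio": 4000.0}


def reported(true_cost: float, shared_bias: float) -> float:
    """Self-reported figure, distorted by a bias assumed common to all markets."""
    return true_cost * shared_bias


def premium_pct(a: float, b: float) -> float:
    """Percentage by which a exceeds b."""
    return (a / b - 1.0) * 100.0


def spatial_premium(shared_bias: float) -> float:
    """Austin's premium over San Antonio, computed from the distorted reports."""
    austin = reported(TRUE_COST["Austin"], shared_bias)
    san_antonio = reported(TRUE_COST["San Antonio"], shared_bias)
    return premium_pct(austin, san_antonio)


if __name__ == "__main__":
    # Under-reporting (0.85) or over-reporting (1.20) shifts every level,
    # but the shared factor cancels in the ratio between markets.
    for bias in (1.0, 0.85, 1.2):
        print(f"bias {bias:.2f}: Austin premium {spatial_premium(bias):.1f}%")
```

Every reported level is wrong, yet the comparison between markets survives.  If the bias differs from market to market, this comfort evaporates.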

 

So with this last point, we also observe something else about data: each data element becomes more valuable if it is placed within a large data library, all of whose entries are reliable information.  Data becomes more honorable as it associates with quality fellows.

 

On the flip side, data that associates with lowlife imprecision becomes tainted by association. 

 

Two_doctors_white_computer 

“Can you really remove spam by bleaching the monitor?”

 

Wikipedia, the on-line anonymously compiled encyclopedia of everything, is extraordinarily successful, and very frequently linked and referenced, but many folks categorically will not rely on it for citations for the simple reason that anyone can edit it, and thus now and then an entry is proved to be massively, perhaps even libelously, wrong.

 

This brings us to the third data paradox:

 

Two_doctors_pointing_concern 

“See, right here is where he makes fun of doctors.”

 

3.         No one has yet demonstrated a viable business model for policy data

 

Herewith the third paradox:

 

Three_stoges_doctors

“Oh, now you’re stooping really low …”

 

There is no viable business model for data dissemination

 

If the two preceding paradoxes are true (data is valuable only when free, and costly to assemble), then the third follows inevitably.  Who can afford to devote resources to compiling valuable data?  Who makes money giving value away?

 

Dr_Strangelove_3 

“What is the logarithm of giving away data?”

 

There are well-established business models to sell data:

 

  • Reproduction-proof limited circulation.  Few readers will remember the stock-picking newsletters, typically printed on copy-proof red stock (which made reading them a torture).
  • Time-dependent updates.  Michael Bloomberg (whatever happened to him?) made a billion creating a restricted-access real-time update service.  Like Rothschild’s customers a hundred and fifty years back, people will pay huge sums to know sooner, and the amount sooner has shrunk as information moves faster.

Key to these information-selling business models is that (paraphrasing Britain’s SAS) who pays, knows.  Who doesn’t pay, doesn’t know.  In the established business models, information is born free but is everywhere in chains, rendering it useless for public policy.  Paraphrasing Dr. Strangelove:

 

Of course, the whole point of a public-policy database is lost, if you keep it a secret! Why didn’t you tell the world, eh?

 

Strangelove_talking_cig 

“Why didn’t you tell the world, eh?” 

 

The internet has destroyed that business model (as it has destroyed so many others): with high bandwidth, data can be converted to electrons and sent everywhere, to everybody, instantaneously.  So many a pay-per-view model (e.g. the old print newspaper) is under enormous business pressure because the cost of competitors (like bloggers!) is rapidly dropping to zero and the quality premium for warranted data is being eroded by a steady drip of embarrassing hoaxes and corrections (from Stephen Glass and Jayson Blair onward).

 

Facing the trio of data paradoxes, how do we get useful data into the public arena?

 

That, the subject of a future post, I may charge you to find out.

 

Dr_Strangelove_1 

 
