Sunday, December 16, 2012

What Percent Are You?

See discussion after the interactive...

(Note: If the chart goes blank it means that you have hit a group that is too small to create a valid distribution for. Just choose different attributes to continue.)

Though interest in the "99% movement" seems to have subsided (see Google Trends below), I have been fascinated by the calculators it spawned.

First there was this simple calculator from the Wall Street Journal (clicking the image takes you to the calculator at

This is very easy to use, but it pertains to "tax-filing units", which aren't necessarily individuals, and it doesn't allow for any adjustments to account for things like age, sex, race, education, and geography, which are known to affect income distribution.

Then we got this wonderful interactive infographic from the New York Times (clicking the image takes you to the calculator at

If you enter a household income it allows you to select from 344 geographic areas (including the entire US) using a map, and then shows the percentile of the given income plus the full income distribution. However, this pertains to households and, while it does let you adjust for geography (very elegantly), it doesn't allow for adjustments for age, sex, etc.

I also just came across another tool, which pools 5 years of data and uses Google Maps to show median household income at the census tract level. Very cool! But it is based on households, not individuals, and doesn't show the whole income distribution or allow for any demographic adjustments.

However, I wanted to be able to compare myself to other individuals like myself. Not tax-filing units or households, but individuals. That means adjustments for the age, sex, race, education, and geography of individuals. The US Census Bureau's American Community Survey has the data for this kind of comparison, and I've been very curious to try out Tableau's free public offering.

I used the two to come up with the interactive above, based on 2011 data (the most recent year available) for individuals with earnings (and there were about 157M of us). According to the ACS documentation:

"Earnings are defined as the sum of wage or salary income and net income from self-employment. Earnings represent the amount of income received regularly for people 16 years old and over before deductions for personal income taxes, Social Security, bond purchases, union dues, Medicare deductions, etc. An individual with earnings is one who has either wage/salary income or self-employment income, or both."

While my interactive doesn't have a map, it does allow you to adjust for the age, sex, race, education, and geography of individuals.
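For anyone who wants to try this kind of comparison on their own extract of the data, here is a minimal sketch of the idea: filter to individuals who share your attributes, then find where your earnings fall in that group's weighted distribution. This is not the Tableau workbook behind the interactive. The field names (`earnings`, `weight`, `sex`, `age_band`), the sample records, and the minimum group size are all assumptions made for illustration.

```python
MIN_GROUP_SIZE = 30  # hypothetical cutoff; smaller groups give no valid distribution


def earnings_percentile(records, my_earnings, **attributes):
    """Weighted percentile of my_earnings among records matching all attributes.

    Each record is a dict with an "earnings" value, a survey "weight",
    and whatever demographic fields you want to filter on.
    Returns None when the matching group is too small.
    """
    group = [r for r in records
             if all(r.get(key) == value for key, value in attributes.items())]
    if len(group) < MIN_GROUP_SIZE:
        return None
    total_weight = sum(r["weight"] for r in group)
    weight_at_or_below = sum(r["weight"] for r in group
                             if r["earnings"] <= my_earnings)
    return 100.0 * weight_at_or_below / total_weight


# Made-up records: 100 people earning $1,000, $2,000, ... $100,000.
sample = [{"sex": "F", "age_band": "35-44", "earnings": 1000 * i, "weight": 1.0}
          for i in range(1, 101)]
print(earnings_percentile(sample, 50_000, sex="F", age_band="35-44"))
print(earnings_percentile(sample, 50_000, sex="M"))  # no matches, so None
```

Returning `None` for a tiny group mirrors the note above about the chart going blank when a group is too small.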

Friday, October 26, 2012

Health Systems

At long last my paper with Prof. Dhar at NYU Stern on data mining in healthcare has been published in the journal Health Systems. Abstract here:

Friday, November 25, 2011

Global Correlation of Education with Economic Productivity

Another Stern HW piece I thought was interesting and worth posting.  This one is about the correlation of education with productivity at the country-level. 

Here I look at two data series, both from the World Bank's World Development Indicators Database.  The first is GDP per capita based on PPP in constant 2005 international dollars (NY.GDP.PCAP.PP.KD), and the second is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the tertiary level of education (SE.TER.ENRR). Tertiary education, regardless of whether it leads to an advanced research qualification, normally requires as a minimum condition of admission the successful completion of education at the secondary level.  Both series start to have meaningful numbers of country-level measurements in 1971.

I start by plotting the regression in 1971 of GDP per capita on percent enrolled in tertiary education. 

This regression line shows that, in 1971, each percentage point increase in the percent enrolled in tertiary education is associated with an additional $547.84 in GDP per capita.  (Red dots are OECD nations.) The R-square statistic also shows that a significant amount of the variation in GDP per capita is explained by variation in the percent enrolled in tertiary education.

However, by 2008, the last year available in the series, the level of correlation had changed.

In 2008 the regression line is much flatter and each percentage point increase in the percent enrolled in tertiary education is only associated with an additional $193.78 in GDP per capita.  The R-square statistic also shows that, by 2008, far less of the variation in GDP per capita is explained by the percent enrolled in tertiary education.

I ran the regressions for each year from 1971-2008 and plotted those lines on a single chart.  A very clear pattern emerges.

The correlation of education with productivity at the country-level appears to be decreasing over time.  Of course correlation and causation are not the same thing - but given the emphasis the economic development community places on education, this would seem to be a pattern worth exploring further. 
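The year-by-year regressions can be reproduced with a few lines of code. The sketch below fits an ordinary least squares line of GDP per capita on tertiary enrollment for each year and reports the slope and R-square. The numbers in `panel` are made up for illustration and are not the World Bank data; the function names are my own.

```python
def ols(xs, ys):
    """Simple one-variable least squares: returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def fit_by_year(panel):
    """panel maps year -> list of (enrollment_pct, gdp_per_capita) pairs."""
    return {year: ols(*zip(*pairs)) for year, pairs in sorted(panel.items())}


# Hypothetical country observations, NOT the actual series.
panel = {
    1971: [(2.0, 1500.0), (5.0, 3200.0), (10.0, 6100.0), (20.0, 11000.0)],
    2008: [(10.0, 4000.0), (30.0, 9000.0), (60.0, 14000.0), (80.0, 16000.0)],
}
for year, (slope, intercept, r2) in fit_by_year(panel).items():
    print(year, round(slope, 2), round(r2, 3))
```

A falling slope across years is what the flattening lines in the combined chart represent.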

Monday, October 3, 2011


I posted earlier about some work I've been doing with Prof. Dhar at Stern, and some of it is finally coming to fruition. We'll be presenting it at the University of Maryland Robert H. Smith School of Business' Workshop on Health IT and Economics - slotted for section 4.3.

Thursday, August 25, 2011

Hotspots for Healthcare Savings

A fascinating story came up on PBS' Frontline in July.  Dr. Jeffrey Brenner started analyzing billing data for healthcare utilization in Camden, NJ.  He was able to detect "hotspots" of high healthcare utilization.  He found that 1% of Camden's residents account for 30% of its healthcare costs and that 5% of its residents account for 60% of the costs.  By focusing extra attention on these high utilizers Dr. Brenner has been able to cut their costs by 40%-50%.  And the story goes on to ask the obvious question: what if this could be applied to the entire nation?

As you can see from previous posts I've been playing around with data from the Medical Expenditure Panel Survey (MEPS) and I thought it would be interesting to see if these same kinds of "hotspots" could be found nationally.  In fact, MEPS actually looked at this themselves several years ago, and they also have their own very informative report on healthcare costs using the same 2008 data I used for the charts below.

One thing I found interesting is that 16% of Americans produce no healthcare costs whatsoever, while the other 84% of us cost $1.15 trillion in 2008.

Next I looked at who is paying for what and for whom.  I thought it was interesting that private insurance and private individuals paid for more than half (57%) of the total bill.  Also, while the pharmaceutical industry gets a lot of bad press, hospitals (29% or $330 billion) and physicians (24% or $271 billion) receive a larger share of the pie than pharma (22% or $247 billion).

As for Dr. Brenner's observations, it appears that the national picture is somewhat different - though still quite attractive from a "hotspot" perspective.  Here the top 1% of utilizers generate $211 billion or 18% of all healthcare costs, and the top 5% of utilizers generate $504 billion or 44% of the bill.  If Dr. Brenner's methods were to achieve similar success nationally with the top 1%, cutting their $211 billion by 40%-50%, then between $84 billion and $106 billion in savings might be achieved.  That's 7.3%-9.2% of the total bill, not too bad.
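The "top 1%" and "top 5%" figures are concentration shares, and the calculation is simple to sketch: rank people by cost, take the highest-cost slice, and divide by the total. The version below is unweighted and runs on made-up numbers. A real MEPS calculation would need the survey's person weights, which this sketch leaves out.

```python
def top_share(costs, top_fraction):
    """Share of total cost generated by the highest-cost top_fraction of people."""
    ranked = sorted(costs, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # size of the top slice
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical annual costs for 100 utilizers (people with zero cost excluded).
costs = [50_000.0] + [10_000.0] * 4 + [1_000.0] * 95
print(top_share(costs, 0.01))
print(top_share(costs, 0.05))
```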

However, the doctor's program has costs.  It costs $225k to save 40%-50% of $10 million in his case.  So the cost is between 4.5% and 5.6% of the savings which would work out to be somewhere between $3.4 billion and $6.4 billion at the national level.  Taking the program cost into account reduces the savings to between $78 billion and $103 billion, or between 6.8% and 9.0% of the total bill.  I'd say that's money well spent.
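Here is the back-of-the-envelope arithmetic as a sketch. The dollar inputs come from the figures above. The scaling rule is my assumption: the program's cost grows in proportion to the spending it covers ($225k per $10 million). Under that rule the program cost comes to roughly $4.7 billion, which sits inside the $3.4 billion to $6.4 billion range quoted above; the exact endpoints depend on how the pilot is scaled up.

```python
TOTAL_BILL = 1150e9      # $1.15 trillion in 2008 healthcare costs
TOP_1PCT_COST = 211e9    # costs generated by the top 1% of utilizers
PILOT_COST = 225e3       # cost of the Camden program
PILOT_BASE = 10e6        # spending the Camden program was applied to


def national_estimate(savings_rate):
    """Return (gross_savings, program_cost, net_savings, net_share_of_bill)."""
    gross = TOP_1PCT_COST * savings_rate
    # Assumption: program cost scales linearly with the spending covered.
    program_cost = TOP_1PCT_COST * PILOT_COST / PILOT_BASE
    net = gross - program_cost
    return gross, program_cost, net, net / TOTAL_BILL


for rate in (0.40, 0.50):
    gross, cost, net, share = national_estimate(rate)
    print(f"{rate:.0%}: gross ${gross / 1e9:.1f}B, cost ${cost / 1e9:.1f}B, "
          f"net ${net / 1e9:.1f}B ({share:.1%} of the bill)")
```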

But one of the points the doctor makes is to use the data to find the ripest areas for intervention.  So, who are these top 1%?  What do they look like?  Luckily the MEPS data has lots of information about its participants.  Here's a profile:

The chart compares the top 1% to all utilizers across a number of dimensions.  Perhaps not very surprisingly, the top 1% are much more likely to have an activity, cognitive, social or walking limitation.  So they are going to have trouble doing lots of things the rest of us might take for granted, like getting around, interacting with people, and understanding what's going on.  They are far more likely to assess themselves as having fair or poor health or mental health.  This does seem like a sober self assessment given the amount of healthcare they consume.  Finally, they are far more likely to be obese (BMI>30) which certainly can't help.

Demographically speaking, they are more likely to be in the low income category rather than in the middle or high income categories.  Perhaps related to their lower incomes, they are less educated - less likely to have made it through high school or college.  It isn't surprising at all to find that more than half of these top 1% are 55 or older, since prevalence of many diseases is higher in older populations.  Ethnicities other than white and black are under-represented among the top 1%, and the group is just slightly more rural. I did not expect to see that they are more likely to be found in the West, given the healthy stereotype usually enjoyed by the West in the popular media.

Dr. Brenner's plan seems like it could be an important component of an overall effort to bring healthcare costs down.  I certainly hope it spreads outside of Camden!

Sunday, June 5, 2011

Housing Bubble

I can't recommend NPR's Planet Money podcast enough. I was a little disappointed, though, in their final entry in their series on gold and the meaning of money.  In it they have economists making claims that bubbles shouldn't happen according to economic theory, and yet there they are. To me this looks like gambling, which economists have explained by accounting for the value the gambler places on his enjoyment of the activity itself and his typically inaccurate perceptions of the risks involved, among other more complicated explanations.  For the housing bubble, before the bubble became apparent it would be reasonable for an investor to assume that the asset price was inflating at a higher than normal rate for some unknown temporary period.  At some point, though, the bubble became apparent and then gambling ensued. 

According to Google Trends, the term "housing bubble" in the US first exceeded its Jan-2004 through May-2011 average traffic in Q3 2004.  By Q3 2005, traffic on the term "housing bubble" was 2.85 times the average - the peak.  My interpretation of this is that everybody found out that there was a housing bubble, or at least that there was the possibility of a housing bubble, between Q3 2004 and Q3 2005.  But according to the Composite US S&P/Case-Shiller Home Price Index, housing prices didn't peak until Q2 2006.  If a gambling man purchased real estate when the "housing bubble" buzz first rose above its average on Google in Q3 2004, and then sold at the peak in Q2 2006, he would have been ahead 19.8%.  Even if he waited until Q3 2006, the first time the Case-Shiller actually declined after Q3 2004, he would still be up 18.7%.  The risks being taken here are that the quarter in which the real estate is purchased might actually be the peak, plus the transaction costs (which are not trivial in the case of real estate).

But in the face of a possible 18%-20% upside, it's not hard for me to see why gamblers were rolling the dice even though they knew there was a bubble.
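The gambler's payoff above is just the ratio of two index readings, less whatever it costs to get in and out. A minimal sketch follows. The index values are hypothetical, normalized so that Q3 2004 equals 100 and chosen only to match the 19.8% and 18.7% gains quoted above; they are not actual Case-Shiller readings. The 6% round-trip transaction cost is likewise an assumed figure.

```python
def holding_return(index, buy, sell, round_trip_cost=0.0):
    """Return from buying at index[buy] and selling at index[sell].

    round_trip_cost is the fraction of sale proceeds lost to commissions,
    fees, and so on.
    """
    gross_ratio = index[sell] / index[buy]
    return gross_ratio * (1.0 - round_trip_cost) - 1.0


# Hypothetical normalized index levels (Q3 2004 = 100).
index = {"2004Q3": 100.0, "2006Q2": 119.8, "2006Q3": 118.7}

print(holding_return(index, "2004Q3", "2006Q2"))
print(holding_return(index, "2004Q3", "2006Q3"))
print(holding_return(index, "2004Q3", "2006Q2", round_trip_cost=0.06))
```

With the assumed 6% cost, the 19.8% gain shrinks to about 12.6%, which is why the transaction costs matter to the bet.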

Friday, February 18, 2011

Chronic Diseases Profiled

Much has been written about the many burdens of chronic disease.  The CDC states that chronic diseases are the leading causes of death and disability in the US.  The Partnership to Fight Chronic Disease and the Milken Institute estimate that chronic diseases like asthma, high blood pressure, heart disease, diabetes, cancer and stroke cost the US economy more than $1 trillion.

Such diseases are known to be correlated with obesity, smoking, and diet and so they are believed to be at least partially preventable.  So prevention has the potential to save the US healthcare system many billions of dollars.

A few weeks ago I posted up some interactive data on diabetics that I had pulled together for a project I was working on at NYU. The data came from MEPS, which the sources above also use in their analyses.  I found that I could make a slight tweak to my diabetes MEPS process to pull out the same information on several other chronic diseases.  And I've posted all these up to separate pages on my blog (upper left).  Here are the conditions available:

Wednesday, January 26, 2011

Interactive Diabetes Profile

I've been working with Prof. Dhar at Stern on applying data mining and machine intelligence technology to healthcare data.  I hope to post more on that soon.  While waiting on the logistics of obtaining the actual health plan data for the project, I used free government data from the Medical Expenditure Panel Survey (MEPS) to familiarize myself with the data mining software tools I'd be using later.  I've had experience working with the cumbersome MEPS data before and it is easily downloaded from their website so this was a natural choice.

I ended up with a pretty interesting dataset containing diabetics and demographically matched non-diabetics in 2008 for comparison.  And I found free access to Zoho Reports to render the data in an interactive form.  Not wanting all that work to go to waste I made the interactive reporting available here on my blog.  There is also a link to it in the new "Pages" section at the top of the righthand column. 

Please enjoy the data!

Wednesday, January 5, 2011

US Auto Market: USA vs Japan since WWII

Ford’s industrial and managerial innovations established the modern automobile industry in the US during the first part of the 20th century.  Coming out of World War II, Japan was just entering the automobile industry anew and was emerging from post-war reconstruction.  Japan's auto industry benefited from the highly protectionist policies the government used to support the domestic industry while it adopted and improved upon existing industry best practices for decades.  By the 1970s and 1980s the US consumer had responded, and by 1990 had handed nearly one quarter of the US auto market to Japanese manufacturers. By 2009 Japanese manufacturers held about 40% of the US market.

One reason for this might be the rapid expansion of productivity in the Japanese auto industry.  Economists use the concept of total factor productivity to ascertain the portion of gains in output attributable to productivity as opposed to those related to the inputs of capital and labor.   Using data from a paper by Dale W. Jorgenson of Harvard and Koji Nomura of Keio University, I plotted "eras" of the Japanese and US auto industries' contribution to each country's average growth in total factor productivity since WWII.  On the same chart I plotted US automobile market share data from Wards Auto over the same period.    
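For readers unfamiliar with total factor productivity, the textbook version of the calculation is the growth-accounting residual: whatever output growth is left over after subtracting the share-weighted growth of capital and labor. The sketch below shows only that textbook form. It is not the Jorgenson and Nomura methodology, which is considerably more detailed, and the input numbers are made up.

```python
def tfp_growth(output_growth, capital_growth, labor_growth, capital_share):
    """Growth-accounting residual.

    All growth rates are fractions per year (0.05 means 5%).
    capital_share is capital's share of income; labor gets the remainder.
    """
    labor_share = 1.0 - capital_share
    return (output_growth
            - capital_share * capital_growth
            - labor_share * labor_growth)


# Hypothetical industry: 5% output growth, 6% capital growth, 1% labor growth.
print(tfp_growth(0.05, 0.06, 0.01, capital_share=0.4))
```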

Thursday, December 16, 2010

Demand Curve for Healthcare in Massachusetts

Here's another post pulling from my assignments at Stern...

A subsidy can be thought of as a negative tax in that it is a payment from the government, either to the consumer or to the supplier of a good, that reduces the price the consumer actually pays below the price the supplier actually receives.  The Patient Protection and Affordable Care Act of 2010 (PPACA) provides for subsidies to small businesses to encourage them to offer health insurance coverage to their employees, and direct subsidies to individuals who are neither in a government insurance program nor offered insurance by their employer.  You can read more about this on Wikipedia here.  Taken together, these subsidies are likely to have the classic economic effect on the price and quantity of healthcare delivered in the US (see Wikipedia here for a more in depth discussion on the effects of subsidies).

When the government intervenes in a market to provide subsidies to the consumers of a good, the consumers experience a lower price and demand increases.  At the same time the suppliers receive a higher price and hence they supply more.  The economic benefit of the subsidy is shared between consumers and suppliers, but not necessarily equally.  The elasticities of supply and demand determine how it is shared. 

Massachusetts enacted a subsidy similar to PPACA back in 2006, offering an opportunity to measure the price elasticity of demand for healthcare.  The Medical Expenditure Panel Survey (MEPS) data for Massachusetts for 2005, 2006 and 2007 show the number of individuals with a healthcare expense (a proxy for quantity demanded) and their mean out-of-pocket healthcare expense (a proxy for the price).  In the chart below I plot these points to show the downward sloping demand curve.  I also show the price elasticity of demand between 2005 and 2006 to be -1.95 or relatively elastic, and the price elasticity of demand between 2006 and 2007 to be -0.18 or relatively inelastic.

But because 2006 was the year of implementation for the Massachusetts subsidy it might not be a good year to measure the effects.  So in a separate calculation I compared 2005 (the year before Massachusetts subsidized healthcare) with 2007 (the year after) and found the price elasticity of demand to be -0.56. Again, relatively inelastic. 
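For anyone who wants to repeat the exercise, elasticity between two observations can be computed as the percentage change in quantity divided by the percentage change in price. The sketch below uses the midpoint (arc) formula, which is one common convention; the post does not say which variant produced the figures above, so treat that choice as an assumption. The quantities and prices are made up, not the MEPS values.

```python
def arc_elasticity(q1, p1, q2, p2):
    """Midpoint (arc) price elasticity of demand between two observations."""
    pct_change_q = (q2 - q1) / ((q1 + q2) / 2.0)
    pct_change_p = (p2 - p1) / ((p1 + p2) / 2.0)
    return pct_change_q / pct_change_p


def describe(elasticity):
    """Label demand as relatively elastic or inelastic."""
    return "relatively elastic" if abs(elasticity) > 1.0 else "relatively inelastic"


# Hypothetical: price falls from 10 to 8 while quantity rises from 100 to 120.
e = arc_elasticity(q1=100.0, p1=10.0, q2=120.0, p2=8.0)
print(e, describe(e))
```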

Inelastic demand means a steeper demand curve, which implies a larger benefit from the healthcare subsidy should flow to consumers relative to suppliers.  Drawing from this example, one might expect a similar effect for PPACA at the US level, and more of the benefit of the proposed healthcare subsidy should accrue to the consumers of healthcare versus the payors, providers, physicians, pharma and other suppliers of healthcare.
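How the benefit splits depends on the supply side too. A standard textbook approximation, valid for a small subsidy and roughly linear curves near the equilibrium, is that the consumers' share equals the supply elasticity divided by the sum of the supply elasticity and the absolute demand elasticity. The sketch below applies it to the -0.56 figure from above together with a supply elasticity of 1.0, which is purely an assumed value since the post does not estimate one.

```python
def consumer_share_of_subsidy(demand_elasticity, supply_elasticity):
    """Approximate fraction of a small subsidy's benefit going to consumers."""
    demand = abs(demand_elasticity)
    if demand + supply_elasticity == 0:
        raise ValueError("both elasticities are zero; the split is undefined")
    return supply_elasticity / (supply_elasticity + demand)


# -0.56 is from the 2005 vs 2007 comparison; 1.0 is a hypothetical supply elasticity.
print(consumer_share_of_subsidy(-0.56, 1.0))
```

With these inputs the consumers' share comes out at roughly 64%; the more inelastic the demand relative to supply, the larger that share becomes.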

By design or by accident?

Sunday, November 21, 2010

Resource Curse

Here's another Stern homework assignment I thought was pretty interesting. This time I plotted per capita GDP against per capita natural resources by country for the year 2000.  Both data sets come from the World Bank, which has a pretty nifty Data Visualizer that is reminiscent of Hans Rosling's Gapminder.  You can see Hans' inspirational TED talk here.

My chart shows that just two countries, Canada and Norway, have been able to achieve more than $20k in per capita GDP when more than $30k in per capita natural resources are available.  Why might this be?  Wikipedia has a good discussion of the phenomenon.


You can see the data behind the chart here.

Saturday, November 13, 2010

The Drug Industry vs. Cardiovascular Disease

I attended a symposium at NYU where venture capitalist Dr. John Freund of Skyline Ventures gave the keynote.  He used a chart from the CDC to illustrate the benefits of innovation in the pharmaceutical industry, even though it doesn't actually show any temporal relationship between specific pharmaceutical innovations and the dramatic decline in death rates related to cardiovascular disease and stroke (cerebrovascular disease).

I thought it would be more compelling to overlay the introduction of new classes of cardiovascular drugs on this data to see if the relationship would be obvious.  I used the same sources of data (here and here) for the death rates, but I wasn't able to do the age adjustment from the CDC's chart.  I don't think this makes the visualization any less compelling.  I downloaded data from the FDA to overlay the innovations on the timeline of death rates.


You can see that the steep rise in cardiovascular disease death rates only began to flatten out once pharmaceutical innovation began.  Additional rounds of innovation eventually led to steep declines in the death rates.

I really like the technique of overlaying qualitative information on top of trends in data.  I think it can be a very powerful way to visualize non-quantitative information.

Monday, November 8, 2010

Chile's Chicago Boys

In one of my b-school courses we took a look at Chile's economic development.  Lots of stuff on Wikipedia you can look at (here, here, and here).  Fans of economist Milton Friedman like to use Chile as an example of how neo-liberal (liberal in the classical sense of the term) economic policies "work".

For a homework assignment I plotted both real per capita GDP and the Gini coefficient over time, and I overlaid relevant policy eras.  The Gini coefficient measures income inequality: the higher the number, the more polarized the income distribution.
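For readers who haven't met the Gini coefficient before, here is a minimal sketch of one standard way to compute it from a list of individual incomes. It ranges from 0 (everyone earns the same) toward 1 (one person earns everything). The incomes are made up, and published figures like the ones in my chart come from survey data with adjustments this sketch doesn't attempt.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted sum: poorest person has rank 1, richest has rank n.
    rank_weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * rank_weighted / (n * total) - (n + 1.0) / n


print(gini([10.0, 10.0, 10.0, 10.0]))  # perfectly equal
print(gini([0.0, 0.0, 0.0, 40.0]))     # one person has everything
```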

Many economists think that income inequality rises during periods of intense economic development but then returns to more normal levels once the economy matures.  During the reign of the Chicago Boys, the Gini coefficient did rise but per capita GDP did not demonstrate any sustainable progress.  After a Latin American debt crisis in 1982 that hit Chile hard even though its finances were largely in order, Chilean dictator Pinochet replaced the Chicago Boys with a more moderate and pragmatic economic team.  Since then Chile has experienced sustainable growth in real per capita GDP, but stubbornly high (though flat) income inequality as measured by the Gini coefficient.  Chile had been tacking to the left politically until earlier this year and had been very gradually moderating its economic policies since Pinochet departed.  Recently, income inequality is thought to have moderated as a result of these policies, though the CIA still ranks Chile as having the 14th highest income polarization in the world.

The interesting thing about the course is that it points out how unique the US is in that it was able to develop economically while being politically free.  This is quite rare.  Most countries need to get lucky with a benevolent dictator who imposes economic order for a period (even if there is a massacre here or there) and then peacefully steps aside at just the right time.  Once the country achieves a certain degree of stability under authoritarian rule, then in order for growth to continue legitimacy is needed - the kind that only comes from a freely elected government - and its richer population can afford to give itself the privilege of increasing political and social freedoms.

Saturday, October 23, 2010

Clothiers Play to Whites' Waist Delusions

This interesting graphic from Esquire has been circulating on the internet for a while.  It shows how some clothiers build a lot of extra inches into the actual waists of their 36" sizes.

I was curious as to why this might be, though I know I feel better about myself when I put on a pair of pants with a reported waist size much smaller than the reality of my waist. Quantcast lets you type in a website and see the demographics of the visitors of that website.  It gives an index value for each demographic attribute that shows how different the attribute is from the average internet user.  A value of more than 100 means that the website's visitors are more likely to have that attribute than the average internet user, and a value of less than 100 means the opposite.  Here's what it looks like for

I ran the Quantcast charts for each of the websites of the clothiers in Esquire's graphic, and I noticed that the less likely the visitors to a clothier's website are to be Caucasian, the fewer inches need to be removed from the advertised waist size.  I put this in a scatter plot and let Excel draw the regression line.  While there aren't a lot of data points here, there's a pretty clear pattern nonetheless: whiter customers like to be lied to about the actual size of the pants they put on.
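An index of the kind described above, where 100 means "same as the average internet user", can be thought of as a ratio of two shares scaled by 100. The sketch below shows that reading. The exact formula Quantcast uses is not something I've confirmed, so treat the function as an assumption, and the shares are made up.

```python
def affinity_index(site_share, internet_share):
    """Index where 100 means the site's audience matches the internet average.

    site_share: fraction of the site's visitors with the attribute.
    internet_share: fraction of all internet users with the attribute.
    """
    return 100.0 * site_share / internet_share


# Hypothetical: 90% of a site's visitors have an attribute vs. 75% of all users.
print(affinity_index(0.90, 0.75))
```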

Meanwhile, I'm heading into Old Navy!