Beware of version updates!

May 10, 2016 by Jennifer Ailshire

Data management of large, complex surveys like the HRS is a fairly difficult and time-consuming task. The HRS staff endeavors to produce a public-release data set as soon as possible (and they produce them much faster than many other publicly funded data collections!), so they release an early version of the data: Core Early Release (V1.0)*. The updated RAND files tend to follow soon after.

HRS does an early release to get the data to the public as fast as possible, but HRS staff continues to process the data until they have a final data release. Sometimes the early release is ultimately designated as the final release – you’ll know this is the case if you see “Final V1.0” followed by the date of release. But sometimes there are issues in the data that need to be resolved. These tend to be the result of programming errors, but there are sometimes problems with the data that the HRS staff catch during data inspections after the early release (e.g., a case is designated as a non-sample member after closer inspection). So, in some years the final release is a V1.0, but in other years it may be a V2.0 or V3.0. And in rare cases the final release is a V4.0 or V5.0.

What is the significance of all of this for the user? Well, if you’ve downloaded an early release version of the data that then underwent updates affecting variables you work with, you are effectively using the wrong data set. This has happened to me twice. Most recently, I was conducting some analysis and puzzling over why I was dropping people under age 56 from my models (using 2010 data that should have had the full age range). After a frustrating 10 minutes or so of inspecting the data, I determined that these people were missing values on my dependent variable. ‘Did HRS decide not to ask this question of the refresher cohort?’ I wondered. So I checked the online codebook… and found frequencies that did not match those in my data. Because it had happened to me once before that the codebook frequencies didn’t match those in my version of the data (and I learn from my mistakes, thankfully), I immediately checked the version number on the website. I was surprised to see V5.0 (they don’t typically go that high), but I had no idea what version I was working with. What I did know was the date associated with the data I was working with, which was prior to the date shown for the V5.0 data release. Aha! I had the wrong version of the data: I had never downloaded the final release data set. So I downloaded the final V5.0 version and my analysis issue was resolved.

As users we agree to a compromise with HRS. They get the data into our hands as fast as possible, but as an early release that’s subject to change. It’s our responsibility to check for updates to the data and download the final release. I’m sure for most users this will be a non-issue because they likely don’t access the data until the final version is available. But for those of us who look forward to the newest wave of data with great anticipation, it would be wise to take a moment to make sure we’re ultimately working with the final version.
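For anyone who wants to make the two checks I stumbled into a habit, here is a minimal Python sketch. It assumes you have typed in the frequencies from the online codebook by hand and that you know the date on your local file; the function names, the example codes, and the dates are all hypothetical and are not part of any HRS or RAND tool.

```python
from collections import Counter
from datetime import date


def frequency_mismatches(values, codebook_counts):
    """Return {code: (observed, expected)} for each code whose count differs.

    values          -- the variable as it appears in your copy of the data
                       (use None for missing values)
    codebook_counts -- {code: count} typed in from the online codebook
    """
    observed = Counter(values)
    codes = set(observed) | set(codebook_counts)
    return {
        code: (observed.get(code, 0), codebook_counts.get(code, 0))
        for code in codes
        if observed.get(code, 0) != codebook_counts.get(code, 0)
    }


def predates_release(local_file_date, final_release_date):
    """True if your copy is older than the final release shown on the website."""
    return local_file_date < final_release_date


# Made-up illustration: two respondents are missing in the local copy,
# but the codebook reports no missing values and more cases coded 2.
local_values = [1, 2, 2, None, None]
codebook = {1: 1, 2: 4}
print(frequency_mismatches(local_values, codebook))

# Made-up dates: a local file older than the posted final release.
print(predates_release(date(2012, 1, 15), date(2014, 6, 1)))
```

Any non-empty result from the first check, or a True from the second, is a cue to go back to the website and compare version numbers before doing anything else.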

*Other data files are released thereafter, including the exit and post-exit files, tracker file, and biomarker file.

Using the HRS

Insights and advice for new and seasoned users of the Health and Retirement Study
