Personal Analytics 101: How to Deal with Holes In Your Self-Tracking Data
It happens to all of us: one day you forget to charge your Zeo or Bodymedia, go out of town on a business trip, come down with nasty flu, or simply forget to fill out a section in your self-questionnaire. Your self-tracking routine somehow gets interrupted, and some questions are left unanswered. As a result, your self-tracking data now has some gaps and holes. In this post, I will discuss why you should not be really concerned about it, and how to correctly “patch up” your data, if missing values still bother you.
Basically, when it comes to missing values in your self-tracking data, you have three following options: to fill in the gaps by giving bogus answers, leave data missing as it is, or use simple statistical procedure to “restore” the answers. Let’s take a look at each of the options.
Option 1: Make Up Data
Actually, this one should never be an option. Cheating is never good, but why would you lie to yourself? Making up answers is especially bad if you plan on looking for meaningful patterns in your data. Most times, self-trackers deal with small number of observations, and adding as much as one random answer may introduce considerable amount of “noise” in your data and affect the insights.
Option 2: Leave It as It Is
Believe it or not, but in most cases, leaving those “holes” in your dataset is ok! Excel and other software usually exclude missing values from the analysis, as long as you code them correctly. In Excel, for example, missing values can be logged as blank spaces or any non-numeric values. One thing, however, is important: you should never use zeros (0), because they will be included into calculations. Here is just one quick example of how using zeroes instead of missing values can considerably change the results. Imagine that you have been tracking your sleep with Zeo for seven consecutive nights, Monday through Sunday. On Thursday morning, you forgot to charge Zeo headband, and as a result, you ended up with missing ZQ score for that night: 100,80,90,?,120, 60, 70. Later you turn to Excel to compute your average ZQ score for that week. If you log ZQ’s value for that night as blank space, X (or period, or any other non-numeric value), you will end up with the average weekly ZQ score of 86.7. If you, however, enter 0 (zero) for Thursday, your average score will drop to 74.3. That’s because Excel interpreted 0 as a real score (i.e., you did not sleep at all), and included it into formula. So remember, never use 0 to record the missing value, leave it blank instead.
Option 3: Patch It Up
Of course, you may have some reasons for not wanting those missing values in your data. For instance, you would like to visualize your sleep data for that week, and the missing Thursday night’s sleep score will leave unsightly gap in your graph. In that case, you can use two statistical methods to patch up your data.
Mean/Median/Mode Imputation is a relatively simple procedure. You fill in the missing cell with the most expected value. Depending on type of variable you are working with, the most expected value would be either arithmetic average (mean), median, or mode. If you are dealing with the numeric variable, you can use mean or median of that variable to fill in the missing value. I personally prefer median, because most of the time we are dealing with non-parametric data. In an example with Zeo, I would enter all numbers into Excel, and then use formula MEDIAN() to compute the median of 100,80,90,120, 60, 70. The resulting value (80) can be used as a ZQ score for Thursday. Voila! You have no more gaps! In case of categorical variables, you would need to use mode (MODE in Excel): the most frequent value.
Ascription method is more suitable when you want not only to fill-in the missing value, but also preserve the potential relationship of that variable with other variables. This method is a bit more complicated, and it works best when you have accumulated a lot of data. Let’s say you track variables A, B, C, X, Y and Z on a daily basis. If for a particular day (e.g., “sick day”), variable X happened to have a missing value, you can go through your records, find a day that has values of A, B, C, Y and Z as close as possible to the values on a “sick day”, and use that value to fill in the cell. Let’s look at the theoretical example . Imagine that in addition to sleep, you also track number of steps, calories burned, mood, and how tired you feel before going to bed:
Please note that this is not real data I simulated these numbers for illustrative purposes. As you can see, something happened on January 8th, and I did not record my Zeo data. But I can use ascription method to fill-in the missing value. By comparing the records, I find another date that has steps, calories, mood and energy levels data as close to those from January 8th as possible. That date is January 2nd, so I can take the ZQ value from that day (110) and use it to fill-in the cell for January 8th. If you look at the numbers closer, you may see that there is a complex pattern of relationship between the ZQ score, number of calories burned, tiredness, and mood. The mean/median substitution would most likely ignore this relationship. The ascription method can be applied to categorical data, too. In fact, if you have lots of data (e.g, 5 or 6 months worth), you may “ascribe” the entire day (i.e., all variables A through Z), provided that your routines were not different on that day.
You should also remember that imputation and ascription methods are based on the assumption that the data is missing at random. In other words, WHAT you forgot to log is not related somehow to WHY you did not log it. If, for example, you often forget to log your mood on Saturdays, simply using average/median monthly mood rating to fill-in the missing values would not be appropriate. You should use whatever Saturday mood data available to compute the median, or apply ascription method but use records only from other Saturdays. That’s because there is a certain connection between the day being a Saturday, and you forgetting to log your mood.
Finally,both imputation and ascription methods should be used at the end of data collection (e.g., end of the month), after you collected enough data. The more complete data you have, the more reliable . Feel free to drop me a line if you have any questions!