Missing, blank, or not-applicable data is important consideration for any data management system. This article describes in detail how ActivityInfo addresses "missing-ness" and how it compares to other systems and programming languages.
Introduction
ActivityInfo, like many data management systems, works exclusively with "typed" values. All fields and formula expressions in ActivityInfo have a type that determines the range of allowable values. A quantity field can never have the value "Bob", and a date field will never have the value 42. The same is true for a vector in R, or a column in a relational database column, or a SPSS variable.
However, there may be many reasons that a value might be absent:
- A field might not be relevant for every record or row. For example, it might be the case that only refugees have a UNHCR registration number.
- A field might not be required. Not all respondents to a questionnaire may have a mobile phone number.
- A field might be added after the start of data collection, and we have no information on previously-collected subjects.
- A survey respondent may refuse to answer specific questions, or may not know.
- There might be an error calculating a value, for example, if you divide a number by zero, or provide an invalid regular expression to
REGEXMATCH
For this reasons, each ActivityInfo formula type includes a special "blank" value. A numeric expression, for example, can be equal to any real number, or it can be blank. A date expression, can be equal to any date between 1000 CE and 9999 CE, or it can be blank.
Calculations involving blank values
In ActivityInfo, formulas, you can test for blank values using the ISBLANK
. But blank values can also find their way into calculations. For example, you might have a form with two quantity fields for the count of COUNT_WOMEN
and COUNT_MEN
, as well as a calculated field TOTAL
field with the formula COUNT_WOMEN + COUNT_MEN
.
What happens if some records contain blank values for WOMEN
, MEN
or both? This is a design choice: in Excel, blank cells are treated as zeros in the context of arithmetic. In SQL, however, NULL
plus any other number is NULL
, and in R, NA
plus any number is still in NA
.
ActivityInfo follows Excel's convention in most case, which we judged to be more forgiving for non-technical users. So for the example above, the results might look like this:
Health Center | WOMEN | MEN | WOMEN + MEN | ISBLANK(WOMEN + MEN) |
---|---|---|---|---|
A | 300 | 150 | 450 | FALSE |
B | 160 | 160 | FALSE | |
C | 75 | 75 | FALSE | |
D | TRUE | |||
E | 220 | 90 | 310 | FALSE |
As you can see in the table above, the TOTAL
calculated field is only blank when both WOMEN
and MEN
are blank.
Empty text strings
In some programming languages, there is a difference between an empty text string (""
) and NULL
or missing value. For example in SQL, "" IS NULL
is false, and in R, is.na("")
is also false.
This distinction can often trip up users without a programming background, and so ActivityInfo avoids this distinction: an empty text string ""
is identical to a blank text value:
ISBLANK("") == TRUE
During data collection, any input to a text field will be stripped of leading and trailing spaces. That means if you type only spaces into a text field, the field's value will be blank:
Leading and trailing space is not stripped from values in formulas, which means that text with only whitespace is not considered blank:
ISBLANK(" ") == FALSE