## Featured post

### Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

## Saturday 23 March 2019

### What makes a good data dictionary?

A data dictionary is a guide, external to the dataset in question, that explains what each variable is in a human-readable format. In R, the programming equivalent of a data dictionary is the result of the str() function, which will show the first few values of each variable, the format (e.g. numerical, string, factor), and other important information (e.g. the first few levels of the factor).

A data dictionary may include software-specific features, but it should still be value to anyone using the dataset, regardless of the software they are using.

Other than that, what makes a good data dictionary?

#### General Features:

- Variable names. This is the variable names as they show up in the dataset itself. They are typically in the order of the columns of the dataset.

- Label / Description. This is where most of the human-readable information about the variable should be. It doesn’t have to be a complete sentence. It should include at LEAST as much information as a human would need to describe a distribution of the dataset without looking anything up.

- Values describes what individual values of the variables could look like.

- The SQL table where the variable is found.

- The person who created or recorded the variable, in the case of projects with multiple authors.

- Additional specifics like the type of date and time format used, or the maximum number of characters allowed (in the case of string or text data).

- Additional calculation information for derived variables.

 Source: http://biostatisticsryangoslingreturns.tumblr.com

#### How much label information is enough?

Imagine you had a dataset from one source, and it was fully labelled and had a well-made data dictionary. Then, later, you find another dataset which looks like it could have additional observations of the same data.

But… the new data has completely different variable names, and variables are in a different order.

What information would you need to be able to add the new data to your original data set? That’s what the description should tell you.

#### Derived variables need additional information.

One special case is derived variables, which are ones that are calculated from existing variables. These should include enough information that someone could calculate these variables again if they were lost (or needed to be calculated for new data).

This includes things like sampling weights, for which calculations are often neglected. Sampling weights can be computed in many different ways, so it’s important to record how something was generated in case you or someone else later needs to recreate it.

#### When do you need to include something more in the ‘values’ section?

1) If the number is a special format like a date, time, or phone number.

2) If the number is used as an ID for an observation. This ID might appear in other datasets with additional information about that observation.

3) If there are certain values that are markers for missing or unusual data, like 998 or -1.

4) If it’s a categorical variable, all the categories should be listed UNLESS there are many categories (e.g. species of insect if there are 20 species), or the categories come from a standard, well established list (e.g. countries of the world).

It’s fine to include the entire Likert scale if you’re using one. If you want to save space, you can describe the Likert scale you’re using once and just say “see variable XXXXX” for other variables using the same scale.

#### Example 1: From a study on recruitment of people in a sexual network to track HIV.

For variables that require extra explanation, such as how they were derived from original survey data, an additional feature column is used called “notes”.

 Variable Label or Description (* for inferred) Values QUDATE Visit Date Datetime QUGRP Group All values are 8 COUPID Coupon ID Unique Identifier, 30000-34500 DESEX Respondent's Sex 1 = Male, 2 = Female, (998, 4e4) = Missing RMPANR Steady Male Partners Last 6mos Count RFPANR Steady Female Partners Last 6mos Count CMPANR Casual Male Partners Last 6mos Count CFPANR Casual Female Partners Last 6mos Count BMPANR Number of Males bought sex from Last 6mos Count BFPANR Number of Females bought sex from Last 6mos Count SMPANR Number of Males sold sex to Last 6mos Count SFPANR Number of Females sold sex to Last 6mos Count HIV HIV Status Neg, Pos, Blank obsNET Observation Number Unique Identifier, 1-1050 HeteroRel_M * # of Non-Commercial Hetero Partners, Men Count, Derived HeteroRel_F * # of Non-Commercial Hetero Partners, Women Count, Derived CommRel_M * # of Commercial Hetero Partners, Men Count, Derived CommRel_F * # of Commercial Hetero , Women Count, Derived MSMRel * # of Male Partners of Men Count, Derived hetero * Indicator, any hetero sexual partners last 6 months 0 = No, 1 = Yes comm * Indicator, any commercial sexual partners 0 = No, 1 = Yes homo * Indicator, any homosexual partners last 6 months 0 = No, 1 = Yes

#### Example 2: From data tracking player engagement and achievement in a mobile game.

This data dictionary is designed for use in an SQL relational database. In the figure, additional features include the data table that these variables are found in, the SQL format (note that ‘string’ is recorded as ‘varchar’).

 Variable Name Description data_type_id Metadata, what table this data is for timestamp Time and day event was reported timestamp_date_id Day event was reported achievement_id In-game achievement current_pct Cumulative progress to achievement delta_since_last_submit Change in progress to achievement sponsor_id Brand channel attached to user sdk_version Version of SDK used game_version Version of developer's game used os_version Version of mobile operation system hw_model Hardware model of mobile device user_id Registration ID number of the user ts_of_first_play Timestamp of this user's first play total_num_of_game_played # of sessions user has had of this game total_duration_of_game_played play time in seconds total ts_of_last_successful_submit Timestamp of the report previous to this one from this user-game-type

#### Example 3: Health Survey in South Africa

Notice how in the figure, there is an extra feature column for SAS format, so people working with the data in SAS can better track what’s going on. These variables are selected from a data dictionary which had over 900 lines of information. It includes pre-designed questionnaires like the food insecurity questionnaire, the CES-D depression scale, a self esteem scale, and an alcoholism scale. For brevity in both the figure and the text, the possible categorical responses have been greatly shortened.

 Variable Label Coding RESPID Respondent ID Checkbox (see Label) COLLID Collector ID Checkbox (see Label) START_DATE Start Date of Interview (SAS format) Days since Jan 1, 1960 CELLDAY How much time do you spend using a cell phone? 0-1 hours = 1, 2-4 hours = 2,… , 10+ hours = 5 DRWATER What is your main source of drinking water? Home tap = 1, Tank = 2, Communal tap = 3,… DRWATER_ Other (please specify) Written. _NOFOOD Ever no food at all in your household? Never = 2, Rarely = 3, Often = 4 _HUNGRY Go to sleep at night hungry because not enough food? Never = 2, Rarely = 3, Often = 4 _DYNOEAT Go a whole day without eating because not enough food? Never = 2, Rarely = 3, Often = 4 FoodIS Food Insecurity Max of values like _NOFOOD, _HUNGRY… Low = 2, Moderate = 3, High = 4 ADLTO20C Any adults aged 20 years or older living with you currently? Yes = 1, No = 3, Unsure = 7, No answer = 99. MOMALIVE Is your mom alive? Yes = 1, No = 3, Unsure = 7, No answer = 99. ALC6M Have you had alcohol to drink in the past 6 months? Yes = 1, No = 3, Unsure = 7, No answer = 99. ALCFREQ How often do you drink alcohol?