Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

Saturday 23 March 2019

What makes a good data dictionary?


A data dictionary is a guide, external to the dataset in question, that explains what each variable is in a human-readable format. In R, the programming equivalent of a data dictionary is the result of the str() function, which will show the first few values of each variable, the format (e.g. numerical, string, factor), and other important information (e.g. the first few levels of the factor).

A data dictionary may include software-specific features, but it should still be value to anyone using the dataset, regardless of the software they are using.

Other than that, what makes a good data dictionary?




General Features:


- Variable names. This is the variable names as they show up in the dataset itself. They are typically in the order of the columns of the dataset.

- Label / Description. This is where most of the human-readable information about the variable should be. It doesn’t have to be a complete sentence. It should include at LEAST as much information as a human would need to describe a distribution of the dataset without looking anything up.

- Values describes what individual values of the variables could look like.

Additional features can include…

- The SQL table where the variable is found.

- The person who created or recorded the variable, in the case of projects with multiple authors.

- Additional specifics like the type of date and time format used, or the maximum number of characters allowed (in the case of string or text data).

- Additional calculation information for derived variables.


Source: http://biostatisticsryangoslingreturns.tumblr.com


How much label information is enough?


Imagine you had a dataset from one source, and it was fully labelled and had a well-made data dictionary. Then, later, you find another dataset which looks like it could have additional observations of the same data.

But… the new data has completely different variable names, and variables are in a different order.

What information would you need to be able to add the new data to your original data set? That’s what the description should tell you.

Derived variables need additional information.


One special case is derived variables, which are ones that are calculated from existing variables. These should include enough information that someone could calculate these variables again if they were lost (or needed to be calculated for new data).

This includes things like sampling weights, for which calculations are often neglected. Sampling weights can be computed in many different ways, so it’s important to record how something was generated in case you or someone else later needs to recreate it.

When do you need to include something more in the ‘values’ section?


1) If the number is a special format like a date, time, or phone number.

2) If the number is used as an ID for an observation. This ID might appear in other datasets with additional information about that observation.

3) If there are certain values that are markers for missing or unusual data, like 998 or -1.

4) If it’s a categorical variable, all the categories should be listed UNLESS there are many categories (e.g. species of insect if there are 20 species), or the categories come from a standard, well established list (e.g. countries of the world).

It’s fine to include the entire Likert scale if you’re using one. If you want to save space, you can describe the Likert scale you’re using once and just say “see variable XXXXX” for other variables using the same scale.


Example 1: From a study on recruitment of people in a sexual network to track HIV.

For variables that require extra explanation, such as how they were derived from original survey data, an additional feature column is used called “notes”.



Variable
Label or Description (* for inferred)
Values
QUDATE
Visit Date
Datetime
QUGRP
Group
All values are 8
COUPID
Coupon ID
Unique Identifier,
30000-34500
DESEX
Respondent's Sex
1 = Male, 2 = Female,
(998, 4e4) = Missing
RMPANR
Steady Male Partners Last 6mos
Count
RFPANR
Steady Female Partners Last 6mos
Count
CMPANR
Casual Male Partners Last 6mos
Count
CFPANR
Casual Female Partners Last 6mos
Count
BMPANR
Number of Males bought sex from Last 6mos
Count
BFPANR
Number of Females bought sex from Last 6mos
Count
SMPANR
Number of Males sold sex to Last 6mos
Count
SFPANR
Number of Females sold sex to Last 6mos
Count
HIV
HIV Status
Neg, Pos, Blank
obsNET
Observation Number
Unique Identifier, 1-1050
HeteroRel_M
* # of Non-Commercial Hetero Partners, Men
Count, Derived
HeteroRel_F
* # of Non-Commercial Hetero Partners, Women
Count, Derived
CommRel_M
* # of Commercial Hetero Partners, Men
Count, Derived
CommRel_F
* # of Commercial Hetero , Women
Count, Derived
MSMRel
* # of Male Partners of Men
Count, Derived
hetero
* Indicator, any hetero sexual partners last 6 months
0 = No, 1 = Yes
comm
* Indicator, any commercial sexual partners
0 = No, 1 = Yes
homo
* Indicator, any homosexual partners last 6 months
0 = No, 1 = Yes




Example 2: From data tracking player engagement and achievement in a mobile game.


This data dictionary is designed for use in an SQL relational database. In the figure, additional features include the data table that these variables are found in, the SQL format (note that ‘string’ is recorded as ‘varchar’).




Variable Name

Description
data_type_id
Metadata, what table this data is for
timestamp
Time and day event was reported
timestamp_date_id
Day event was reported
achievement_id
In-game achievement
current_pct
Cumulative progress to achievement
delta_since_last_submit
Change in progress to achievement
sponsor_id
Brand channel attached to user
sdk_version
Version of SDK used
game_version
Version of developer's game used
os_version
Version of mobile operation system
hw_model
Hardware model of mobile device
user_id
Registration ID number of the user
ts_of_first_play
Timestamp of this user's first play
total_num_of_game_played
# of sessions user has had of this game
total_duration_of_game_played
play time in seconds total
ts_of_last_successful_submit
Timestamp of the report previous to this one from this user-game-type






















 Example 3: Health Survey in South Africa


Notice how in the figure, there is an extra feature column for SAS format, so people working with the data in SAS can better track what’s going on. These variables are selected from a data dictionary which had over 900 lines of information. It includes pre-designed questionnaires like the food insecurity questionnaire, the CES-D depression scale, a self esteem scale, and an alcoholism scale. For brevity in both the figure and the text, the possible categorical responses have been greatly shortened.


Variable
Label
Coding
RESPID
Respondent ID
Checkbox (see Label)
COLLID
Collector ID
Checkbox (see Label)
START_DATE
Start Date of Interview (SAS format)
Days since Jan 1, 1960
CELLDAY
How much time do you spend using a cell phone?
0-1 hours = 1, 2-4 hours = 2,… , 10+ hours = 5
DRWATER
What is your main source of drinking water?
Home tap = 1, Tank = 2, Communal tap = 3,…
DRWATER_
Other (please specify)
Written.



_NOFOOD
Ever no food at all in your household?
Never = 2, Rarely = 3, Often = 4
_HUNGRY
Go to sleep at night hungry because not enough food?
Never = 2, Rarely = 3, Often = 4
_DYNOEAT
Go a whole day without eating because not enough food?
Never = 2, Rarely = 3, Often = 4
FoodIS
Food Insecurity Max of values like _NOFOOD, _HUNGRY…
Low = 2, Moderate = 3, High = 4



ADLTO20C
Any adults aged 20 years or older living with you currently?
Yes = 1, No = 3, Unsure = 7, No answer = 99.
MOMALIVE
Is your mom alive?
Yes = 1, No = 3, Unsure = 7, No answer = 99.



ALC6M
Have you had alcohol to drink in the past 6 months?
Yes = 1, No = 3, Unsure = 7, No answer = 99.
ALCFREQ
How often do you drink alcohol?
<Once/Month = 2, <Once/Week = 3, ….
DRUNK6M
How many times in the past 6 months have you been drunk?
<Once/Month = 2, <Once/Week = 3, ….
CUTDOWN
Have you ever felt you should cut down on your drinking?
Yes = 1, No = 3, Unsure = 7, No answer = 99.



_GDQUAL
I feel that I have a number of good qualities.
Strongly Agree = 1,..., Strongly Disagree = 5
_FAILUR2
All in all, I am inclined to think I'm a failure.
Strongly Agree = 1,..., Strongly Disagree = 5
SE_score
Self-Esteem score Weighted sum of _GDQUAL _FAILURE2,
Higher = Higher Self-Esteem




No comments:

Post a Comment