
Tuesday 8 March 2022

Reading Assignment – Collecting Carefully.

This is a reading assignment for “Episode 28: Collect Carefully” of “The Data Science Ethics Podcast”, available at: https://datascienceethics.com/podcast/collect-carefully/ . My eventual hope is to incorporate it into a course on data ethics and (AI) safety, but that's still a long way from being anything solid. From recent interviews with post-secondary institutions, I heard that a lot of schools are looking to incorporate ethics into their stats and data science courses, so I hope this and some of my future posts can contribute to those efforts.

 

For this reading assignment, read the transcript to “Episode 28: Collect Carefully” of “The Data Science Ethics Podcast”, available at: https://datascienceethics.com/podcast/collect-carefully/ and answer the following questions. Answers can be verbatim from the transcript unless they say ‘in your own words’ or are follow-up questions.

There are 8 comprehension-level questions and 3 follow-up questions that require deeper investigation or thought.

 

Note: Make sure to click on the “View full episode transcript” tab inside the page.

 

Note: The transcription on the webpage is good, but imperfect. (e.g., “forcing that feel to be completed” should read “forcing that FIELD to be completed”.) If something doesn’t make sense, the audio download of the podcast episode is available on the webpage too. It’s 15-20 minutes at normal speed.

 

Q1. In your own words, what does “pre-hoc” mean?

 

Q2. What are two reasons that you don’t include data/variables just for the sake of having them? (i.e., including them without having a good reason)

 

Q3. What is an example of a variable that could be used as a proxy for race?

 

Q3F (follow-up). What are two other variables not mentioned in the transcript that could be used as a proxy for a protected class? (e.g., sex, gender, ethnicity, religion, sexual orientation, age above 45)

 

Q4. What is a common problem with AI and machine learning algorithms that were trained on visual data of humans?

 

Q4F (follow-up). Describe in your own words the “Google Gorillas” problem being alluded to in this transcript. (There is another episode transcript about the problem at http://datascienceethics.com/podcast/google-gorilla-problem-photo-tagging-algorithm-bias/ , but there are many other sources about this.)

 

Q5. What is an example of an “adversary” problem that can come from retail worker incentives to collect data, such as emails?

 

Q6. What does Marie imply by using the term “clean data”?

Q6F (follow-up). Compare this definition of “clean data” to the tidyverse definition of “tidy data”. (How are these two definitions different, despite sounding so similar?)
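(Hint for Q6F: “tidy data” is about layout, not accuracy. In the tidyverse definition, a table is tidy when each variable is a column and each observation is a row. A table can be perfectly clean, i.e., free of errors, and still be untidy. The sketch below is a hypothetical illustration in plain Python; the survey data and variable names are made up for the example.)

```python
# Hypothetical survey scores stored in "wide" form, one key per year.
# The year headers ("2020", "2021") are values, not variables, so this
# layout is NOT tidy -- even though every entry is accurate ("clean").
wide = [
    {"respondent": "A", "2020": 3, "2021": 4},
    {"respondent": "B", "2020": 5, "2021": 2},
]

# Tidy form: each variable (respondent, year, score) is its own column,
# and each observation (one respondent in one year) is its own row.
tidy = [
    {"respondent": row["respondent"], "year": year, "score": row[year]}
    for row in wide
    for year in ("2020", "2021")
]

for record in tidy:
    print(record)
```

(In R, the same reshaping is what tidyr’s pivot_longer does; “cleaning” the data, by contrast, would mean fixing wrong or inconsistent values, which no reshaping step can do.)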

 

Q7. What are two ways that the design of your survey can affect the answers that you collect?

 

Q8. What are some ways that sensor data can become biased?

 

(Message me at mj2davis@uwaterloo.ca for an answer key)
