Week 3

Initial Data Analysis

Before conducting formal statistical analyses, a consultant must understand the structure and quality of the data. This week focuses on initial data analysis, including identifying missingness, outliers, implausible values, and inconsistencies across measures. Careful early exploration prevents downstream errors and strengthens the consulting process.

By the end of this week you will be able to:

  1. Identify common data quality issues that arise in real studies.
  2. Articulate how initial data exploration informs a consulting analysis plan.

Data Dictionary.
Data File.

Before Class

Data Quality Control Exercise

Before beginning formal analysis, it is essential to check the quality and structure of your data. While study staff may perform data cleaning, this responsibility can vary across collaborations. As consultants, we must be prepared to evaluate data critically before proceeding.

This week's dataset comes from an intervention examining whether moving to an "active" community (treatment) is associated with differences in calorie consumption, physical activity, and BMI.

Physical activity is measured in three ways:
* Accelerometer data, which records minutes of moderate-to-vigorous physical activity (MVPA) using a wearable device.
* The ARIZONA survey, which estimates hours spent in activity categories based on MET classifications.
* Self-reported activity variables (beginning with "time_"), which capture reported time spent in specific activity types.

You will explore these measures, along with dietary intake and BMI, to evaluate data quality and identify potential concerns before modeling.

💻 You will be assigned to one of the following IDA steps. Be prepared to share your findings with the class during our next session.

Task 1. Understand Data Structure

Look at:

  • Number of rows/columns
  • Variable types (numeric, categorical)
  • Range of each variable
  • Variable units

Produce a summary table of variable types and ranges.
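
As a starting point, the summary table can be sketched with the Python standard library alone. This is a minimal sketch, not the required approach: the column names (`treatment`, `bmi`, `calories`, `mvpa_min`) and the values are made-up stand-ins, so check the data dictionary for the real names and units, and feel free to use R or pandas instead.

```python
import csv
import io

# Toy stand-in for the course data file; column names and values are made up.
RAW = """id,treatment,bmi,calories,mvpa_min
1,active,24.1,2100,35
2,control,31.7,1850,
3,active,27.3,450,60
4,control,,2400,12
"""

def to_number(text):
    """Return a float if the text parses as a number, else None."""
    try:
        return float(text)
    except ValueError:
        return None

def summarize(rows):
    """Build one summary record per column: inferred type, count, and range."""
    summary = {}
    for col in rows[0].keys():
        present = [r[col] for r in rows if r[col] != ""]  # empty string = missing
        numbers = [to_number(v) for v in present]
        if present and all(n is not None for n in numbers):
            summary[col] = {"type": "numeric", "n": len(present),
                            "min": min(numbers), "max": max(numbers)}
        else:
            summary[col] = {"type": "categorical", "n": len(present),
                            "levels": sorted(set(present))}
    return summary

rows = list(csv.DictReader(io.StringIO(RAW)))
table = summarize(rows)
print(f"{len(rows)} rows x {len(rows[0])} columns")
for name, info in table.items():
    print(name, info)
```

To run this on the real file, replace `io.StringIO(RAW)` with an opened file handle. Note that an ID column will be reported as numeric even though its range is not meaningful.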

Task 2. Quantify Missingness

Look at:

  • Percent missing for key variables
  • Patterns of missingness (by treatment group)

Produce a table of missingness for each variable, and by treatment group.
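
The two tables can be sketched as follows. This is an illustration under assumptions: the variable names and values are made up, and `None` stands in for however missing values are coded in the real file (blank cells, `NA`, or a sentinel such as -99, which the data dictionary should clarify).

```python
from collections import defaultdict

# Toy records; None marks a missing value. Names and values are made up.
rows = [
    {"treatment": "active",  "bmi": 24.1, "calories": 2100, "mvpa_min": 35},
    {"treatment": "active",  "bmi": 27.3, "calories": None, "mvpa_min": 60},
    {"treatment": "control", "bmi": None, "calories": 2400, "mvpa_min": None},
    {"treatment": "control", "bmi": 31.7, "calories": 1850, "mvpa_min": None},
]
KEY_VARS = ["bmi", "calories", "mvpa_min"]

def pct_missing(records, var):
    """Percent of records in which var is missing (None)."""
    return 100.0 * sum(r[var] is None for r in records) / len(records)

# Table 1: percent missing for each key variable, all participants.
overall = {v: pct_missing(rows, v) for v in KEY_VARS}

# Table 2: the same percentages computed within each treatment group.
by_group = defaultdict(list)
for r in rows:
    by_group[r["treatment"]].append(r)
grouped = {g: {v: pct_missing(recs, v) for v in KEY_VARS}
           for g, recs in by_group.items()}

print("overall:", overall)
for g, pcts in grouped.items():
    print(g, pcts)
```

A large gap between groups for the same variable is worth raising with the investigator, since it suggests the missingness may be related to treatment.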

Task 3. Identify Outliers

Produce boxplots for BMI, calories, and one PA measure. Look at:

  • Extreme values
  • Impossible values
  • Implausible values

Explain how you identified outliers and whether you think these are real values or data entry errors.
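
One common way to make "outlier" concrete is the rule that most boxplots draw: points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule and separately flags values outside hard limits. The BMI values are made up, and the hard limits of 10 and 100 are placeholders for illustration, not clinical guidance; choose limits you can justify.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify(values, hard_min, hard_max):
    """Split flagged values into two lists.

    impossible: outside the hard limits supplied by the analyst.
    extreme:    inside the hard limits but outside the Tukey fences.
    """
    low, high = tukey_fences(values)
    impossible = [v for v in values if not hard_min <= v <= hard_max]
    extreme = [v for v in values
               if hard_min <= v <= hard_max and (v < low or v > high)]
    return impossible, extreme

# Toy BMI values (made up). Hard limits are illustrative placeholders only.
bmi = [2.4, 18.5, 22.0, 24.1, 25.3, 27.3, 29.8, 31.7, 72.0]
impossible, extreme = classify(bmi, hard_min=10, hard_max=100)
print("fences:", tukey_fences(bmi))
print("impossible:", impossible)
print("extreme but possible:", extreme)
```

Plotting tools may compute quartiles with a slightly different convention, so the points flagged here can differ a little from the ones drawn on a boxplot.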

Task 4. Explore Distributions

Produce histograms for BMI, calories, and one PA measure. Look at:

  • Shape
  • Skewness

For each of these variables, explain whether you think it follows a common distribution and, if so, which one.
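
A numeric skewness value can back up what you see in a histogram. The sketch below uses the moment-based definition, skewness = m3 / m2^1.5, where m2 and m3 are the second and third central moments; other software may apply a small-sample correction, so values can differ slightly. The data are made up.

```python
import math
import statistics

def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Toy activity minutes (made up): most values are small, one is very large.
minutes = [5, 10, 12, 15, 20, 30, 90]

raw_skew = skewness(minutes)
log_skew = skewness([math.log(x) for x in minutes])  # requires all values > 0

print(f"mean = {statistics.fmean(minutes):.1f}, median = {statistics.median(minutes)}")
print(f"skewness (raw) = {raw_skew:.2f}")
print(f"skewness (log) = {log_skew:.2f}")
```

A mean well above the median and a positive skewness both point to a long right tail. Comparing the raw and log-transformed skewness is one quick way to see whether a transformation brings the variable closer to symmetry.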

Task 5. Evaluate Internal Consistency

Select two PA measures and look at:

  • A scatter plot for the relationship between the two variables
  • The value of the correlation between the two variables

State how strongly these variables, which measure the same concept, are related.
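
The correlation can be sketched as below, assuming Pearson's correlation is the measure you want (a rank-based correlation is a reasonable alternative for skewed data). The variable names and values are made up. Two practical points are built in: only participants with both measures contribute, and the two measures may be in different units (minutes versus hours), which does not affect the correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def complete_pairs(x, y):
    """Keep only positions where both measures are present (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Toy values (made up): accelerometer MVPA in minutes, survey activity in hours.
mvpa_min = [35, 60, 12, None, 45, 20]
survey_hr = [1.0, 1.5, 0.5, 2.0, None, 0.8]

x, y = complete_pairs(mvpa_min, survey_hr)
r = pearson(x, y)
print(f"n = {len(x)} complete pairs, r = {r:.2f}")
```

Report the number of complete pairs alongside the correlation, since heavy missingness in one measure can leave the estimate resting on few participants.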

Task 6. Explore Basic Relationships

Examine the relationship between treatment group and one outcome, producing:

  • A graphic of the relationship
  • Means/medians of the outcome by the levels of treatment group

State whether this relationship was expected, and which statistical test you could perform to evaluate it further.
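
The group summaries can be sketched as follows. This is a minimal illustration with made-up BMI values and hypothetical group labels; missing outcomes are skipped, so report the group sizes that remain.

```python
import statistics
from collections import defaultdict

# Toy (treatment group, BMI) pairs; labels and values are made up.
rows = [
    ("active", 24.0), ("active", 26.0), ("active", 28.0),
    ("control", 27.0), ("control", 29.0), ("control", 34.0),
    ("control", None),  # missing outcome, excluded below
]

# Collect the observed outcome values for each treatment level.
groups = defaultdict(list)
for group, bmi in rows:
    if bmi is not None:
        groups[group].append(bmi)

summary = {
    g: {"n": len(v),
        "mean": statistics.fmean(v),
        "median": statistics.median(v)}
    for g, v in groups.items()
}

for g, stats in summary.items():
    print(g, stats)
```

When the mean and median of a group differ noticeably, the outcome is probably skewed within that group, which is relevant when you choose how to compare the groups.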

Task 7. Calculated Variable Checks

Compute the percentage of calories from each macronutrient (carbs, protein, fat). As a note, each gram of carbs or protein contributes 4 calories, and each gram of fat contributes 9 calories.

  • Check whether the sum of the calories from each macronutrient equals the total calories
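
The arithmetic can be sketched as below, using the 4, 4, and 9 calories per gram stated above. The function names, the toy records, and the 5% tolerance are assumptions for illustration; percentages are taken relative to the reported total, so a sum far from 100 signals a mismatch. Small gaps can come from rounding, and larger ones from calorie sources outside these three macronutrients (alcohol, for example), so decide on a tolerance you can defend.

```python
KCAL_PER_GRAM = {"carbs": 4, "protein": 4, "fat": 9}

def macro_kcal(grams):
    """Calories contributed by each macronutrient, from grams."""
    return {name: grams[name] * KCAL_PER_GRAM[name] for name in KCAL_PER_GRAM}

def macro_percent(grams, reported_kcal):
    """Percent of the reported total calories from each macronutrient."""
    kcal = macro_kcal(grams)
    return {name: 100.0 * value / reported_kcal for name, value in kcal.items()}

def is_consistent(grams, reported_kcal, rel_tol=0.05):
    """True if the macronutrient sum is within rel_tol of the reported total.

    rel_tol=0.05 is an arbitrary illustrative tolerance, not a standard.
    """
    if reported_kcal is None or reported_kcal <= 0:
        return False
    total = sum(macro_kcal(grams).values())
    return abs(total - reported_kcal) <= rel_tol * reported_kcal

# Toy records (made up): grams of each macronutrient and reported total kcal.
person_a = {"carbs": 250, "protein": 100, "fat": 70}   # 1000 + 400 + 630 kcal
print(macro_percent(person_a, reported_kcal=2030))
print(is_consistent(person_a, reported_kcal=2030))
print(is_consistent(person_a, reported_kcal=2500))
```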

Task 8. Implausible Non-Outliers

Identify individuals with calorie intake <500 kcal/day, or other unusual values, providing:

  • The number of individuals that fall into this category
  • A description of the relationship between calorie intake and a related variable, such as BMI or PA

Comment on the plausibility of these values in light of real-world context.
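
Counting and listing the flagged individuals can be sketched as follows. The 500 kcal/day cutoff comes from the task; the field names and values are made up. Listing the flagged records next to a related variable makes it easier to judge whether a low intake is believable for that person.

```python
LOW_KCAL = 500  # kcal/day cutoff from the task description

# Toy records (made up); None marks a missing value.
rows = [
    {"id": 1, "calories": 2100, "bmi": 24.1, "mvpa_min": 35},
    {"id": 2, "calories": 450,  "bmi": 31.7, "mvpa_min": 60},
    {"id": 3, "calories": None, "bmi": 27.3, "mvpa_min": 20},
    {"id": 4, "calories": 300,  "bmi": 22.5, "mvpa_min": 90},
    {"id": 5, "calories": 1850, "bmi": 29.0, "mvpa_min": 10},
]

# Missing intake is skipped rather than treated as low intake.
low_intake = [r for r in rows
              if r["calories"] is not None and r["calories"] < LOW_KCAL]

print(f"{len(low_intake)} of {len(rows)} individuals report < {LOW_KCAL} kcal/day")
for r in low_intake:
    print(f"id={r['id']}  calories={r['calories']}  "
          f"bmi={r['bmi']}  mvpa_min={r['mvpa_min']}")
```

These values would not necessarily show up as boxplot outliers, which is why a threshold grounded in real-world context is checked separately.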

In Class

In this week's class, we will review the analyses you and your group performed.

Reflection

There can be a lot to consider when it comes to initial data analysis. What strategies can you use to avoid becoming overwhelmed by the process?

If you could ask the investigator one question based on your findings, what would it be?

Supplemental Readings

📖 Ten Simple Rules for Initial Data Analysis (20 minutes)

📖 Exploratory Data Analysis (30 minutes)