Data collection and analysis are crucial steps in data journalism. They involve identifying relevant sources, assessing quality, and cleaning raw information. This process transforms messy data into structured insights, setting the stage for meaningful storytelling.
Journalists must navigate legal and ethical considerations while employing various analytical techniques. From descriptive statistics to machine learning, these methods uncover patterns and relationships in data. Effective interpretation and communication of findings bring the story to life for readers.
Data Sources for Storytelling
Identifying Relevant Data
- Data sources can include government databases, academic research, surveys, web scraping, APIs, and more (a minimal API-pull sketch appears after this list)
- The choice of data source depends on the specific story angle and information required (government database for crime rates, surveys for public opinion)
- Understanding the level of detail and granularity of the data is important for determining its usefulness (individual-level data vs. aggregated statistics)
- Combining and cross-referencing multiple data sources can provide a more comprehensive understanding of an issue (census data and health records to explore socioeconomic disparities in health outcomes)
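Where a story relies on an open-data API, a short script can pull the records and preserve an untouched copy of the raw response. A minimal sketch in Python, assuming a hypothetical endpoint URL and query parameters; any real API will have its own authentication, pagination, and rate-limit rules:

```python
import requests
import pandas as pd

# Hypothetical open-data endpoint; substitute the real API for your story
URL = "https://data.example.org/api/crime_incidents"

# Request one batch of records; real APIs often require keys and pagination
response = requests.get(URL, params={"year": 2023, "limit": 1000}, timeout=30)
response.raise_for_status()  # fail loudly if the request did not succeed

# Many open-data APIs return JSON records that load cleanly into a DataFrame
records = response.json()
df = pd.DataFrame(records)

# Keep the untouched raw pull alongside any later cleaned versions
df.to_csv("raw_crime_incidents_2023.csv", index=False)
print(df.head())
```

Saving the raw pull separately from any cleaned version makes the later cleaning and analysis steps easier to document and reproduce.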
Assessing Data Quality and Credibility
- Factors to consider when assessing quality include the reputation of the source, the methodology of data collection, the sample size and representativeness, the currency of the data, and any potential biases
- Assessing the credibility of data sources involves researching the organization or individuals behind the data and their data collection methods (peer-reviewed academic studies vs. self-reported surveys)
- Sample size and representativeness impact the generalizability of the data (a large, nationally representative survey vs. a small convenience sample)
- The currency of the data is important for ensuring the insights are up-to-date and relevant (using the most recent census data vs. outdated figures)
Legal and Ethical Considerations
- Some data sources may have usage restrictions or require permissions/licenses (proprietary datasets, sensitive personal information)
- It's important to understand any legal or ethical considerations around data access and use, such as privacy regulations, intellectual property rights, and informed consent
- Ethical considerations may include protecting the privacy and confidentiality of individuals in the data, especially for sensitive topics (anonymizing data, secure storage)
- Journalists have a responsibility to use data ethically and avoid misrepresentation or misleading conclusions (presenting data in context, acknowledging limitations)
Data Cleaning and Validation
Data Cleaning Techniques
- Data cleaning involves identifying and correcting errors, inconsistencies, and missing values in the raw data
- Standardizing formats ensures consistency across the dataset (converting all dates to the same format, harmonizing categorical variables); a pandas sketch of these steps follows this list
- Handling outliers requires careful consideration of whether they are genuine anomalies or data entry errors (investigating extreme values, deciding to include or exclude them)
- Dealing with duplicates involves identifying and removing or consolidating repeated entries (multiple records for the same individual or event)
- Documenting the cleaning steps is important for transparency and reproducibility (noting assumptions made, transformations applied)
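The cleaning steps above map directly onto a few pandas operations. A minimal sketch, assuming a hypothetical incidents.csv with date, category, and value columns; the column names and cleaning rules are illustrative only:

```python
import pandas as pd

# Hypothetical raw file; real datasets will have their own columns and quirks
df = pd.read_csv("incidents.csv")

# Standardize formats: parse all dates into one type, harmonize category labels
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # unparseable dates become NaT
df["category"] = df["category"].str.strip().str.lower()

# Surface missing values rather than silently dropping rows
print("Missing values per column:\n", df.isna().sum())

# Flag potential outliers for manual review before deciding to keep or exclude them
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers flagged for review")

# Remove exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Save the cleaned copy separately from the raw file for transparency
df.to_csv("incidents_clean.csv", index=False)
```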
Data Processing and Transformation
- Data processing transforms the cleaned data into a structured format suitable for analysis
- Normalizing values puts variables on a common scale for comparability (converting raw counts to percentages or rates); a pandas sketch of these transformations follows this list
- Aggregating data involves summarizing individual-level data at a higher level (calculating total sales by region or average test scores by school)
- Calculating derived variables creates new metrics based on existing data (computing body mass index from height and weight, creating categorical age groups from continuous age data)
- Merging datasets combines data from different sources based on common identifiers (joining customer information with transaction records using a unique ID)
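A brief sketch of these transformations in pandas, assuming hypothetical schools.csv and districts.csv files with the columns noted in the comments:

```python
import pandas as pd

# Hypothetical inputs: per-school test results and district-level metadata
schools = pd.read_csv("schools.csv")      # school_id, district_id, students, passes
districts = pd.read_csv("districts.csv")  # district_id, district_name, population

# Normalize: convert raw pass counts into rates so schools of different sizes compare
schools["pass_rate"] = schools["passes"] / schools["students"]

# Derive a categorical variable (size band) from a continuous one (enrollment)
schools["size_band"] = pd.cut(
    schools["students"],
    bins=[0, 300, 800, float("inf")],
    labels=["small", "medium", "large"],
)

# Aggregate school-level data up to the district level
by_district = (
    schools.groupby("district_id")
    .agg(avg_pass_rate=("pass_rate", "mean"), total_students=("students", "sum"))
    .reset_index()
)

# Merge the aggregates with district metadata on the common identifier
result = by_district.merge(districts, on="district_id", how="left")
print(result.head())
```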
Data Validation Techniques
- Data validation checks the accuracy, completeness, and consistency of the processed data
- Cross-referencing with other sources helps verify the data's validity (comparing reported crime statistics with police records)
- Checking for logical inconsistencies identifies implausible or contradictory values (a person's birth date being later than their registration date); a sketch of automated checks follows this list
- Verifying calculations ensures derived variables and summary statistics are computed correctly (manually checking a sample of calculated values)
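Several of these checks can be automated as simple assertions over the processed data. A minimal sketch, assuming a hypothetical people_clean.csv with person_id, birth_date, registration_date, and a derived age_at_registration column:

```python
import pandas as pd

# Hypothetical processed dataset with parsed date columns
people = pd.read_csv("people_clean.csv", parse_dates=["birth_date", "registration_date"])

# Logical inconsistency: a birth date should never fall after the registration date
bad_dates = people[people["birth_date"] > people["registration_date"]]
print(f"{len(bad_dates)} records with birth_date after registration_date")

# Completeness: key identifying fields should not be missing
print(f"{people['person_id'].isna().sum()} records missing a person_id")

# Verify a calculation: recompute the derived age column on a random sample
sample = people.sample(n=min(20, len(people)), random_state=1)
recomputed = (sample["registration_date"] - sample["birth_date"]).dt.days // 365
mismatches = (recomputed != sample["age_at_registration"]).sum()
print(f"{mismatches} of {len(sample)} sampled rows fail the age check")
```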
Exploratory Data Analysis (EDA)
- EDA is used to understand the structure, patterns, and relationships in the data before formal analysis
- Calculating summary statistics provides a quantitative overview of the data (mean, median, standard deviation, range); a short EDA sketch follows this list
- Visualizing distributions and trends helps identify patterns and outliers (histograms, box plots, line graphs)
- Identifying unusual observations or potential issues guides further investigation and cleaning (detecting clusters, gaps, or anomalies in the data)
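A short EDA pass in pandas and matplotlib, assuming the hypothetical incidents_clean.csv from the cleaning sketch above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned dataset with a numeric "value" column and a "date" column
df = pd.read_csv("incidents_clean.csv", parse_dates=["date"])

# Summary statistics: count, mean, std, min, quartiles, max in one call
print(df["value"].describe())

# Distribution: histograms reveal skew, gaps, and outliers
df["value"].plot(kind="hist", bins=30, title="Distribution of values")
plt.xlabel("Value")
plt.show()

# Trend: monthly averages reveal seasonality or sudden jumps worth investigating
monthly = df.groupby(df["date"].dt.to_period("M"))["value"].mean()
monthly.plot(kind="line", marker="o", title="Monthly average value")
plt.ylabel("Average value")
plt.show()
```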
Data Analysis Techniques
Choosing Appropriate Analytical Methods
- The choice of analytical method depends on the type of data, the research question, and the desired output
- Numerical data can be analyzed using statistical methods like regression analysis or hypothesis testing (housing prices, test scores)
- Categorical data requires methods suitable for discrete variables, such as chi-square tests or logistic regression (survey responses, patient outcomes)
- The research question guides the selection of methods (descriptive analysis for understanding patterns, predictive modeling for forecasting future trends)
Descriptive and Inferential Statistics
- Descriptive statistics summarize and describe key features of the data
- Measures of central tendency provide a typical or central value (mean income, median age)
- Measures of dispersion quantify the spread or variability in the data (range of temperatures, standard deviation of test scores)
- Hypothesis testing assesses whether observed differences or relationships are statistically significant
- Null and alternative hypotheses are formulated, and an appropriate test statistic is selected (t-test for comparing means, chi-square test for independence); a worked t-test sketch follows this list
- P-values indicate the probability of observing results at least as extreme as those found, assuming the null hypothesis is true (a p-value below 0.05 is often treated as statistically significant)
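A worked example of the descriptive-then-inferential sequence, using simulated test scores purely for illustration (the numbers are not real data):

```python
import numpy as np
from scipy import stats

# Simulated scores for two districts; real analyses would load actual records
rng = np.random.default_rng(42)
district_a = rng.normal(loc=72, scale=10, size=200)
district_b = rng.normal(loc=75, scale=10, size=220)

# Descriptive statistics: central tendency and dispersion
print(f"District A: mean={district_a.mean():.1f}, sd={district_a.std(ddof=1):.1f}")
print(f"District B: mean={district_b.mean():.1f}, sd={district_b.std(ddof=1):.1f}")

# Inferential step: two-sample t-test of the null hypothesis that the means are equal
t_stat, p_value = stats.ttest_ind(district_a, district_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A p-value below 0.05 is conventionally treated as statistically significant, but it says nothing about whether the difference is large enough to matter for the story; that judgment requires context.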
Regression Analysis and Machine Learning
- Regression analysis models the relationship between a dependent variable and one or more independent variables
- Linear regression is used for continuous outcomes (predicting housing prices based on square footage and number of bedrooms); a regression and clustering sketch follows this list
- Logistic regression is used for binary outcomes (predicting the likelihood of a customer churning based on their demographics and purchase history)
- Regression coefficients indicate the strength and direction of relationships (a positive coefficient suggests a positive association between variables)
- Machine learning methods can uncover patterns and make data-driven predictions
- Clustering algorithms group similar data points together (segmenting customers based on their browsing and purchase behavior)
- Classification methods predict categorical outcomes (identifying spam emails based on text features)
- Prediction models forecast future values or events (estimating future sales based on historical data and market trends)
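A compact sketch of these methods with scikit-learn, using simulated housing data purely for illustration; the variables and thresholds are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Simulated housing data: square footage, bedrooms, and a noisy price
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=300)
bedrooms = rng.integers(1, 6, size=300)
price = 50_000 + 120 * sqft + 15_000 * bedrooms + rng.normal(0, 20_000, size=300)
X = np.column_stack([sqft, bedrooms])

# Linear regression for a continuous outcome: coefficients show direction and strength
linreg = LinearRegression().fit(X, price)
print("Estimated price change per extra square foot:", round(linreg.coef_[0], 1))

# Logistic regression for a binary outcome: did the home sell above a threshold?
sold_high = (price > 300_000).astype(int)
logreg = LogisticRegression(max_iter=1000).fit(X, sold_high)
prob = logreg.predict_proba([[2000, 3]])[0, 1]
print(f"Predicted probability for a 2,000 sq ft, 3-bed home: {prob:.2f}")

# Clustering groups similar homes without using any outcome labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Homes per cluster:", np.bincount(clusters))
```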
Data Visualization Techniques
- Data visualization communicates insights from the analysis clearly and effectively
- Effective visualizations are accurate, clear, and tailored to the intended audience (simple charts for general readers, more complex diagrams for expert audiences)
- Bar charts compare categories or show distributions (number of students in each grade level, percentage of respondents agreeing with a statement); a charting sketch follows this list
- Line graphs display trends or changes over time (stock prices, daily temperature readings)
- Scatterplots show relationships between two continuous variables (correlation between height and weight)
- Maps display geographic patterns or spatial relationships (crime rates by neighborhood, election results by county)
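A small matplotlib sketch of two of these chart types, using made-up survey figures purely for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative figures: share of respondents agreeing, by age group
agreement = pd.Series({"18-29": 0.62, "30-44": 0.55, "45-64": 0.48, "65+": 0.41})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compare categories
agreement.plot(kind="bar", ax=ax1, color="steelblue")
ax1.set_title("Agreement by age group")
ax1.set_ylabel("Share agreeing")

# Line graph: display a trend over time (illustrative monthly values)
trend = pd.Series([0.44, 0.47, 0.51, 0.53, 0.52, 0.56],
                  index=pd.period_range("2024-01", periods=6, freq="M"))
trend.plot(ax=ax2, marker="o")
ax2.set_title("Agreement over time")
ax2.set_ylabel("Share agreeing")

plt.tight_layout()
plt.show()
```

Whatever the chart type, labeling axes, stating units, and citing the data source on the graphic itself helps readers judge the evidence for themselves.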
Interpreting Data Insights
Contextualizing Findings
- Interpretation involves explaining what the analysis results mean in the context of the original research question or story idea
- Domain knowledge and critical thinking are required to assess the practical significance of the findings (understanding industry benchmarks, historical trends)
- Putting the findings in a broader context helps to convey their significance (comparing results to national averages, discussing implications for policy or society)
- Comparing the results to previous research or established knowledge can provide validation or highlight new insights (confirming earlier studies, identifying emerging trends)
Drawing Meaningful Conclusions
- Conclusions should be supported by the evidence from the data analysis
- Avoid overstating or generalizing the results beyond what the data can support (claiming causation when only correlation is shown)
- Acknowledge limitations of the data and analysis (potential biases, missing data, uncontrolled confounding factors)
- Discuss assumptions made in the analysis and their potential impact on the conclusions (assuming a linear relationship, excluding outliers)
- Consider alternative explanations or interpretations of the findings (exploring competing hypotheses, discussing potential unmeasured variables)
Communicating Insights Effectively
- Actionable insights and recommendations make the conclusions more impactful
- Suggest policy changes or interventions based on the findings (recommending increased funding for programs that demonstrate positive outcomes)
- Identify areas for further investigation or research (proposing follow-up studies to explore unexpected results or unanswered questions)
- Provide guidance for decision-making or problem-solving (offering data-driven recommendations for resource allocation or targeted interventions)
- Use clear language and avoid technical jargon to make the insights accessible to the intended audience (explaining statistical concepts in plain terms)
- Provide examples or analogies to illustrate complex ideas (comparing percentages to fractions of a dollar, using relatable scenarios)
- Employ storytelling techniques to make the findings memorable and engaging (highlighting individual cases, creating a narrative arc)