Tables from our analysis
Classes Offered by Subject at ACEJMC-Accredited Journalism Programs
| Number of Classes | Number of Programs | Percent of Total |
| --- | --- | --- |
| Three or more classes | 18 | 16% |
Classes with Data Journalism as a Component
| Number of Classes | Number of Programs | Percent of Total |
| --- | --- | --- |
| Four or more classes | 7 | 6% |
| Number of Classes | Number of Programs | Percent of Total |
| --- | --- | --- |
| Four or more classes | 34 | 30% |
Programming Beyond HTML/CSS
| Number of Classes | Number of Programs | Percent of Total |
| --- | --- | --- |
| Three or more classes | 3 | 3% |
Note: This analysis of programming classes covers only courses taught within a journalism program. A fair number of schools pointed to collaborations with other departments through which journalism students could take advanced programming or computer science classes.
Below we list several examples, for reference, of stories that are emblematic of the categories we define in Chapter 1.
- “Drugging Our Kids,” San Jose Mercury News, 2014
- “Methadone and the Politics of Pain,” The Seattle Times, 2012
Data Visualization and Interactives
- ProPublica’s “Dollars for Docs,” 2010
- The Washington Post’s visualization of the missing Malaysian jet, 2014
Emerging Journalistic Technologies
- “Tanzania: Initiative to Stop the Poaching of Elephants,” CCTV Africa, 2014
- Because of regulatory issues with the Federal Aviation Administration, the use of drones for journalism is not widespread despite significant interest from industry and academia. Once regulations become more permissive, foreseen uses include news photography and videography; scanning news locations for use in 3D models and 360-degree video applications; remote sensing through visible or multispectral images; mapping areas of interest at higher temporal resolution than currently available; and distributing sensors or gathering sensor-based data.
- WNYC’s Cicada Tracker project in 2013 recruited interested listeners to use sensors to identify where cicadas would emerge.
- USA Today’s “Ghost Factories” investigation in 2012 used X-ray gun sensors to scan the soil.
- The Houston Chronicle’s 2005 investigative story “In Harm’s Way” used sensors to examine air quality near oil refineries and factories.
Virtual and Augmented Reality
- The New York Times sent out more than a million Google Cardboard kits to subscribers in 2015 as it launched its first VR story, “The Displaced,” a piece detailing children displaced by war.
- Stanford University’s Department of Communication, home to the Stanford Virtual Human Interaction Lab, has scheduled a VR class for the winter 2016 quarter as part of its curriculum for its master’s in journalism program.
- The 2014 Wall Street Journal investigation into Medicare
- “The Echo Chamber,” a 2014 Reuters investigation into influence at the Supreme Court
- The PDF repository DocumentCloud, or Overview, a document-analysis tool developed by Jonathan Stray
Tools, Resources and Methods Discussed in the Report
The ethics of software may also shape decisions about the tools and techniques you teach. “Free” software is licensed in an effort to promote freedom of computing, in a manner analogous to freedom of speech. Free software may be copied, altered, used, and shared freely. A related form of software licensing, known as “open source,” is very similar to free software but instead emphasizes the public availability of code.
Proprietary software may also have certain advantages. Often the interface design is more polished, support services are provided, and in some cases it simply performs better on demanding tasks.
But the gap between free and proprietary software has narrowed in recent years, and many professionals in fact prefer free and open-source software on more than ideological grounds. F/OSS is often more secure because it can be openly vetted by security researchers. For the same reason, particularly popular applications may have many talented and dedicated developers, as well as a support community of fellow users rather than a call center or online service desk.
Given the expense of proprietary software and its inevitable obsolescence, there are few advantages to using these applications in data and computation classes instead of free and open-source ones.
Guide to Common Tools for Data and Computational Journalism
The following list of common tools for data and computational journalism is quoted from the Lede Program at Columbia.
Git is something called a version control system—it’s not a programming language, but programmers use it often. Version control is a way of keeping track of the history of your code, along with providing a structure that encourages collaboration. GitHub is a popular cloud-based service that makes use of git, and we make heavy use of it during the Lede Program.
HTML isn’t technically a programming language; it’s a markup language: HyperText Markup Language, to be exact. HTML is used to explain to your browser what the different parts of a web page are, and you use it extensively when learning to scrape web pages.
Python is a multipurpose programming language that is equally at home crunching numbers, parsing text, or building Twitter bots. We use Python extensively in the Lede.
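As a small illustration of the kind of text crunching Python makes easy, the sketch below counts word frequencies using only the standard library; the sample sentence is invented for the example.

```python
from collections import Counter

# An invented snippet of text standing in for a real document.
text = "data journalism is journalism done with data"

# Split into words and tally how often each one appears.
counts = Counter(text.split())
print(counts.most_common(2))  # the two most frequent words
```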
R is a programming language that is used widely for mathematical and statistical processing.
Tools for Data and Analysis
Beautiful Soup and lxml are tools used for taking data from the Web and making it accessible to your computer.
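As a rough sketch of what Beautiful Soup does, the example below pulls headlines out of a small invented HTML string (assuming the `bs4` package is installed); with a real story you would download the page first.

```python
from bs4 import BeautifulSoup

# A tiny invented page standing in for a scraped web page.
html = """
<html><body>
  <h2 class="headline">City budget passes</h2>
  <h2 class="headline">New data portal launches</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the text of every <h2 class="headline"> element.
headlines = [h.get_text() for h in soup.find_all("h2", class_="headline")]
print(headlines)
```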
IPython Notebooks provide an interactive programming environment that encourages documentation, transparency, and reproducibility of work. When you’re done with your analysis, you’ll be able to put your work up for everyone to see—and check!
NLTK (Natural Language Toolkit) is a Python library built to process large amounts of text. Whether you’re analyzing congressional bills, Twitter outrages, or Shakespearean plays, NLTK has you covered.
OpenRefine (previously Google Refine) is downloadable software that helps you sort and sift dirty data, cleaning it to the point where you can start your actual analysis.
Pandas is a high-performance data analysis tool for Python.
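A minimal sketch of a typical pandas workflow, using an invented toy data set of incidents by borough: load tabular data into a DataFrame, then aggregate it.

```python
import pandas as pd

# Invented toy data: incident counts by borough.
df = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Bronx", "Queens"],
    "incidents": [3, 5, 2, 4],
})

# Total incidents per borough, one line of analysis.
totals = df.groupby("borough")["incidents"].sum()
print(totals)
```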
QGIS (geographic information system) is an open-source tool used to work with geographic data, from reprojecting and combining data sets to running analyses and making visualizations.
Scikit-learn is a Python package for machine learning and data analysis. It’s the Swiss Army knife of data science: it covers classification, regression, clustering, dimensionality reduction, and so much more.
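To give a flavor of the clustering side, the sketch below runs k-means on a handful of invented 2D points that form two obvious groups; real reporting data would of course be messier.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2D points forming two well-separated clusters.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]])

# Ask k-means to find two clusters; each point gets a cluster label.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = model.labels_
print(labels)
```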
Web scraping is the process of taking information off of websites and making use of it on your computer. Documents often aren’t available in accessible formats, and you need to scrape them in order to process and analyze them.
An API (application programming interface) is a way for computers to communicate to one another. For us, this generally means sharing data. We’ll be coding up Python scripts to talk to and request data from machines around the world, from Twitter to the U.S. government.
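The sketch below shows the shape of an API request without hitting the network: it builds a query URL with the standard library, then parses a canned JSON payload standing in for the server’s reply. The endpoint and parameters are invented for illustration.

```python
import json
from urllib.parse import urlencode

# Build a query URL the way you would for a real API call;
# the base URL and parameters here are hypothetical.
base = "https://api.example.com/v1/records"
url = base + "?" + urlencode({"state": "NY", "year": 2015})

# A canned JSON response standing in for what the server would return.
response_body = '{"count": 2, "results": [{"id": 1}, {"id": 2}]}'
data = json.loads(response_body)
print(url, data["count"])
```

In a real script you would fetch `url` with a library such as `urllib.request` or `requests` and parse the body the same way.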
CSVs (comma-separated values) are the most common format for data. A CSV is a quick export away from Excel or Google Spreadsheets, and you’ll find yourself working with CSVs more often than any other format. Despite the name, the same basic format can use tabs, pipes, or other characters as the field delimiter (a tab-separated file is often called a TSV).
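Reading a CSV takes only a few lines with Python’s standard library; the sketch below uses a small invented file, and shows that switching the delimiter handles a tab-separated variant too.

```python
import csv
import io

# A small invented CSV, read as a list of dictionaries.
raw = "name,city\nAda,London\nGrace,New York\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["city"])

# The same reader handles other delimiters, e.g. a tab-separated file.
tsv = "name\tcity\nAda\tLondon\n"
tsv_rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
```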
GeoJSON and Topojson are specifically formatted JSON files that contain geographic data.
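Because GeoJSON is just JSON with an agreed-upon structure, the standard library can read it directly; the feature below is a minimal invented example, a single point with a name property.

```python
import json

# A minimal invented GeoJSON document: one point feature.
geojson = json.loads("""
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.96, 40.78]},
    "properties": {"name": "Central Park"}}
 ]}
""")

# Walk the features, pulling out each name and its coordinates.
for feature in geojson["features"]:
    print(feature["properties"]["name"], feature["geometry"]["coordinates"])
```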
SQL (Structured Query Language) is a language to talk to databases. You’ll sometimes find data sets in SQL format, ready to be imported into your database system of choice.
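A quick taste of SQL, using Python’s built-in `sqlite3` module so nothing needs installing; the table and figures are invented for the example.

```python
import sqlite3

# An in-memory database with an invented table of grant payments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grants (recipient TEXT, amount INTEGER)")
conn.executemany("INSERT INTO grants VALUES (?, ?)",
                 [("Hospital A", 500), ("Hospital B", 1200)])

# SQL question: which recipients received more than 1000?
rows = conn.execute(
    "SELECT recipient FROM grants WHERE amount > 1000").fetchall()
print(rows)
```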
Tech Team Report
Another useful resource for understanding the tools of data journalism was prepared at Stanford by an interdisciplinary team of computer science and data journalism students in a Spring 2015 course on watchdog reporting. The report is available here: http://cjlab.stanford.edu/tech-team-report/
Online courses and MOOCs
- Doing Journalism with Data: First Steps, Skills and Tools (http://datajournalismcourse.net/)
- School of Data (http://schoolofdata.org/)
- The Knight Center for Journalism in the Americas offers a number of MOOCs as distance learning for journalists (https://knightcenter.utexas.edu/distancelearning)
Useful Data Sets for Classwork and Assignments
- Baby name census data—clean data that varies from year to year and that papers reliably cover (the top 1,000 baby names by year can be found at https://www.ssa.gov/oact/babynames/limits.html)
- Greenhouse gas data (NOAA has a number of searchable datasets at http://www.esrl.noaa.gov/gmd/dv/data/)
- Student grade distributions for your college
- The GRAIN database of land grabs, a small data set used in many of the School of Data examples (http://datahub.io/dataset/grain-landgrab-data/resource/af57b7b2-f4e7-4942-88d3-83912865d116)
- World Bank Open Data (http://data.worldbank.org/)
- The Guardian Databases (http://www.theguardian.com/news/datablog/interactive/2013/jan/14/all-our-datasets-index)
- The Eurostat Databases (http://ec.europa.eu/eurostat/help/new-eurostat-website)
- UK Government Databases (https://data.gov.uk/data/search?res_format=RSS)
- National and International Statistical Services by region and country (https://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services)
- Global Health Observatory Data Repository (http://apps.who.int/gho/data/node.home)
- Business Registry Databases (https://www.investigativedashboard.org/business_registries/)
- Google’s list of Public Data (http://www.google.com/publicdata/directory#)
- Open Spending (https://openspending.org/)
- Datahub (http://datahub.io/)
- Open Access Directory (http://oad.simmons.edu/oadwiki/Main_Page)
- Data Portals (http://dataportals.org/)
- NASA’s Data Portal (https://data.nasa.gov/)
Philip Meyer’s recommended texts
- John Tukey, Exploratory Data Analysis (Upper Saddle River, NJ: Pearson Education, 1977)
- James A. Davis, The Logic of Causal Order (Thousand Oaks, CA: Sage, 1985)
- Robert P. Abelson, Statistics as Principled Argument (Hillsdale, NJ: Lawrence Erlbaum Associates, 1995)
Data Journalism Articles, Projects, and Reading Lists Used in Instruction
- Cairo, Alberto. “Recommended Resources for My Infographics and Visualization Courses.” Personal. The Functional Art: An Introduction to Information Graphics and Visualization, October 11, 2012. http://www.thefunctionalart.com/2012/10/recommended-readings-for-infographics.html.
- “Cameroon—Cameroon Budget Inquirer.” Accessed September 23, 2015. http://cameroon.openspending.org/en/.
- Downs, Kat, Dan Hill, Ted Mellnik, Andrew Metcalf, Cory O’Brien, Cheryl Thompson, and Serdar Tumgoren. “Homicides in the District of Columbia—The Washington Post.” News. The Washington Post, October 14, 2012. http://apps.washingtonpost.com/investigative/homicides/.
- “Find My School .Ke.” Accessed September 23, 2015. http://findmyschool.co.ke/.
- Keefe, John, Steven Melendez, and Louise Ma. “Flooding and Flood Zones | WNYC.” News. WNYC. Accessed September 23, 2015. http://project.wnyc.org/flooding-sandy-new/index.html.
- Kirk, Chris, and Dan Kois. “How Many People Have Been Killed by Guns Since Newtown?” Slate, September 16, 2013. http://www.slate.com/articles/news_and_politics/crime/2012/12/gun_death_tally_every_american_gun_death_since_newtown_sandy_hook_shooting.html.
- Lewis, Jason. “Revealed: The £1 Billion High Cost Lending Industry | The Bureau of Investigative Journalism.” Journalism. The Bureau of Investigative Journalism, June 13, 2013. https://www.thebureauinvestigates.com/2013/06/13/revealed-the-1billion-high-cost-lending-industry/.
- Nguyen, Dan. “Who in Congress Supports SOPA and PIPA/PROTECT-IP? | SOPA Opera.” News. ProPublica, January 20, 2012. http://projects.propublica.org/sopa/.
- Rogers, Simon. “Government Spending by Department, 2011-12: Get the Data.” The Guardian, December 4, 2012, sec. UK news. http://www.theguardian.com/news/datablog/2012/dec/04/government-spending-department-2011-12.
- ———. “John Snow’s Data Journalism: The Cholera Map That Changed the World.” The Guardian, March 15, 2013, sec. News. http://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map.
- ———. “Wikileaks Data Journalism: How We Handled the Data.” The Guardian, January 31, 2011, sec. News. http://www.theguardian.com/news/datablog/2011/jan/31/wikileaks-data-journalism.
- ———. “Wikileaks Iraq War Logs: Every Death Mapped.” The Guardian, October 22, 2010. http://www.theguardian.com/world/datablog/interactive/2010/oct/23/wikileaks-iraq-deaths-map.
- Rogers, Simon, and John Burn-Murdoch. “Superstorm Sandy: Every Verified Event Mapped and Detailed.” The Guardian, October 30, 2012. http://www.theguardian.com/news/datablog/interactive/2012/oct/30/superstorm-sandy-incidents-mapped.
- Serra, Laura, Maia Jastreblansky, Ivan Ruiz, Ricardo Brom, and Mariana Trigo Viera. “Argentina’s Senate Expenses 2004-2013.” News. La Nacion, April 3, 2013. http://blogs.lanacion.com.ar/ddj/data-driven-investigative-journalism/argentina-senate-expenses/.
- Shaw, Al, Jeremy B. Merrill, and Amanda Zamora. “Free the Files: Help ProPublica Unlock Political Ad Spending.” ProPublica, September 4, 2015. https://projects.propublica.org/free-the-files/.
- “Where Does My Money Go?” Accessed September 23, 2015. http://wheredoesmymoneygo.org/.
Lede Program Curriculum
The Lede Program at Columbia Journalism School is a post-baccalaureate program in which students from a variety of backgrounds learn data and computation skills over the course of one or two semesters. The program was designed to help students rapidly elevate their skills in these areas, especially if they were considering applying for Columbia’s highly demanding dual-degree program in journalism and computer science.
In the context of this report, the one-semester version of the Lede represents a promising “extended boot camp” in which students who have been accepted into a data journalism master’s program may attend for a full summer before their peers in order to develop the skills that will help them get the most out of their education.
The following course descriptions were pulled on November 5, 2015, from: http://www.journalism.columbia.edu/page/1060-the-lede-program-courses/908
Foundations of Computing
During this introduction to the ins and outs of the Python programming language, students build a foundation upon which their later, more coding-intensive classes will depend. Dirty, real-world data sets will be cleaned, parsed and processed while recreating modern journalistic projects. The course will also touch upon basic visualization and mapping, and how to use public resources such as Google and Stack Overflow to build self-reliance.
Focus: Familiarize yourself with the data-driven landscape
Topics & tools include: Python, basic statistical analysis, OpenRefine, CartoDB, pandas, HTML, CSVs, algorithmic story generation, narrative workflow, csvkit, git/GitHub, Stack Overflow, data cleaning, command line tools, and more
Data and Databases
Students will become familiar with a variety of data formats and methods for storing, accessing and processing information. Topics covered include comma-separated documents, interaction with website APIs and JSON, raw-text document dumps, regular expressions, text mining, SQL databases, and more. Students will also tackle less accessible data by building web scrapers and converting difficult-to-use PDFs into usable information.
Focus: Finding and working with data
Topics & tools include: SQL, APIs, CSVs, regular expressions, text mining, PDF processing, pandas, Python, HTML, Beautiful Soup, IPython Notebooks, and more
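One of the topics listed above, regular expressions, can be sketched in a few lines; the text being searched is invented for the example.

```python
import re

# Invented raw text of the kind you might pull from a document dump.
text = "Contract awarded 2014-03-12; renewed 2015-07-01."

# Find every date written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)
```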
Machine learning and data science are integral to processing and understanding large data sets. Whether you’re clustering schools or crime data, analyzing relationships between people or businesses, or searching for a single fact in a large data set, algorithms can help. Through supervised and unsupervised learning, students will generate leads, create insights, and figure out how to best focus their efforts with large data sets. A critical eye toward applications of algorithms will also be developed, uncovering the pitfalls and biases to look for in your own and others’ work.
Focus: Analyzing your data
Topics & tools include: linear regression, clustering, text mining, natural language processing, decision trees, machine learning, scikit-learn, Python, and more
Data Analysis Studio
In this project-driven course, students refine their creative workflow on personal work, from obtaining and cleaning data to final presentation. Data is explored not only as the basis for visualization, but also as a lead-generating foundation, requiring further investigative or research-oriented work. Regular critiques from instructors and visiting professionals are a critical piece of the course.
Focus: Applying your skillset
Topics & tools include: Tableau, web scraping, mapping, CartoDB, GIS/QGIS, data cleaning, documentation, and more