C S 8A: INTRODUCTION TO DATA SCIENCE

Foothill College Course Outline of Record

Effective Term: Summer 2024
Units: 4.5
Hours: 4 lecture, 2 laboratory per week (72 total per quarter)
Advisory: Students will benefit from some experience with computer programming or statistics; demonstrated proficiency in English by placement via multiple measures OR through an equivalent placement process OR completion of ESLL 125 & ESLL 249.
Degree & Credit Status: Degree-Applicable Credit Course
Foothill GE: Non-GE
Transferable: CSU/UC
Repeatability: Not Repeatable

Student Learning Outcomes

• A successful student will be able to transform raw data into a more interpretable format by creating tables, charts, and plots using a modern software language.
• A successful student will be able to analyze data using simulation models and statistical techniques such as calculation of summary statistics, calculation of confidence intervals, and regression, and will be able to interpret findings from these techniques.
• A successful student will be able to explain key data science concepts such as correlation vs. causation, randomness, sampling, and uncertainty.

Description

Introduction to the fundamental concepts and computational skills needed to understand and analyze data arising from real-world phenomena. Topics include key data science concepts such as correlation vs. causation, randomness, sampling, uncertainty, predictive models, and classification. Using a tool such as Jupyter notebooks, students write code for transformation and use of data tables, simulation models, and A/B testing.

Course Objectives

The student will be able to:

1. Write and execute code in an environment such as Jupyter notebook.
2. Use expressions, variables, comparisons, control statements, iteration, arrays, and function calls in writing a computer program.
3. Transform raw data into tables and manipulate data tables using a package such as pandas, babypandas, or datascience.
4. Create and interpret a histogram, bar chart, line plot, and scatter plot.
5. Define and use a function in a computer program.
6. Group data by one or more attributes, apply a function ("split-apply-combine"), and interpret the results.
7. Join structured data tables.
8. Calculate probability that an event occurs and describe the situations where probabilities are added vs. multiplied.
9. Explain randomness, sampling, probability distributions, and sample mean at an introductory level.
10. Describe simulation models and the use of the bootstrap.
11. Design, perform, and interpret hypothesis tests using simulation models.
12. Describe the meaning of variability in data.
13. Describe the relationship between sample size, accuracy of an estimate, and margin of error in light of the central limit theorem.
14. Calculate and interpret confidence intervals.
15. Interpret correlation coefficients.
16. Describe how linear and logistic regression can be used for predictive models.
17. Describe the general workings of classification.
18. Distinguish between causation established through randomized experiments and association observed in data, and describe why trends do not necessarily indicate causal relationships.
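
As an illustration of objective 6, the split-apply-combine pattern can be sketched as follows. This is a minimal sketch in plain Python so that it runs without extra packages; in the course itself a table package such as pandas would typically do this work. The sample records and the helper name `group_apply` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group label, numeric value).
rows = [
    ("A", 10.0),
    ("B", 4.0),
    ("A", 14.0),
    ("B", 6.0),
    ("C", 9.0),
]


def group_apply(records, func):
    """Split records by label, apply func to each group's values, combine into a dict."""
    groups = defaultdict(list)
    for label, value in records:  # split: collect values under their label
        groups[label].append(value)
    # apply + combine: one result per group, gathered into a single dict
    return {label: func(values) for label, values in groups.items()}


group_means = group_apply(rows, mean)
group_counts = group_apply(rows, len)
print(group_means)
print(group_counts)
```

Passing a different function (here `mean` or `len`) changes the summary computed per group without changing the grouping step.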

Course Content

1. Observational and experimental data
    1. Treatment/variable/feature, observation, outcome, association
    2. Treatment group, control group, randomization, randomized controlled experiment/trial
    3. Comparison, causality
2. Use of an environment like Jupyter notebook for writing and executing code
3. Introduction to programming
    1. Expressions
    2. Named variables
    3. Call expressions
    4. Numeric and string data types
    5. Comparisons
    6. Arrays
    7. Conditional statements
    8. Iteration
4. Tables
    1. Reading data into a table from a file
    2. Selecting columns
    3. Selecting rows by index or feature
    4. Sorting tables
5. Data visualization
    1. Scatter plots, line plots, and bar charts
    2. Best practices
    3. Binning data
    4. Histograms
    5. Plotting more than one category with scatter plots, line plots, and bar charts
6. Functions
    1. Signature
    2. Docstring
    3. Body
    4. Return statement
7. Applying functions to data tables
    1. Applying a function to a column
    2. Classifying by one variable (split-apply-combine)
    3. Computing counts, summary statistics, or other operations by group
    4. Classifying by more than one variable
    5. Creating pivot tables
    6. Combining information from two or more tables using inner, outer, left, or right join functions
8. Chance
    1. Probability as a fraction
    2. Multiplying probabilities
    4. Probability of at least one event
    5. Randomness
    6. Use of random number generator
9. Sampling and empirical distributions
    1. Sampling at random vs. deterministically
    2. Sampling with and without replacement
    3. Law of averages
    4. Creating a histogram of sampled values
    5. Uniform distribution
    6. Simulations using random sampling
10. Testing hypotheses
    1. Comparing simulation results of numeric variables to expected distributions
    2. Comparing simulation results of categorical variables to expected distributions
    3. Statistical bias
    4. Null vs. alternative hypotheses
    5. Test statistics
    6. P-values
11. Comparing samples
    1. Observational analysis with hypothesis testing
    2. Randomized controlled experiments
    3. Meta-analysis
12. Estimation
    1. Percentiles
    2. Bootstrap
    3. Confidence intervals
13. Central tendency and variability
    1. Mean
    2. Variability
    3. Standard deviation
    4. Normal distribution
    5. Central limit theorem
14. Regression
    1. Correlation
    2. Linear regression
    3. Least squares
    4. Residuals
    5. Regression for prediction and inference
    6. Fitted values
    7. Interpretation of regression coefficients and confidence intervals
15. Classification
    1. Training and testing datasets
    2. Classifier examples: nearest neighbor and decision trees
    3. Measuring accuracy
16. Conditional probability
17. Examples used throughout course
    1. Economic data
    2. Geographic data
    3. Document collections
    4. Social networks
    5. Public health
    6. Sports
    7. Law
    8. Medicine
    9. Science
    10. Literature
18. Other data science issues
    1. Social and legal issues around data analysis
    2. Privacy
    3. Data ownership
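
To illustrate the Chance topics (multiplying probabilities, probability of at least one event, and use of a random number generator), the sketch below compares an exact calculation with a simulated estimate. The dice scenario, the function names, and the seed are assumptions made for this example and are not taken from the course materials.

```python
import random


def exact_at_least_one(p, n):
    """P(at least one success in n independent trials) = 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n


def simulated_at_least_one(n_rolls, trials, rng):
    """Estimate P(at least one six in n_rolls die rolls) by repeated simulation."""
    hits = 0
    for _ in range(trials):
        # One trial: roll the die n_rolls times and check for any six.
        if any(rng.randint(1, 6) == 6 for _ in range(n_rolls)):
            hits += 1
    return hits / trials


rng = random.Random(8)  # fixed seed so the run is repeatable
exact = exact_at_least_one(1 / 6, 4)
estimate = simulated_at_least_one(4, 20_000, rng)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The exact value uses the complement rule: the chance of no six in four rolls is (5/6) multiplied by itself four times, and subtracting that from 1 gives the chance of at least one six. The simulated estimate should land close to it, with the gap shrinking as the number of trials grows.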

Lab Content

1. Familiarization with an environment such as Jupyter
    1. Navigating the environment
    2. Running code
    3. Reading and understanding error messages
2. Expressions
    1. Using mathematical expressions
    2. Defining variables
3. Table operations
    1. Finding total number of columns and rows
    2. Filtering by columns and rows
    3. Creating tables by typing in values or by reading from files
4. Data types and creating and extending tables
    1. String methods
    2. Converting between string and numeric data types
    3. Creating, operating on, and indexing arrays
5. Functions and visualizations
    1. Calling functions
    2. Defining functions
    3. Making functions that call other functions
    4. Applying functions to columns of a table
6. Visualizations
    1. Creating a histogram
    2. Creating a line plot
    3. Creating a scatter plot
7. Conditional statements, iteration, simulation
    1. Writing conditional statements
    2. Creating loops
    3. Generating a random choice
    4. Producing random samples
    5. Building a simulation
8. A/B testing
    1. Designing a simulation
    2. Choosing and applying a test statistic
    3. Interpreting the result
9. Sample means
    1. Determining a sample mean from the results of a simulation
    2. Varying parameters in a simulation to demonstrate concepts related to the Central Limit Theorem
    3. Using bootstrapping to produce confidence intervals
10. Regression
    1. Assessing correlation
    2. Fitting a best fit line to a scatter plot
    3. Using bootstrapping to produce a confidence interval on best fit line slope
11. Conditional probability
12. Other
    1. Importing code modules or libraries
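
As an illustration of the "Sample means" lab topic on using bootstrapping to produce confidence intervals, a minimal sketch of a percentile bootstrap for a sample mean follows. It uses only the Python standard library. The sample values, the helper names `percentile` and `bootstrap_mean_ci`, the number of resamples, and the seed are all hypothetical choices for this example; an actual lab may use a table package and different conventions.

```python
import math
import random
from statistics import mean


def percentile(sorted_values, p):
    """Return the smallest value that is at least as large as p percent of the data."""
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]


def bootstrap_mean_ci(sample, level=95, resamples=2000, rng=None):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = rng or random.Random()
    n = len(sample)
    # Resample with replacement, same size as the original sample, many times.
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(resamples))
    tail = (100 - level) / 2
    return percentile(means, tail), percentile(means, 100 - tail)


# Hypothetical sample of ten measurements.
sample = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]
lower, upper = bootstrap_mean_ci(sample, rng=random.Random(8))
print(f"sample mean = {mean(sample)}, approx. 95% CI = ({lower}, {upper})")
```

The interval endpoints are the 2.5th and 97.5th percentiles of the resampled means, so the middle 95% of bootstrap means falls between them.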

Special Facilities and/or Equipment

1. Instructor access to a cloud provider such as Google Cloud, Microsoft Azure, Amazon EC2, or IBM Cloud.
2. A Kubernetes-based deployment of JupyterHub or similar and an assignment server that loads assignments into the students' environment.
4. Website or course management system with an assignment posting component and a forum component (where students can discuss course material and receive help from the instructor). This applies to all sections, including on-campus (i.e., face-to-face) offerings.
5. When taught via distance learning, the college will provide a fully functional and maintained course management system through which the instructor and students can interact.
6. When taught via distance learning, students must have currently existing email accounts and ongoing access to computers with internet capabilities.

Method(s) of Evaluation

Methods of Evaluation may include but are not limited to the following:

Tests and quizzes
Laboratory assignments and projects which include source code, sample runs, and documentation
Written homework
Final examination

Method(s) of Instruction

Methods of Instruction may include but are not limited to the following:

Lectures which include data science concepts, example code, and analysis of data science examples
Online labs (for all sections, including those meeting face-to-face/on-campus), consisting of:
    1. A programming assignment webpage located on a college-hosted course management system or other department-approved internet environment. Here, the students will review the specification of each programming assignment and submit their completed lab work
    2. A discussion webpage located on a college-hosted course management system or other department-approved internet environment. Here, students can request assistance from the instructor and interact publicly with other class members
Detailed review of programming assignments which includes model solutions and specific comments on the student submissions
In-person or online discussion which engages students and instructor in an ongoing dialog pertaining to all aspects of designing, implementing, and analyzing programs
When course is taught fully online:
    1. Instructor-authored lecture materials, handouts, syllabus, assignments, tests, and other relevant course material will be delivered through a college-hosted course management system or other department-approved internet environment
    2. Additional instructional guidelines for this course are listed in the attached addendum of CS department online practices

Representative Text(s) and Other Materials

Adhikari, Ani, John DeNero, and David Wagner. Computational and Inferential Thinking: The Foundations of Data Science. 2022.