# C S 8A: INTRODUCTION TO DATA SCIENCE

## Foothill College Course Outline of Record

| Heading | Value |
|---|---|
| Effective Term | Summer 2024 |
| Units | 4.5 |
| Hours | 4 lecture, 2 laboratory per week (72 total per quarter) |
| Advisory | Students will benefit from some experience with computer programming or statistics; demonstrated proficiency in English by placement via multiple measures OR through an equivalent placement process OR completion of ESLL 125 & ESLL 249. |
| Degree & Credit Status | Degree-Applicable Credit Course |
| Foothill GE | Non-GE |
| Transferable | CSU/UC |
| Grade Type | Letter Grade (Request for Pass/No Pass) |
| Repeatability | Not Repeatable |

## Student Learning Outcomes

- A successful student will be able to transform raw data to a more interpretable format by creating tables, charts, and plots using a modern software language.
- A successful student will be able to analyze data using simulation models and statistical techniques such as calculation of summary statistics, calculation of confidence intervals, and regression, and will be able to interpret findings from these techniques.
- A successful student will be able to explain key data science concepts such as correlation vs. causation, randomness, sampling, and uncertainty.

## Description

## Course Objectives

The student will be able to:

- Write and execute code in an environment such as Jupyter notebook.
- Use expressions, variables, comparisons, control statements, iteration, arrays, and function calls in writing a computer program.
- Transform raw data into tables and manipulate data tables using a package such as pandas, babypandas, or datascience.
- Create and interpret a histogram, bar chart, line plot, and scatter plot.
- Define and use a function in a computer program.
- Group data by one or more attributes, apply a function ("split-apply-combine"), and interpret the results.
- Join structured data tables.
- Calculate probability that an event occurs and describe the situations where probabilities are added vs. multiplied.
- Explain randomness, sampling, probability distributions, and sample mean at an introductory level.
- Describe simulation models and the use of bootstrap.
- Design, perform, and interpret hypothesis tests using simulation models.
- Describe the meaning of variability in data.
- Describe the relationship between sample size, accuracy of an estimate, and margin of error in light of the central limit theorem.
- Calculate and interpret confidence intervals.
- Interpret correlation coefficients.
- Describe how linear and logistic regression can be used for predictive models.
- Describe the general workings of classification.
- Distinguish between causation measured through randomized experiments vs. association observed and describe why trends do not necessarily describe causal scenarios.

## Course Content

- Observational and experimental data
  - Treatment/variable/feature, observation, outcome, association
  - Treatment group, control group, randomization, randomized controlled experiment/trial
  - Comparison, causality

- Use of an environment like Jupyter notebook for writing and executing code
- Introduction to programming
  - Expressions
  - Named variables
  - Call expressions
  - Numeric and string data types
  - Comparisons
  - Arrays
  - Conditional statements
  - Iteration

- Tables
  - Reading data into a table from a file
  - Selecting columns
  - Selecting rows by index or feature
  - Sorting tables

- Data visualization
  - Scatter plots, line plots, and bar charts
  - Best practices
  - Binning data
  - Histograms
  - Plotting more than one category with scatter plots, line plots, and bar charts

- Functions
  - Signature
  - Docstring
  - Body
  - Return statement
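
As an illustration of these four parts, a minimal Python sketch follows. The function name `to_celsius` and its parameter are hypothetical, chosen only for this example.

```python
def to_celsius(fahrenheit):                 # signature: name and parameter
    """Convert a Fahrenheit temperature to Celsius."""  # docstring
    celsius = (fahrenheit - 32) * 5 / 9     # body: the computation
    return celsius                          # return statement


print(to_celsius(212))  # 100.0
```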

- Applying functions to data tables
  - Applying a function to a column
  - Classifying by one variable (split-apply-combine)
  - Computing counts, summary statistics, or other operations by group
  - Classifying by more than one variable
  - Creating pivot tables
  - Combining information from two or more tables using inner, outer, left, or right join functions
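
Split-apply-combine and an inner join can be sketched as follows. In the course these operations would normally use a table package such as pandas; this sketch assumes plain Python dictionaries so that it runs without extra installs, and the city data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tables, each a list of rows.
readings = [
    {"city": "Austin", "temp": 30},
    {"city": "Boston", "temp": 20},
    {"city": "Austin", "temp": 34},
    {"city": "Boston", "temp": 24},
]
states = [
    {"city": "Austin", "state": "TX"},
    {"city": "Denver", "state": "CO"},
]

# Split: collect the temperatures belonging to each city.
groups = defaultdict(list)
for row in readings:
    groups[row["city"]].append(row["temp"])

# Apply and combine: one summary statistic per group.
mean_temp = {city: mean(temps) for city, temps in groups.items()}

# Inner join: keep only cities that appear in both tables.
state_of = {row["city"]: row["state"] for row in states}
joined = [
    {"city": city, "mean_temp": value, "state": state_of[city]}
    for city, value in mean_temp.items()
    if city in state_of
]

print(mean_temp)
print(joined)
```

Boston has no match in `states` and Denver has no readings, so the inner join drops both; a left, right, or outer join would keep one or both of them with missing values.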

- Chance
  - Probability as a fraction
  - Multiplying probabilities
  - Adding probabilities
  - Probability of at least one event
  - Randomness
  - Use of random number generator
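
The rules for multiplying and adding probabilities can be worked through with a fair six-sided die. This is an illustrative sketch; the variable names are not from the outline.

```python
from fractions import Fraction

# Probability as a fraction: one favorable face out of six.
p_six = Fraction(1, 6)

# Multiplying: two independent rolls both show a six.
p_two_sixes = p_six * p_six

# Adding: one roll shows a five or a six (mutually exclusive outcomes).
p_five_or_six = Fraction(1, 6) + Fraction(1, 6)

# At least one six in four rolls: complement of "no six on any roll".
p_at_least_one_six = 1 - (1 - p_six) ** 4

print(p_two_sixes, p_five_or_six, p_at_least_one_six)  # 1/36 1/3 671/1296
```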

- Sampling and empirical distributions
  - Sampling at random vs. deterministically
  - Sampling with and without replacement
  - Law of averages
  - Creating a histogram of sampled values
  - Uniform distribution
  - Simulations using random sampling
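
A short sketch of sampling with and without replacement, and of an empirical distribution approaching the uniform distribution, using Python's standard `random` module. The seed and the number of rolls are arbitrary choices for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]

without_replacement = rng.sample(faces, k=3)  # no face can appear twice
with_replacement = rng.choices(faces, k=3)    # faces may repeat

# Empirical distribution of 6,000 simulated rolls of a fair die.
rolls = rng.choices(faces, k=6000)
proportions = {face: rolls.count(face) / len(rolls) for face in faces}

# Law of averages: each proportion should be close to 1/6.
print(proportions)
```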

- Testing hypotheses
  - Comparing simulation results of numeric variables to expected distributions
  - Comparing simulation results of categorical variables to expected distributions
  - Statistical bias
  - Null vs. alternative hypotheses
  - Test statistics
  - P-values
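
One way to sketch a simulation-based hypothesis test is a coin-fairness check. The observed count of 60 heads in 100 tosses is hypothetical, as are the function name and the choice of 5,000 simulations.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical observation: 60 heads in 100 tosses.
observed_heads = 60
n_tosses = 100


def distance_from_expected(heads):
    """Test statistic: distance between the head count and a fair coin's expected count."""
    return abs(heads - n_tosses / 2)


# Simulate the test statistic under the null hypothesis (the coin is fair).
simulated = []
for _ in range(5000):
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    simulated.append(distance_from_expected(heads))

# Empirical p-value: share of simulated statistics at least as extreme as observed.
observed_stat = distance_from_expected(observed_heads)
p_value = sum(s >= observed_stat for s in simulated) / len(simulated)

print(p_value)
```

A small p-value means the observed statistic would be unusual if the null hypothesis were true; it does not by itself prove the alternative.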

- Comparing samples
  - Observational analysis with hypothesis testing
  - Randomized controlled experiments
  - Meta-analysis

- Estimation
  - Percentiles
  - Bootstrap
  - Confidence intervals
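
Percentiles, the bootstrap, and confidence intervals fit together as in the sketch below. The sample values are hypothetical, and the `percentile` helper is an assumption written for this example (it returns the smallest value at least as large as p percent of the values), not a function from any particular package.

```python
import math
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical sample of ten measurements.
sample = [12, 15, 9, 20, 14, 18, 11, 16, 13, 17]


def percentile(values, p):
    """Smallest value that is at least as large as p percent of the values."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


# Bootstrap: resample the sample with replacement and record each mean.
boot_means = []
for _ in range(2000):
    resample = rng.choices(sample, k=len(sample))
    boot_means.append(sum(resample) / len(resample))

# The middle 95% of the bootstrap means is an approximate 95% confidence interval.
lower = percentile(boot_means, 2.5)
upper = percentile(boot_means, 97.5)

print(lower, upper)
```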

- Central tendency and variability
  - Mean
  - Variability
  - Standard deviation
  - Normal distribution
  - Central limit theorem
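
The link between sample size and the variability of the sample mean can be checked by simulation. In this sketch the population is hypothetical (uniform, so clearly not normal), and the function name and repetition count are choices made for the example.

```python
import random
from statistics import pstdev

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population with a non-normal (uniform) shape.
population = [rng.uniform(0, 100) for _ in range(10_000)]
pop_sd = pstdev(population)


def sd_of_sample_means(n, repetitions=2000):
    """Standard deviation of the means of many random samples of size n."""
    means = []
    for _ in range(repetitions):
        draws = rng.choices(population, k=n)
        means.append(sum(draws) / n)
    return pstdev(means)


sd_25 = sd_of_sample_means(25)
sd_100 = sd_of_sample_means(100)

# Expected pattern: SD of sample means is about pop_sd / sqrt(n),
# so quadrupling the sample size roughly halves the spread.
print(sd_25, pop_sd / 5)
print(sd_100, pop_sd / 10)
```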

- Regression
  - Correlation
  - Linear regression
  - Least squares
  - Residuals
  - Regression for prediction and inference
  - Fitted values
  - Interpretation of regression coefficients and confidence intervals
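
Correlation, the least-squares line, fitted values, and residuals can be computed from first principles as sketched here. The five data points are hypothetical, and `standard_units` is a helper defined for this example.

```python
from statistics import mean, pstdev

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def standard_units(values):
    """Convert values to standard units: distance from the mean in SDs."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Correlation coefficient: mean of the products in standard units.
r = mean([a * b for a, b in zip(standard_units(x), standard_units(y))])

# Least-squares line expressed through r and the two standard deviations.
slope = r * pstdev(y) / pstdev(x)
intercept = mean(y) - slope * mean(x)

fitted = [slope * xi + intercept for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(round(r, 4), round(slope, 2), round(intercept, 2))
```

The residuals of a least-squares line sum to zero (up to rounding), which is a quick sanity check on the fit.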

- Classification
  - Training and testing datasets
  - Classifier examples: nearest neighbor and decision trees
  - Measuring accuracy
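
A minimal sketch of a nearest-neighbor classifier evaluated on a held-out testing set. The points, labels, and names are hypothetical, and the sketch uses a single nearest neighbor rather than a vote among several.

```python
import math

# Hypothetical labeled points: (feature_1, feature_2, label).
training_set = [
    (1.0, 1.0, "a"), (1.5, 2.0, "a"), (2.0, 1.0, "a"),
    (5.0, 5.0, "b"), (6.0, 5.5, "b"), (5.5, 6.5, "b"),
]
testing_set = [(1.2, 1.4, "a"), (5.8, 6.0, "b"), (2.2, 1.8, "a")]


def classify(point):
    """Return the label of the training point closest to `point`."""
    nearest = min(training_set, key=lambda row: math.dist(point[:2], row[:2]))
    return nearest[2]


# Accuracy: proportion of testing points whose predicted label is correct.
correct = sum(classify(row) == row[2] for row in testing_set)
accuracy = correct / len(testing_set)

print(accuracy)  # 1.0
```

Accuracy is measured on the testing set because the classifier has already seen the training labels.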

- Conditional probability
- Examples used throughout course
  - Economic data
  - Geographic data
  - Document collections
  - Social networks
  - Public health
  - Sports
  - Law
  - Medicine
  - Science
  - Literature

- Other data science issues
  - Social and legal issues around data analysis
  - Privacy
  - Data ownership

## Lab Content

- Familiarization with an environment such as Jupyter
  - Navigating the environment
  - Running code
  - Reading and understanding error messages

- Expressions
  - Using mathematical expressions
  - Defining variables

- Table operations
  - Finding total number of columns and rows
  - Filtering by columns and rows
  - Creating tables by typing in values or by reading from files

- Data types and creating and extending tables
  - String methods
  - Converting between string and numeric data types
  - Creating, operating on, and indexing arrays

- Functions and visualizations
  - Calling functions
  - Defining functions
  - Making functions that call other functions
  - Applying functions to columns of a table

- Visualizations
  - Creating a histogram
  - Creating a line plot
  - Creating a scatter plot

- Conditional statements, iteration, simulation
  - Writing conditional statements
  - Creating loops
  - Generating a random choice
  - Producing random samples
  - Building a simulation

- A/B testing
  - Designing a simulation
  - Choosing and applying a test statistic
  - Interpreting the result
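
The three A/B testing steps can be sketched as a permutation test. The two groups of values are hypothetical, and the difference of group means is one reasonable choice of test statistic, not the only one.

```python
import random
from statistics import mean

rng = random.Random(4)  # fixed seed so the run is repeatable

# Hypothetical outcomes for two groups.
group_a = [23, 25, 28, 30, 26, 27, 24, 29]
group_b = [31, 33, 29, 35, 32, 30, 34, 36]


def difference_of_means(a, b):
    """Test statistic: mean of group B minus mean of group A."""
    return mean(b) - mean(a)


observed = difference_of_means(group_a, group_b)

# Simulation under the null hypothesis: group labels do not matter,
# so shuffle the pooled values and split them into two new groups.
pooled = group_a + group_b
simulated = []
for _ in range(2000):
    shuffled = rng.sample(pooled, k=len(pooled))
    new_a = shuffled[:len(group_a)]
    new_b = shuffled[len(group_a):]
    simulated.append(difference_of_means(new_a, new_b))

# Interpretation: how often is a shuffled difference at least as large as observed?
p_value = sum(abs(d) >= abs(observed) for d in simulated) / len(simulated)

print(observed, p_value)
```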

- Sample means
  - Determining a sample mean from the results of a simulation
  - Varying parameters in a simulation to demonstrate concepts related to the Central Limit Theorem
  - Using bootstrapping to produce confidence intervals

- Regression
  - Assessing correlation
  - Fitting a best fit line to a scatter plot
  - Using bootstrapping to produce a confidence interval on best fit line slope

- Conditional probability
- Other
  - Importing code modules or libraries

## Special Facilities and/or Equipment

2. A Kubernetes-based deployment of JupyterHub or similar and an assignment server that loads assignments into the students' environment.

3. Student access to a computer lab with the latest version of Anaconda or similar and an appropriate web browser installed.

4. Website or course management system with an assignment posting component and a forum component (where students can discuss course material and receive help from the instructor). This applies to all sections, including on-campus (i.e., face-to-face) offerings.

5. When taught via distance learning, the college will provide a fully functional and maintained course management system through which the instructor and students can interact.

6. When taught via distance learning, students must have currently existing email accounts and ongoing access to computers with internet capabilities.

## Method(s) of Evaluation

Tests and quizzes

Laboratory assignments and projects which include source code, sample runs, and documentation

Written homework

Final examination

## Method(s) of Instruction

Lectures which include data science concepts, example code, and analysis of data science examples

Online labs (for all sections, including those meeting face-to-face/on-campus), consisting of:

1. A programming assignment webpage located on a college-hosted course management system or other department-approved internet environment. Here, the students will review the specification of each programming assignment and submit their completed lab work.

2. A discussion webpage located on a college-hosted course management system or other department-approved internet environment. Here, students can request assistance from the instructor and interact publicly with other class members.

Detailed review of programming assignments which includes model solutions and specific comments on the student submissions

In-person or online discussion which engages students and instructor in an ongoing dialog pertaining to all aspects of designing, implementing, and analyzing programs

When the course is taught fully online:

1. Instructor-authored lecture materials, handouts, syllabus, assignments, tests, and other relevant course material will be delivered through a college-hosted course management system or other department-approved internet environment.

2. Additional instructional guidelines for this course are listed in the attached addendum of CS department online practices.

## Representative Text(s) and Other Materials

Adhikari, Ani, John DeNero, and David Wagner. __Computational and Inferential Thinking: The Foundations of Data Science__. 2022.

## Types and/or Examples of Required Reading, Writing, and Outside of Class Assignments

- Reading
  - Textbook assigned reading averaging 30 pages per week
  - Reading the supplied handouts and modules averaging 10 pages per week
  - Reading online resources as directed by instructor through links pertinent to programming
  - Reading library and reference material directed by instructor through course handouts

- Writing
  - Writing technical prose documentation that supports and describes the programs that are submitted for grades