COMP09016 2019 Introductory Programming for Data Science

General Details

Full Title
Introductory Programming for Data Science
Transcript Title
Introductory Programming for D
Code
COMP09016
Attendance
N/A %
Subject Area
COMP - 0613 Computer Science
Department
COEL - Computing & Electronic Eng
Level
09 - Level 9
Credit
05 - 05 Credits
Duration
Semester
Fee
Start Term
2019 - Full Academic Year 2019-20
End Term
9999 - The End of Time
Author(s)
Colm Davey, Saritha Unnikrishnan, Donny Hurley
Programme Membership
SG_KDATA_M09 201900 Master of Science in Data Science
SG_KDATA_M09 202000 Master of Science in Data Science
SG_KDATA_M09 202100 Master of Science in Computing (Data Science)
SG_KCOMP_N09 202300 Postgraduate Certificate in Computing (Data Science)
Description

Programming for Data Science introduces the learner to the core concepts of data science programming. The student will be introduced to the Python programming language and its SciPy ecosystem, and will employ functions to manipulate lists before implementing multi-dimensional arrays using NumPy or similar in order to perform statistical operations and solve linear equations. The student will then manipulate data frames and time-series data using pandas or similar. The student will learn techniques for reading in data from multiple sources, including querying APIs and scraping unstructured websites. Databases will be introduced with SQL programming: the student will create, modify and query a database through Python and prepare the query results within SciPy structures for future data analysis. The module assumes that the student has some experience with at least one programming language.

Learning Outcomes

On completion of this module the learner will/should be able to:

1. Create and manipulate vectors, matrices and n-dimensional tensors using a data science programming language

2. Employ appropriate functions to implement linear algebra and statistical procedures

3. Employ appropriate packages to create, read and manipulate tabular and time-series data

4. Evaluate and implement techniques to gather and store information from various unstructured data sources

5. Design and deploy database systems ensuring durability, high availability and high performance

6. Interrogate database systems using an appropriate querying language

7. Describe techniques to analyse the efficiency of algorithms and to compare the effectiveness of different algorithms

Teaching and Learning Strategies

Lectures for key topics.

Jupyter Workbooks to demonstrate Python code and to solve problems.

Module Assessment Strategies

Three assignments assess the learner on the three topics of Python, databases and algorithms. A final project requires the learner to create a solution that satisfies all LOs for a data-science-specific use case provided by the learner.

20% Python assignment – this will involve completing a Jupyter workbook. The workbook will test the student’s knowledge of basic Python (deploying methods) and of using SciPy and Pandas to read in files. The assignment consists of short questions to be answered in the workbook.

20% Database assignment – this will involve creating a database and then populating the tables from an external source. Queries will be designed for specific uses of the database 

10% Algorithm assignment – Run tests on different algorithms, analysing the results and commenting on the efficiency 

50% project to create a solution satisfying all LOs, for a data science specific use case, provided by the learner  

Repeat Assessments

The student must submit a new project and/or assignments at the next available time.

Indicative Syllabus

Python Introduction 

  • Review common Python functionality – show the differences from other programming languages and the “Pythonic” style of programming

  • Introduction to the Jupyter Notebook and the Spyder IDE for Python programming

SciPy 

  • Numpy and its data structures 

  • Matplotlib for simple data presentations and visualisations 

  • Linalg for manipulating vectors and matrices, implementing linear algebra techniques and solving problems such as computing eigenvalues/eigenvectors 

  • Stats package for generating summary statistics, calculating p-values and running statistical tests – these will show how to perform the tasks in the concurrently running Applied Probability and Statistics module 
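
The linear algebra and summary statistics topics above can be illustrated with a minimal sketch. It uses NumPy's own `numpy.linalg` routines rather than SciPy's `linalg` and `stats` packages, purely to keep the dependencies small; the matrix `A`, vector `b` and the `sample` values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 2x2 system A @ x = b, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve the linear system for x.
x = np.linalg.solve(A, b)

# Eigenvalues and eigenvectors; eigh is suited to symmetric matrices like A.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Basic summary statistics on a made-up sample.
sample = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])
mean = sample.mean()
std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

print("solution:", x)
print("eigenvalues:", eigenvalues)
print("mean:", mean, "sample std:", std)
```

Each column of `eigenvectors` pairs with the eigenvalue at the same position, so `A @ eigenvectors` should equal `eigenvectors * eigenvalues` up to rounding.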

Pandas 

  • Learn how to read data into DataFrame structures from different file sources such as CSV, JSON and XML

  • How to query these DataFrame structures 

  • The details about how such structures are indexed 

  • Merge DataFrames, generate summary tables 
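
As a rough illustration of the Pandas topics above, the following sketch reads CSV data into a DataFrame, queries it, merges it with a second DataFrame and builds a summary table. The column names and values are hypothetical, and an in-memory string stands in for a CSV file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,25.0\n"
    "2,11,40.0\n"
    "3,10,15.0\n"
)
orders = pd.read_csv(orders_csv)

# A second, hand-built DataFrame to merge against (made-up data).
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ada", "Grace"]})

# Query: keep only the rows whose amount exceeds 20.
large_orders = orders[orders["amount"] > 20]

# Merge: attach each customer's name to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Summary table: total amount per customer name.
totals = merged.groupby("name")["amount"].sum()

print(large_orders)
print(totals)
```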

Data Gathering 

  • Tools to query APIs 

  • Converting received data into a format that can be used directly with Pandas DataFrames

  • Using BeautifulSoup to scrape unstructured webpages

  • Techniques for ensuring consistency of data and for amalgamating data where appropriate

  • Group data into logical pieces and manipulate dates 
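
The conversion, consistency and date-handling steps above might look like the following sketch. It uses only the standard library; the JSON `payload` is a hypothetical API response (no real service is queried), and the field names `station`, `taken` and `temp_c` are assumptions made for the example.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical API response body; a real one would come from an HTTP request.
payload = """
{"results": [
  {"station": "A", "taken": "2019-09-30T23:00:00", "temp_c": "11.5"},
  {"station": "A", "taken": "2019-10-01T00:00:00", "temp_c": "10.9"},
  {"station": "B", "taken": "2019-10-01T00:00:00", "temp_c": null}
]}
"""

# Convert to flat, consistently typed records.
records = []
for item in json.loads(payload)["results"]:
    if item["temp_c"] is None:
        continue  # drop incomplete readings
    records.append({
        "station": item["station"],
        "taken": datetime.fromisoformat(item["taken"]),  # string -> datetime
        "temp_c": float(item["temp_c"]),                 # string -> float
    })

# Group the readings by calendar month.
by_month = defaultdict(list)
for rec in records:
    by_month[rec["taken"].strftime("%Y-%m")].append(rec["temp_c"])

print(dict(by_month))
```

A list of flat dictionaries such as `records` can be passed straight to the `pandas.DataFrame` constructor, which is one way to get received data into a DataFrame.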

Database 

  • Compare strengths/weaknesses of different Database Management Systems (DBMS) – SQLite, PostgreSQL, MySQL, SQL Server

  • Deploy a DBMS 

  • Create a simple database structure  

  • Learn basic SQL statements, then write and practise them hands-on against a live database

  • How to use string patterns and ranges to search data 

  • How to sort and group data in result sets 

  • Learn how to work with multiple tables in a relational database using join operations 

  • Using Python to connect to databases and then create tables, load data, query data using SQL and analyse data using Python 

  • Reading data directly into Pandas DataFrames or other appropriate data structures 

  • Insert results into the database from Python structures 
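
A minimal sketch of the Python-to-database workflow above, using the standard library's `sqlite3` module with an in-memory SQLite database. The `customer` and `purchase` tables and their contents are hypothetical; other DBMSs listed above would need their own driver and connection details.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a simple two-table structure (hypothetical schema).
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute("""
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL NOT NULL
    )
""")

# Load data from Python structures using parameterised inserts.
cur.executemany("INSERT INTO customer (id, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Alan"), (3, "Grace")])
cur.executemany("INSERT INTO purchase (customer_id, amount) VALUES (?, ?)",
                [(1, 25.0), (1, 15.0), (2, 40.0), (3, 5.0)])
conn.commit()

# String pattern search (LIKE) and range search (BETWEEN).
a_names = [row[0] for row in cur.execute(
    "SELECT name FROM customer WHERE name LIKE 'A%' ORDER BY name")]
mid_range = cur.execute(
    "SELECT COUNT(*) FROM purchase WHERE amount BETWEEN 10 AND 30").fetchone()[0]

# Join two tables, group the rows and sort the result set.
totals = cur.execute("""
    SELECT c.name, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC, c.name
""").fetchall()

conn.close()

print(a_names, mid_range, totals)
```

The `?` placeholders let the driver handle quoting, which is generally safer than building SQL strings by hand.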

Algorithms 

  • Discuss the need for efficiency for algorithms, particularly for Data Science 

  • Big O notation and how to compare algorithms 

  • Demonstrate how algorithms differ in speed and what this means for the potential real-time analysis of data
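
One way to compare algorithms, sketched here under simple assumptions, is to count the steps two search strategies take on the same sorted data. Linear search is O(n) and binary search is O(log n); the function names and the data size are chosen only for illustration, and step counts are used instead of wall-clock timing so the result does not depend on the machine.

```python
def linear_search(items, target):
    """Scan from the start; returns (index, steps). O(n) in the worst case."""
    steps = 0
    for i, value in enumerate(items):
        steps += 1
        if value == target:
            return i, steps
    return -1, steps


def binary_search(items, target):
    """Halve the search range each step; items must be sorted. O(log n)."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


# Worst case for linear search: the target is the last element.
data = list(range(1_000_000))
target = 999_999

lin_index, lin_steps = linear_search(data, target)
bin_index, bin_steps = binary_search(data, target)

print("linear search steps:", lin_steps)
print("binary search steps:", bin_steps)
```

On a million sorted items the linear scan inspects every element, while the binary search needs at most 20 probes.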

Coursework & Assessment Breakdown

Coursework & Continuous Assessment
100 %

Coursework Assessment

No. | Title | Type | Form | Percent | Week | Learning Outcomes Assessed
1 | Python Assignment | Coursework Assessment | Assignment | 20 % | Week 6 | 1,2,3,4
2 | Database Assignment | Coursework Assessment | Assignment | 20 % | Week 11 | 5,6
3 | Algorithm Assignment | Coursework Assessment | Assignment | 10 % | Week 13 | 7
4 | Project | Project | Individual Project | 50 % | End of Semester | 1,2,3,4,5,6

Full Time Mode Workload


Type | Location | Description | Hours | Frequency | Avg Workload
Lecture | Lecture Theatre | Lecture | 1 | Weekly | 1.00
Practical / Laboratory | Computer Laboratory | Practical | 2 | Weekly | 2.00
Independent Learning | Not Specified | Directed Learning | 4 | Weekly | 4.00
Total Full Time Average Weekly Learner Contact Time 3.00 Hours

Online Learning Mode Workload


Type | Location | Description | Hours | Frequency | Avg Workload
Online Lecture | Distance Learning Suite | Lecture | 1.5 | Weekly | 1.50
Independent Learning | Not Specified | Independent Learning | 5.5 | Weekly | 5.50
Total Online Learning Average Weekly Learner Contact Time 1.50 Hours

Required & Recommended Book List

Required Reading
2017-06-07 Python Data Science Handbook, Jake VanderPlas, 2017 Bukupedia
ISBN 1491912057 ISBN-13 9781491912058

This is a book about doing data science with Python, which immediately begs the question: what is data science? It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become. Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn't involve data?) or a simple buzzword that only exists to salt résumés and catch the eye of overzealous tech recruiters. In my mind, these critiques miss something important. Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. This cross-disciplinary piece is key: in my mind, the best existing definition of data science is illustrated by Drew Conway's Data Science Venn Diagram, first published on his blog in September 2010. While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say data science: it is fundamentally an interdisciplinary subject. Data science comprises three distinct and overlapping areas: the skills of a statistician who knows how to model and summarize datasets (which are growing ever larger); the skills of a computer scientist who can design and use algorithms to efficiently store, process, and visualize this data; and the domain expertise (what we might think of as classical training in a subject) necessary both to formulate the right questions and to put their answers in context. With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise. Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area.

Who Is This Book For?
In my teaching both at the University of Washington and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: how should I learn Python? The people asking are generally technically minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools. Most of these folks don't want to learn Python per se, but want to learn the language with the aim of using it as a tool for data-intensive and computational science. While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I've long been frustrated by the lack of a single good answer to this question; that is what inspired this book. The book is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks. Instead, it is meant to help Python users learn to use Python's data science stack (libraries such as IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related tools) to effectively store, manipulate, and gain insight from data.

Why Python?
Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: NumPy for manipulation of homogeneous array-based data, Pandas for manipulation of heterogeneous and labeled data, SciPy for common scientific computing tasks, Matplotlib for publication-quality visualizations, IPython for interactive execution and sharing of code, Scikit-Learn for machine learning, and many more tools that will be mentioned in the following pages. If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, A Whirlwind Tour of the Python Language. This short report provides a tour of the essential features of the Python language, aimed at data scientists who already are familiar with one or more other programming languages.

Python 2 Versus Python 3
This book uses the syntax of Python 3, which contains language enhancements that are not compatible with the 2.x series of Python. Though Python 3.0 was first released in 2008, adoption has been relatively slow, particularly in the scientific and web development communities. This is primarily because it took some time for many of the essential third-party packages and toolkits to be made compatible with the new language internals. Since early 2014, however, stable releases of the most important tools in the data science ecosystem have been fully compatible with both Python 2 and 3, and so this book will use the newer Python 3 syntax. However, the vast majority of code snippets in this book will also work without modification in Python 2: in cases where a Py2-incompatible syntax is used, I will make every effort to note it explicitly.

Outline of This Book
Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python data science story.

  • IPython and Jupyter (Chapter 1): These packages provide the computational environment in which many Python-using data scientists work.

  • NumPy (Chapter 2): This library provides the ndarray object for efficient storage and manipulation of dense data arrays in Python.

  • Pandas (Chapter 3): This library provides the DataFrame object for efficient storage and manipulation of labeled/columnar data in Python.

  • Matplotlib (Chapter 4): This library provides capabilities for a flexible range of data visualizations in Python.

Required Reading
2017-10 Python for Data Analysis O'Reilly Media
ISBN 1491957662 ISBN-13 9781491957660

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, IPython, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

  • Use the IPython shell and Jupyter notebook for exploratory computing

  • Learn basic and advanced features in NumPy (Numerical Python)

  • Get started with data analysis tools in the pandas library

  • Use flexible tools to load, clean, transform, merge, and reshape data

  • Create informative visualizations with matplotlib

  • Apply the pandas groupby facility to slice, dice, and summarize datasets

  • Analyze and manipulate regular and irregular time series data

  • Learn how to solve real-world data analysis problems with thorough, detailed examples

Required Reading
2017-08-29 Practical SQL
ISBN 1593278276 ISBN-13 9781593278274

Practical SQL is an approachable and fast-paced guide to SQL (Structured Query Language), the standard programming language for defining, organizing, and exploring data in relational databases. The book focuses on using SQL to find the story your data tells, with the popular open-source database PostgreSQL and the pgAdmin interface as its primary tools. You'll first cover the fundamentals of databases and the SQL language, then build skills by analyzing data from the U.S. Census and other federal and state government agencies. With exercises and real-world examples in each chapter, this book will teach even those who have never programmed before all the tools necessary to build powerful databases and access information quickly and efficiently. You'll learn how to:

  • Create databases and related tables using your own data

  • Define the right data types for your information

  • Aggregate, sort, and filter data to find patterns

  • Use basic math and advanced statistical functions

  • Identify errors in data and clean them up

  • Import and export data using delimited text files

  • Write queries for geographic information systems (GIS)

  • Create advanced queries and automate tasks

Learning SQL doesn't have to be dry and complicated. Practical SQL delivers clear examples with an easy-to-follow approach to teach you the tools you need to build and manage your own databases. This book uses PostgreSQL, but the SQL syntax is applicable to many database applications, including Microsoft SQL Server and MySQL.

Required Reading
2003 PostgreSQL Sams Publishing
ISBN 0735712573 ISBN-13 9780735712577

"PostgreSQL" leads users through the internals of an open-source database. Throughout the book are explanations of data structures and algorithms, each backed by a concrete example from the actual source code. Each section contains information about performance implications, debugging techniques, and pointers to more information (on the Web and in book form).

Module Resources