Text | Details | Genomics | Topics | Your grade | Computer Science Home

Instructor: Mark LeBlanc
Mark's Web Page -- Email
Office: Science Center[B103]
Office Hours: by appt. or
    Mon 11:30-12:30, 3:30-4:30
    Wed 11:30-12:30
Phone: 286-3970
Class Meeting Times:
    MWF 9:30-10:20am, A118
Lab: Wed 3:30-5:20, csLab

" ... then (the algorithms of today) will become 'simple' problems and a new generation of challenges, which we can now only barely imagine, will take their place on the frontier of what it is possible to do with computers." Aho and Ullman, Foundations of Computer Science, p95.

Texts: Data Structures and Algorithm in C++, by Michael T. Goodrich, Roberto Tamassia, and David Mount. Algorithms Text John Wiley and Sons, 2004. Required. An excellent introduction to the importance of and rigor in algorithm analysis.

Applications Programming in ANSI C (3rd Ed.) by Johnsonbaugh and Kalin. ANSI C text Prentice Hall, 1996. Required ... because everyone needs a good reference on C and this is a very good book; you'll pull it down from the shelf a lot

Selected Theme Readings in Genomics:

Krane, Dan E. and Raymer, Michael L. (2003). Krane and Raymer textFundamental Concepts of Bioinformatics. Benjamin-Cummins, Boston, MA.

Developing Bioinformatics Computing Skills, OReilly text Cynthia Gibas & Per Jambeck. O'Reilly, 2001. Ch. 1, 2, and 7 especially good.

Goodman, Nat. (2002). View from the waterworks: the world of NCBI. Genome Technology, No. 23, July 2002, p 74-79.

Other good books in the Library and/or in the csLab include:

Introduction to Algorithms, (2nd edition) by Cormen, Leiserson, Rivest, and Stein. CLR text cover will go here ... McGraw Hill, 2001. (Highly recommended when you need more depth in a particular topic).

The Algorithm Design Manual, by Steven S. Skiena, Skiena text cover will go here ... Springer-Verlag, 1998. An excellent starting point for a wide range of problems.

Content of the course: An introduction to the mathematical foundations, design, implementation and computational analysis of fundamental algorithms. Problems include heuristic searching, sorting, several graph theory problems, string matching, and the theoretical expression of their orders of growth. Out-of-class assignments and hands-on labs emphasize the balance between theoretical hypotheses (that is you figure out on a napkin how long an algorithm will take to run ) and experimental verification (if your napkin math says the algorithm will finish in a reasonable time, then you implement a program to actually run an experiment).

Theme: The Fall 2003 theme throughout the course is Genomics. You grapple with what it means computationally to have (nearly complete) sequenced genomes available on the web, including human's 3 billion base pairs. The DNA alphabet of base pairs (letters) is made up of only four choices: A, G, T, or C (the abbreviations for their chemical equivalents, Adenine, Guanine, Thymine, and Cytocine). In short, genomics is "computer science meets biology" or "biology meets Big Oh." For the purposes of computing, a DNA molecule is a word (a string of characters) where the characters are taken only from this four-letter alphabet (A,C,G,T), for example the string: or 5-mer "ACGTC". Strings, strings, strings -- DNA, DNA, DNA. (Actually, we could also use protein-coding strings of amino acids, but we will stick with DNA this semester). In particular, we will focus on (i) string matching algorithms (finding all occurrences of "ACGTC" in the DNA sequence of C.elegans or some other organism), (ii) the representation of strings in a suffix tree data structure, and (iii) state-of-the-art sequence databases and their search algorithms (e.g., BLAST) and their potential uses. We are especially interested in addressing some of the "open search questions" which will require computational methods, e.g., dynamic programming to find inverted repeats that do not base-pair-match exactly (irregular palindromes) in sequences of DNA. In particular, you will do four labs and an individual final project on problems in genomics and/or evolutionary trees. Final projects include the potential for you as an undergraduate to ask novel research questions as you search DNA sequences from the human or other genomes.

The sequencing of genomes (e.g., files filled with megabytes of A,C,G,T's) provides a new and rich source of data that is screaming for programmers to help make sense of it all. Said differently, the biologists know that they can not do the work alone: they need you! And software developers rely on the expert scientist(s). We will use the expertise of local biology faculty doing research in this area. They will serve as the "lead scientist" and help us define problems you might work on. The intent is for some of your homework to be research.

    Designing and implementing algorithms to search genomes is like standing at the bottom of a mine with a pick axe: "There's a lot of mining to be done; might as well start here." Some will find gold.

This course does not require previous experience in college-level biology, although it is clear that a number of students do come with some background. Our genome expertise and guidance will come from Dr. Betsey Dexter Dyer, Professor of Biology here at Wheaton College. Some labs and your final project will be true research. Your individual final project has the potential to ask an original research question.

Curriculum: This course is one of four core courses that are required for a computer science major or minor at Wheaton. In many ways, Algorithms is the first course that challenges students with the sometimes subtle rigor of the discipline. It seems that everyone loves computers today, but the requirements for reliable and efficient software go far beyond the glitz of the web. The sections in an Algorithms text, for example: Big Oh notation, growth rates, combinatorial search, heuristic methods and intractable problems are themes which emerge throughout most subdisciplines of computer science.

As always ...

In computer science, if you are almost correct you are a liability.
Fred Kollett (1941-1997), Professor of Math and CS, Wheaton College

Exact pages to read are assigned in lecture and lab.
Reading Topic
** ANSI C text A taste of C...
Porting your code to other platforms: CodeWarrior(MacOS), Visual C++(Win), gcc (RedHat Linux). Intro to input/output in C.
** you program a0: Quick C review
due Friday, Sept. 05
********** Ch. 3 Big Oh and napkin math ...
because f(n)=O(g(n)) matters

you program

Ch. 1

a1: FindAllPrimes()
due Mon., Sept. 15

On to C++ ...

* selected reading:
"View from the waterworks: the world of NCBI." Genome Technology, July 2002.
Mon., Sept 15
Prof. Betsey Dyer (Biology) guest lecture: Genomics 101 (with a live Julia-Child-like demo of separating DNA)
** Genomics Lab1
BLASTing the 16S RNA Genes
Wed, Sept. 17

notes on string searching

Ch. 11

Search -- the human genome is 3 billion letters ... no time to waste! Knuth-Morris-Pratt (KMP) algorithm.
**** you program a2: eLmer Jr.
Searching for strings of DNA

due Mon. Sept. 22
**** Genomics Lab2
Tweeking eLmer Jr.
Wed, Sept. 24
********** selected readings in text Recursion, recurrence relations, induction (oh yeah, prove it)
****** you program; recurrence relations and closed-form revisited a3: isPal() - Finding Inverted Repeats
due Wed., Oct 22
****** Krane and Raymer (2003), Ch. 2 Dynamic Programming; Needleman-Wunsch and Smith-Waterman algorithms
** Guest Lecture Friday, Oct. 17
Prof. Tommy Ratliff (Mathematics) on "Finding Roots of Equations"
******** Ch. 10 Sorting (your socks et al.)
**** you program a4: Experiments in Sorting
due Fri., Oct. 31
**** Genomics Lab3
Specifications for Research Project on the Genomics of Olfaction
Fri, Oct. 31
******* Ch. 12 Graph Data Structures
***** Ch. 12 Breadth-first, depth-first search
************ Notes Adversarial search in games -- check mate dude
**** Genomics Lab4
Briefing to coordinate your experiment, results, and presentation
Mon., Nov. 24
*********** Ch. 6 Minimum Spanning Tree
*********** Ch. 12 Shortest Path -- I got here first
*********** you program Final Project in Genomics
oral presentation of results:
Sunday, Dec. 07 in the evening
**** you program a5: King Saul Checkers
due Fri., Dec. 12
The exact pages for you to read in your text and/or on reserve will be assigned in lecture.

Your Grade:
Things to do Grading Percents Frequency
4 Genomics Labs 5% Sept. 17, Sept. 24, Oct. 31, Nov. 24
Homeworks 15% TBA
6 Programs 40% (see dates above)
2 Exams 20% Wed., Oct. 08 and Wed., Nov. 07
Final project 10% Oral Presentation of results: Sunday evening, Dec. 07
Final Exam 10% TBA

Late Submissions:
Due is due. Always turn in whatever you have on time. Something turned in on time is much better thanreceiving no credit because it is late. If the program is "due" by Tuesday, 5am, I reserve the right to turn OFF the dropbin, that is, you will not be allowed to submit your source code after that point. If your code does not compile/run/work right, submit what you have and tell me in your README file. So let us agree together: all assignments are due as stated. (Exceptions granted for serious reasons). If your professor is your boss (and he is of sorts) and your boss wants it when she said she wants it, then you submit what you have and explain what it does and does not do in your README file. Late is not an option. (Good, glad we can all agree with this).

Honor Code Revisited: It goes without saying that all submitted work will be the student's own, in keeping with the Wheaton Honor Code. Use discretion; don't ask your colleague for "the" answer or for a section of code. However, I do encourage you to discuss the problem in general, such as the type of data structure one might use or to find a syntax error. For homework, your answers and code must be your own from beginning to end. On the exams, you may not get help from anyone but the instructor.

(0)It is expected that you spend at least 3 hours on reading and practice problems for every 90 minutes of lecture. This computes to at least 6 hours of work in the text per week. This should be done throughout the semester and not just when studying for exams. The material is cumulative in a big way; for example, week 5 depends heavily on weeks 1 through 4.
(1) It is expected that you spend at least 6-10 hours per week on your current programming assignment. WARNING: Programmers typically underestimate the time it takes to complete a software project; 6-10 hours per week on your programming assignment may be one of those "underestimations."

There are two exams during the semester and a final exam during finals week. There will be no makeups, nor will the lowest exam grade be dropped. If you are an athlete and/or you have a conflict with an exam date, please see me within the first week of classes.

There will be four labs in genomics. In Lab1 you will use the most common and one of the highest powered web-based genomics tools: BLAST. BLAST helps you "blast" your way through multiple genomes, for example, to see just how alike the DNA sequences within genes from one genome (e.g., a bacteria) match the DNA sequences in genes from other organisms (e.g., a worm, a human, etc). After completing your third program (eLmer Jr. -- a program to find DNA words of length L), Lab2 will be a demo of your program to a biologist and the biologist will suggest some possible "tweaks" to your program, e.g., how to alter eLmer to look for other things. In Lab3, we will discuss the specification for your final projects. Lab4 will serve as a briefing where you can report on the status and your plans for your final project, experiment, and oral presentation.

You will work on one "large" problem in genomics. The biological significance of the problem will be explained by a biology expert; however, the software will be entirely up to you. More specifically, you will work on the "Genomics of Olfaction" where you will design and implement an experiment to find locate groups of DNA sequences within genes which have the highest summation of hydrophobic values. Many more details to follow. An oral presentation of your results (as if to a large audience at a conference) and documented software are included in the requirements.

I have listed my office hours on the syllabus. But you know I'm always near a keyboard! Don't let this material bury you. Study, study, study and talk about it with me and others as often as you can. Homeworks and programming assignments can be especially challenging. Visit me; ask questions!

Please don't wait too long before you see me;
a quick chat in my office can often clear things up.
I'm here a lot

    Maintained by: Mark LeBlanc
    Dept of Math & Computer Science
    Wheaton College, Norton, Massachusetts