A note about the video: I usually use the whiteboard when teaching and I write much neater than I do in the video. However, I had to use a tablet for video recording purposes, and it’s really hard to write on a touch screen!
After looking through the archives of my computer, I calculated that I wrote 197,717 lines of code (141,008 insertions, 56,709 deletions) for my classes in the past four years! In fact, I believe this is actually an underestimate. In this post, I’ll talk about how I calculated this number and the assumptions I made along the way.
Fortunately, except for freshman year, I used git for all my projects and assignments. As a result, I can compute the number of lines of code I wrote by examining my git log history. Using git made calculating my contributions on group projects especially easy, as git keeps track of who authored each commit. Furthermore, by examining the insertions and deletions of each commit, as opposed to the line count at the HEAD of the branch, I get a more accurate picture of the lines I wrote and deleted over time, rather than just the final count at the end of the assignment/project.
For each git repository, I did the following:
Determine which commits were authored by me, using the following command: git log --author=Kenny --oneline.
For each of those commits, examine the insertion/deletion count for that commit, broken down by file: git show COMMIT --oneline --numstat.
Aggregate those stats by file name across all the commits in a repository, then aggregate those stats by file type.
Do this across all the git repositories for all the courses I took.
If you want to examine your own stats across a repository, you can check out my python counting script. Here’s a snippet of the relevant code performing the steps above.
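In rough outline, the counting logic looks something like this (a simplified sketch, not the original script; the author name and ignored extensions are placeholders):

```python
import os
import subprocess
from collections import defaultdict

def commits_by_author(repo, author):
    """List the hashes of commits authored by `author` in `repo`."""
    out = subprocess.check_output(
        ["git", "-C", repo, "log", "--author=" + author, "--format=%H"],
        text=True)
    return out.split()

def parse_numstat(numstat_output):
    """Parse `git show --numstat --format=` output into
    (insertions, deletions, path) tuples, skipping binary files."""
    entries = []
    for line in numstat_output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        if added == "-":  # git prints "-" for binary files
            continue
        entries.append((int(added), int(deleted), path))
    return entries

def aggregate_by_extension(entries, ignored=(".png", ".csv", ".o")):
    """Sum insertions/deletions per file extension, ignoring some types."""
    stats = defaultdict(lambda: [0, 0])
    for added, deleted, path in entries:
        ext = os.path.splitext(path)[1]
        if ext in ignored:
            continue
        stats[ext][0] += added
        stats[ext][1] += deleted
    return dict(stats)
```

Running git show COMMIT --numstat --format= for each commit returned by commits_by_author, then feeding the output through the last two functions, reproduces the aggregation steps above.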
I followed these rules when calculating my total line count.
Ignore autogenerated, binary, or raw data files when calculating a line count.
When I performed the count on my various git repositories, I noticed that I was not always very disciplined about what I checked into the repository. Often, courses wanted us to submit files that were autogenerated, and so I checked autogenerated files, binaries, images, csvs, etc. into the repository. Thus, I added an option to my script to ignore certain file extensions so that I do not get an inflated line count. Furthermore, I took the notion of “code” to mean anything that was not autogenerated, so I included file types like .tex, .txt, .md, and README as code (as long as they were not autogenerated).
Prefer underestimates over overestimates. I did not know how to use git freshman year, and as a result, I cannot get an accurate count for the number of lines of code for CS50, CS51, or CS179. For those classes, I only included code that I entirely wrote myself, which excludes modifications to distribution code from a problem set and code I wrote as part of team projects. For CS51, I had the original tarball distribution for some of the assignments, so I was able to diff the distribution code with my final assignment submission. As a result of this exclusion, the line count for these three courses is much smaller than it actually is.
Only include code written for assignments. Thus, I did not count lines of code written for section or lines of code I wrote while TFing these past 3 years.
For reference, here’s the list of the computer science classes I’ve taken and when I took them (and possible links to related blog posts).
It’s not a surprise to me that I have written more C code than any other language in college, but it surprised me that python was a close second. However, I realized that I now use python as my go-to language for prototyping and data analysis, and I’ve used python in more classes (CS283, CS181, CS109, both CS91r’s, ES50, CS261) than any other language, compared to C (CS61, CS161, CS165). I was surprised the java count was so high, as I have used java mainly in internships. I realized that this came from CS124 (I used java for Mitzenmacher’s programming assignments) and from CS262 (you gotta use java if you’re in Jim Waldo’s class!). Furthermore, I was shocked that the ocaml (.ml) line count was so low, as I felt like I wrote much more code when taking CS153 (Compilers). However, I haven’t written ocaml code for a class since compilers, and so this count makes sense.
I was a bit surprised at first that the theoretical classes (CS124, CS121) had such a high line count, until I realized it was mostly due to .tex files.
CS161 is often considered the most difficult and time-consuming class at Harvard, and so I thought that CS161 would probably have the highest line count. I was surprised that CS181 and CS165 beat the CS161 count. I believe that because there was no distribution code for CS165 (Data Systems), I had to write a lot more (but less interesting) code to make all the glue for my database. For CS181, the course and assignments were so disorganized the year I took it that there were frequent large commits that were mostly overhauls and rewrites.
When examining the line count by semester, my spring semesters have a much higher line count than my fall semesters, and my Spring 2013 semester has the highest count (not a surprise: I was taking CS161 and CS181 at the time).
In the end, I calculated that I made 197,717 changes (141,008 insertions, 56,709 deletions) over the past four years. This number is probably an underestimate, but I assume it’s in the ballpark of the true number of lines of code I’ve written in college. This makes me appreciate just how much one can learn and do in four years!
This past semester, I took ES50, Harvard’s introductory course in electrical engineering. For our final project, my group decided to make a voice-activated drink mixer! I was in charge of the coding component of the project.
The code is available on github. To do the speech recognition, I used the Chrome Speech API. Once I have the transcribed text, I send it to a local server, which figures out the drink that was ordered and computes how long each bottle should be opened. The server sends these times to the attached Arduino, which then sends current to the appropriate solenoid valves for the designated times. When activated, the solenoid valves allow liquid to flow through.
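As a rough illustration of the server’s job (the recipes, bottle-to-valve mapping, and function names below are made up for this sketch; the post doesn’t show the actual code):

```python
# Hypothetical drink recipes: seconds to hold each bottle's valve open.
RECIPES = {
    "rum and coke": {"rum": 2.0, "coke": 4.0},
    "screwdriver":  {"vodka": 2.0, "orange juice": 4.5},
}

# Hypothetical wiring: which Arduino-controlled valve each bottle uses.
VALVE_FOR_BOTTLE = {"rum": 0, "coke": 1, "vodka": 2, "orange juice": 3}

def parse_order(transcript):
    """Match the transcribed speech against the known drink names."""
    text = transcript.lower()
    for drink, recipe in RECIPES.items():
        if drink in text:
            return drink, recipe
    return None, None

def valve_times(recipe):
    """Translate a recipe into (valve number, seconds) pairs to send
    to the Arduino, which holds each valve open for that long."""
    return sorted((VALVE_FOR_BOTTLE[b], secs) for b, secs in recipe.items())
```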
At Harvard, a “Teaching Fellow” is the equivalent of a teaching assistant at most other universities. Technically, I’m a “Course Assistant,” as the title “Teaching Fellow” is reserved for graduate students, but in many of the classes that I’ve taught, the undergraduates have the same (if not more) responsibilities as the graduate students. These typically include teaching section, holding office hours, and grading.
Over the past three years, I’ve attempted to have an impact in all of the classes I’ve taught, and hopefully that impact will last after I graduate. Also, I’ve learned several lessons about teaching computer science classes, and I have advice for current or future undergraduates considering teaching.
Below is a list of some of the ways I’ve contributed to the courses that I’ve taught:
CS50 Section Notes - Before CS50 standardized the section notes, many of the teaching fellows prepared their own material for section (based on a set of example material from previous years). I took this as an opportunity to create material that I wish there had been more of when I took the class: more interactive coding labs, and fun technical-interview-esque problems for students with extra time and interest. The end results include coding labs teaching students file I/O (reading/writing pokemon structs from/to disk), implementing essential data structures, building a pokedex (an end-to-end web application with a mysql backend), autocompletion (how to perform asynchronous http requests), and many brain-teaser coding questions.
CS51 Moogle: 2-3 Trees - CS51 is one of the few courses at Harvard that uses OCaml as the core programming language. The class is famous for its moogle problem set: at the end of the assignment, students have a working web crawler that can index a graph of web pages and then rank them with different ranking algorithms (e.g., PageRank). The goal of the assignment is to teach students abstraction and modularization while implementing sets, maps, and rankers in different ways. My first time teaching the class, I was tasked with writing a new portion of the assignment: having students implement balanced trees with 2-3 trees. Implementing the 2-3 trees was definitely a non-trivial task, but what made it more difficult was structuring the code so that (1) it would be instructive for students, who would need to understand and modify the code to implement the 2-3 trees, and (2) the code would allow for proper unit testing when the course staff later grades the student submissions. Two years later, the course is still using my code in that assignment!
CS51 Object Oriented Programming Notes - When I took CS51, the course was still taught in both OCaml and Java. In my first year teaching the class, the professor decided to axe the Java portion of the class because it was too difficult to introduce object oriented programming concepts while having the students pick up an entirely new programming language in only a few weeks. As a result, the course staff decided to teach OO using the OO side of ocaml, and I was tasked with writing the section notes for this material. It was a learning experience to figure out how to introduce so many new terms (objects, classes, subclassing, inheritance, interfaces, methods, overriding, polymorphism, subtyping, …) in one section without overwhelming the students. Two years later, the course is still using these section notes!
CS61 - When I taught CS61 for the first time, it was also Eddie Kohler’s first time as the instructor for the course. He taught the course differently from previous years and emphasized different concepts, and as a result, the course required a whole new set of section notes. Furthermore, the course had a much smaller staff than the previous classes I had taught, so I ended up contributing to and writing many of the section notes for that year. I took that as an opportunity to present the course material in a different light from the way it was presented in lecture, hopefully giving confused students a clearer picture of course concepts. In my second time teaching the class, I was one of the few returning course staff from the previous year, and I felt honored that Eddie valued my opinion on what I thought were the good and bad parts of the previous year.
CS161 Synchronization Problems - At the beginning of this semester, I was tasked with writing the problems for the synchronization problem set. I tried to phrase the problems in an amusing and instructive way, and hopefully they will be used again in future offerings of the class.
In summary, an undergraduate teaching fellow can have a huge and lasting impact on a course, including coming up with new assignments, writing new section notes, or directing the overall direction of a course.
After teaching so many different classes and so many different students over the past three years, I’ve learned a few lessons:
Students can achieve more than they think they can. I remember before teaching my very first section for CS50, I was told that my section was a “More Comfortable” section. In CS50, students are placed into sections based on how they self-identify into the buckets “Less Comfortable,” “More Comfortable,” and “Somewhere In Between.” As a result, I spent that week preparing material I thought would be appropriate for more advanced students (including code labs and fun brain-teaser technical interview questions). When I stepped into the very first section and double-checked with students that this was a more comfortable section, they all gave me grim stares of horror and told me that the section was actually “Somewhere In Between.” It turns out the head TFs for the class had accidentally given me incorrect information about my section. However, I still taught the section as if it were more comfortable: I kept the code labs and brain-teaser coding questions, and I tried to be as clear and instructive as I could in my slides and explanations of course concepts. At the end of the semester, many of the students did very well in the class and thought section was taught at an appropriate, if not slow, pace. As a result, I learned that students can undervalue their abilities: the students regarded themselves as “somewhere in between,” yet they performed just as well as the “more comfortable” students.
Make section relevant and useful for the students who attend. In all of the courses that I’ve taught, section was optional but highly encouraged. I’ve been in many classes where required section felt pointless, or attended optional sections only to find them unhelpful, which discouraged me from attending future sections. As a result, I highly value students’ time when I have the privilege of their attending my section, and I want to make every section helpful and useful for the students who choose to attend. To do this, for CS50, CS51, and CS61, I would always email a short anonymous feedback survey to students in my section to see what they thought was good and bad, and what they wanted to cover the following week. I took this feedback to heart when planning material for sections, and as a result, my sections consistently had near 100% (and for CS61 my first time, over 100%) attendance, when many sections taught by other course staff had a lower attendance rate.
Effective teaching requires planning, planning, planning. Before every section, I would plan out the agenda, making sure every concept transitioned smoothly to the next and that I had clear explanations and guiding questions to motivate the material. Often, planning took longer than the section itself. I learned very quickly that I was better and more comfortable teaching at the whiteboard/chalkboard than with a slideshow. Using the board allowed for more interactivity with students, and it made it easier to draw diagrams. Also, writing things down on the board gives students time to pause and think, whereas it is often difficult for students to read the text on slides while simultaneously listening to what the instructor has to say. As a result, I filled notebooks with notes on how I would present the material in section, carefully planning my boardwork, how to make the most effective use of the board, and the diagrams I would use to explain the concepts. From this experience, I learned to appreciate and admire the planning teachers have to do in preparation for classes, and I also learned that I greatly enjoy the lesson-planning part of teaching.
The most valuable thing you can do for students at office hours is to teach them how to discover the answers themselves. One of the things I’ve learned in my four years as an undergraduate is how to go about searching for an answer to a question–often involving googling, experimenting at the command line, and code reading. As I’ve moved on to higher-level and more difficult courses, the thing I notice most about the more advanced students is their ability to independently acknowledge what they don’t know, and then take the initiative to go about searching for the answer themselves. My experience with CS50 office hours typically involved conversations of the form: “Student: Things don’t work, can you fix it? Me: What have you tried? Student: Not much.” The student would then sit with me until the problem was resolved. At office hours for CS161, the conversations are typically of the form: “Student: Things don’t work, do you have any ideas why? Me: What have you tried? Student: gdb, grep, find, binary-searching the problem…” As a result, I realize that what makes students more “advanced” is their ability to self-diagnose their own problems and take the initiative to resolve them. Thus, my philosophy for office hours is to emphasize teaching students the tools to go about solving a problem instead of telling them the answer directly. One of my students remarked on this philosophy in a comment in the Q guide, stating “Kenny has tough love at office hours.”
Grading is very difficult. For me, grading is typically the most difficult and time-consuming part of being a TF. Automated testing for correctness is not enough, as students typically (and rightfully) want in-depth feedback on how they can improve. As a result, much of my experience grading as a TF has been learning how to give appropriate and useful feedback, and I still have much to learn in this area.
Professors are people too. When you’re taking a class and spending many all-nighters on a problem set, it’s easy to assume professors are monsters and forget that they are people too, with their own lives, families, and goals. After working with four different instructors on their course staffs, I see courses from the course staff’s point of view, and I am beginning to understand why professors structure their courses the way they do, and how much they do in fact care about their courses and students, even if it doesn’t seem that way when you’re up coding late into the early morning.
After all the lessons I’ve learned and work I’ve put into teaching, I highly encourage other undergraduates to consider teaching as well for the following reasons:
Teaching is a great opportunity to get to know professors. For large lecture classes (typically the intro courses), it can be very difficult to get an opportunity to talk one-on-one with professors and have them know who you are. When you are on the course staff, the instructor personally relies on you and the other course staff to run the course. You get the rare opportunity to work with them and get to know them on a more personal level.
You don’t really understand the material until you have to teach it to someone else. Teaching is a great opportunity to review and solidify your understanding of the course material, and in my experience, when I teach I always learn something that I didn’t know when I first learned the material.
You get to see how a course is run and control the direction of a course. When you’re on the course staff, you see and run everything: the infrastructure for distributing and receiving student submissions, the scripts and tools used for grading, the discussions for deciding what to cover in the next lecture, section or assignment, and more. As a result, course staff can typically have a large impact on a course, including coming up with new assignments, section material, or guiding the direction of the course material.
You become a mentor figure for underclassmen. I still remember the legendary TFs I’ve had and how I admired and wanted to emulate them. When you teach, you often become a mentor for students entering the concentration, and you can have a large influence on the courses they choose and how they progress through their time at Harvard within the concentration.
Of course, there are downsides as well for being a teaching fellow:
Office hours are 24/7, even for classes you’re not teaching. For large classes, you often have many friends in the class. As a result, friends will direct their questions to you in person, through instant message, and through many other means, even when you’re not holding office hours. You’ll also get questions for classes that you’re not teaching. It can be difficult to draw the boundary between being a helpful friend and being a teaching fellow.
It is time-consuming. This semester, I’ve probably spent more time on CS161-related work than on any of my actual courses. Office hours, teaching, preparing for section, grading, and attending lecture (for the hybrid classroom) really add up. In my opinion, it’s like taking a fifth class.
After three years of teaching, 4 different classes, 3 Certificates of Excellence in Teaching, over 40 sections taught, and nearly 100 students I’ve had the privilege of teaching, I’ve seen some of my own students become teaching fellows for the same classes or other classes (I’m a grand-TF, haha), and I like to think I influenced their decision in some way. I’ve also had many underclassmen in my section ask for computer science advice, and I’ve now seen them advance through multiple classes within the concentration.
I want to personally thank David, Greg, Eddie, and Margo for giving me the opportunity to work with them and teach: teaching has definitely influenced my undergraduate experience in a significant and positive way, and college would not have been the same without it.
Being a teaching fellow has been an important experience for me in terms of selfdiscovery: I learned that I really like to teach and plan lessons, and this makes me want to pursue some teachingrelated work in my future.
I remember how fun these problems were last year (forming little fellowships of the ring and piazza posts, meant to mimic creating barriers and reader-writer locks), and I wanted to make sure the problems were just as fun this year.
I was tasked specifically to write problems to mimic the synchronization one would use to implement waitpid()/exit() (how would you do it?) and the synchronization needed between address spaces and the coremap when implementing a virtual memory system in the third assignment. Given these specifications, I came up with the Singing Cows and Hunger Deletion Games synchprobs!
To keep up with the playful spirit of the problems, I disguised the waitpid/exit problem as the Singing Cows Problem: a daddy cow must wait until each baby cow finishes singing “Call Me Maybe” before the daddy cow can congratulate the baby! The final version of the problem eventually mimicked wait() instead, essentially making the daddy cow wait until any baby cow finishes singing.
I had just watched Hunger Games: Catching Fire, and this was my inspiration for the second problem: the Hunger Deletion Games. In this problem, Katniss and Peeta each have multiple threads and are attempting to sever mappings between the districts and the capitol (for the sake of the problem, assume there are NSLOTS districts). These mappings are represented by a bijection between capitol slots and district slots. The catch in this problem, however, is that Katniss and Peeta are concurrently deleting from opposite sides (Katniss from the capitol side and Peeta from the district side), so students must avoid both race conditions (concurrent deletions of the same slot) and deadlock (concurrent deletions of the same mapping from opposite sides). This situation mimics the coremap/address-space situation in which threads handling a page fault need to access a page table entry and then a coremap entry, while a cleaner thread simultaneously needs to access a coremap entry and then the corresponding page table entry. I remember it took me several weeks last year to fully understand the synchronization needed for this coremap/address-space situation, and I was curious to see what kinds of solutions students came up with. How would you solve this problem?
To see the source code for the problems and scripts to check the solutions, see the github repo.
The problem statements are shown below. Correct implementations should avoid big lock solutions, and should not allow race conditions, deadlocks, and starvation.
A cow has many children. Each baby cow puts on a performance by singing lyrics to “Call Me Maybe.” Like a good parent, the daddy cow must sit through each one of its baby cow’s performances until the end, in order to say “Congratulations Baby N!” where N corresponds to the Nth baby cow.
At any given moment, there is a single parent cow and possibly multiple baby cows singing. The parent cow is not allowed to congratulate a baby cow until that baby cow has finished singing. Your solution CANNOT wait for ALL the cows to finish before starting to congratulate the babies.
Here is an example of correct looking output:
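One way to structure this synchronization (a sketch using POSIX threads rather than the OS/161 primitives the assignment actually uses; the singing itself is elided):

```c
#include <pthread.h>
#include <stdio.h>

#define NBABIES 10

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  baby_done = PTHREAD_COND_INITIALIZER;
static int finished[NBABIES];   /* babies done singing, not yet congratulated */
static int nfinished = 0;

/* Baby cow thread: sing, then announce completion and wake the parent. */
static void *baby(void *arg) {
    int n = (int)(long)arg;
    /* ... sing "Call Me Maybe" ... */
    pthread_mutex_lock(&lock);
    finished[nfinished++] = n;
    pthread_cond_signal(&baby_done);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Parent cow: congratulate each baby as soon as *any* baby finishes,
 * never waiting for all of them before starting. */
static void parent(void) {
    for (int done = 0; done < NBABIES; done++) {
        pthread_mutex_lock(&lock);
        while (nfinished == 0)
            pthread_cond_wait(&baby_done, &lock);
        int n = finished[--nfinished];
        pthread_mutex_unlock(&lock);
        printf("Congratulations Baby %d!\n", n);
    }
}
```

The condition variable lets the parent sleep until some baby, any baby, announces it is done, which matches the wait()-style semantics described above.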
Katniss and Peeta are tired of the Hunger Games and want to play a new kind of game instead, the Deletion Games! They want to sever all ties between the Capitol and all of its districts (for the sake of this problem, assume that there are actually NSLOTS districts). Katniss is severing ties from the Capitol side, and Peeta is severing ties from the Districts’ side.
There is a 1:1 correspondence between capitol_slots and district_slots. This means that each slot in capitol_slots has exactly one corresponding entry in district_slots, and each slot in district_slots has exactly one corresponding entry in capitol_slots. More formally:
Katniss and Peeta will each use NTHREADS threads to delete these mappings. Katniss will delete mappings based on randomly generated capitol indices, and Peeta will delete mappings based on randomly generated district indices.
For example, suppose Katniss randomly chooses capitol index 4 to delete. She looks at capital slot 4, sees that the slot is still mapped, and finds the corresponding district index is 12. Then Katniss will free the mappings in capitol slot 4 and district slot 12.
Suppose Peeta, on the other hand, randomly chooses district index 12 to delete. He looks at district slot 12, sees that the slot is still mapped, and finds the corresponding capitol index is 4. Then Peeta will free the mappings in district slot 12 and capitol slot 4.
However, without proper synchronization, we may get race conditions (concurrent deletions of the same slot) or deadlock (concurrent deletions of the same mapping from opposite sides).
Your solution must satisfy these conditions:
Insert thread_yield() calls in your code to convince yourself of no deadlock.
Here is an example of correct looking output:
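One approach that avoids both hazards (a sketch in POSIX threads, not the assignment’s official solution): impose a global lock order, capitol before district, and have the district-side deleter re-check the mapping after re-acquiring locks in that order.

```c
#include <pthread.h>

#define NSLOTS 16

/* capitol_slots[c] == d and district_slots[d] == c form one mapping;
 * -1 marks a severed slot. One lock per slot on each side. */
static int capitol_slots[NSLOTS], district_slots[NSLOTS];
static pthread_mutex_t capitol_locks[NSLOTS], district_locks[NSLOTS];

static void slots_init(void) {
    for (int i = 0; i < NSLOTS; i++) {
        capitol_slots[i] = i;       /* identity bijection for the demo */
        district_slots[i] = i;
        pthread_mutex_init(&capitol_locks[i], NULL);
        pthread_mutex_init(&district_locks[i], NULL);
    }
}

/* Katniss severs from the capitol side; this is already canonical order. */
static void delete_from_capitol(int c) {
    pthread_mutex_lock(&capitol_locks[c]);
    int d = capitol_slots[c];
    if (d >= 0) {
        pthread_mutex_lock(&district_locks[d]);
        capitol_slots[c] = -1;
        district_slots[d] = -1;
        pthread_mutex_unlock(&district_locks[d]);
    }
    pthread_mutex_unlock(&capitol_locks[c]);
}

/* Peeta severs from the district side. To respect the capitol-first lock
 * order, read the partner index, drop the lock, re-acquire both locks in
 * canonical order, and then re-check that the mapping is still intact. */
static void delete_from_district(int d) {
    pthread_mutex_lock(&district_locks[d]);
    int c = district_slots[d];
    pthread_mutex_unlock(&district_locks[d]);
    if (c < 0)
        return;                     /* already severed */
    pthread_mutex_lock(&capitol_locks[c]);
    pthread_mutex_lock(&district_locks[d]);
    if (district_slots[d] == c) {   /* may have raced with another thread */
        district_slots[d] = -1;
        capitol_slots[c] = -1;
    }
    pthread_mutex_unlock(&district_locks[d]);
    pthread_mutex_unlock(&capitol_locks[c]);
}
```

The re-check after re-acquiring the locks handles the race where another thread severed (or started severing) the same mapping in the window where no locks were held; the fixed capitol-before-district acquisition order rules out deadlock.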
Consider the following example from my data systems class, where I initialize a directory to act as persistent storage for my database.
Here’s a first pass at initializing a storage struct. In this example, I ignore all errors and will throw an assertion error if an error occurs.
Obviously, this is not robust. If any of the operations returns an error, we get an assertion failure and our server process exits unexpectedly. Thus, we need to check for errors and clean up all the calls that occurred before the error.
As you can see, performing error handling naively like this results in quadratic growth in cleanup operations: each error check needs to clean up every operation before it, and as a result, the free(storage) line gets repeated multiple times. Can we do better?
Yes! The key is to use goto statements. Many introductory computer science courses discourage the use of goto statements, and rightfully so: goto statements, if used inappropriately, can lead to spaghetti code and can make code very difficult to reason about. However, error handling is a perfect use for goto statements to avoid quadratic code growth.
By laying out the error handling labels in the reverse order in which the operations were invoked, we can quickly jump to the appropriate position and clean up exactly the operations that occurred before the error. This eliminates the quadratic code growth in error handling! Furthermore, there is only one exit point in this function (at the very bottom), and reasoning about exit points in this version is much easier than in the previous version, especially once we throw in concurrency primitives and need to remember to release locks.
To eliminate the boilerplate of checking the return value and then jumping to the appropriate label on error, I wrote a couple of useful macros. They rely on a design decision that every function that may have an error does one of the following:
returns NULL on error (e.g., if the function allocates a data structure)
returns an int, where the int is an error code specific to your application, and 0 is success.

In this code, I define an enumeration enum dberror to represent the different kinds of error codes for my database application. I also provide a DBLOG(result) macro which, given an error code, prints out the human-readable string for that code, as well as the file, line number, and function where DBLOG was invoked. By designing your internal API around the two conventions above and invoking DBLOG every time an error occurs, we effectively get a stack trace for every error!
Now let’s combine this error logging facility with the goto pattern to reduce the boilerplate in the error handling code above.
The TRY macro executes expr, and if it returns a nonzero error code, we jump to the provided cleanup label. The TRYNULL macro is similar: it assigns var the result of expr, checks whether var is NULL, and if it is, assigns the appropriate error code to result and jumps to the cleanup label.
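Here is a sketch of what these macros might look like (the error codes, names, and demo function are assumed for illustration; the post’s actual definitions aren’t shown here):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error codes for the database application. */
enum dberror { DB_SUCCESS = 0, DB_ENOMEM, DB_EIO };

static const char *dberror_string(enum dberror e) {
    switch (e) {
    case DB_SUCCESS: return "success";
    case DB_ENOMEM:  return "out of memory";
    case DB_EIO:     return "I/O error";
    default:         return "unknown error";
    }
}

/* Print the human-readable error plus file, line, and function; calling
 * DBLOG at every level as errors propagate yields a crude stack trace. */
#define DBLOG(res) \
    fprintf(stderr, "%s:%d (%s): %s\n", \
            __FILE__, __LINE__, __func__, dberror_string(res))

/* Evaluate expr; on a nonzero error code, log it and jump to the label.
 * Assumes a local `enum dberror result` variable. */
#define TRY(expr, label) \
    do { \
        result = (expr); \
        if (result != 0) { DBLOG(result); goto label; } \
    } while (0)

/* Assign var = expr; if the result is NULL, record errcode, log, jump. */
#define TRYNULL(var, expr, errcode, label) \
    do { \
        (var) = (expr); \
        if ((var) == NULL) { result = (errcode); DBLOG(result); goto label; } \
    } while (0)

/* Example use, with cleanup labels in reverse order as before. Here the
 * success path also falls through and frees the temporary buffer. */
enum dberror demo(void) {
    enum dberror result = DB_SUCCESS;
    char *buf;
    TRYNULL(buf, malloc(16), DB_ENOMEM, done);
    TRY(DB_SUCCESS /* some fallible operation */, cleanup_buf);
cleanup_buf:
    free(buf);
done:
    return result;
}
```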
Using this, let’s write our final version of storage:
Nice and simple! Here’s what this pattern addressed:
Cleanup code no longer grows quadratically: each error jumps to a label that unwinds only the operations that had already succeeded.
The TRY and TRYNULL macros eliminate the boilerplate and automatically perform logging to give us a stack trace of errors.

My goal this semester was to restart the bootcamps and to revamp the curriculum. In this post, I’ll talk about:
Bootcamp Setup and slides. From my experience leading bootcamps last semester, I realized it is really hard to cater to so many different programming backgrounds and machine setups. As a result, I chose to standardize and require students to install a UNIX system with a package manager:
Intro to UNIX Part 1: Command Line and slides. This bootcamp presents the UNIX command line and the stdin, stdout, stderr, and pipe abstractions. Exercises include scavenger hunts through a code base using find, grep, and piping a sequence of commands together to transform and analyze files.
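A pipeline in the spirit of those exercises might look like this (the files and keyword are made up for the example):

```shell
# Set up a tiny fake code base (illustrative only).
mkdir -p hunt/src
printf 'TODO: fix\nTODO: test\n' > hunt/src/a.c
printf 'TODO: document\n'        > hunt/src/b.c

# Pipe commands together: find the matches, keep only the file names,
# then count occurrences per file and sort the worst offenders first.
grep -rn 'TODO' hunt/src | cut -d: -f1 | sort | uniq -c | sort -rn
```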
Intro to UNIX Part 2: Shell Scripting and slides. This continues the introduction to UNIX with shell scripts. Exercises include:
Git and Github and slides. This bootcamp introduces students to version control and using git with Github. Exercises include
Intro to Python and slides. This bootcamp introduces students to basic features of Python, including control flow and data structures (lists, sets, dictionaries, tuples, strings). Exercises include:
I designed the workflow centered around github wiki pages and pull requests. Here’s the workflow for a typical bootcamp:
I designed the curriculum and workflow with multiple goals in mind:
Flipped Classroom. From my experience in classes with flipped classrooms, students learn (at least with programming exercises) much better when they have hands-on exercises with guidance from the instructor. With this in mind, I wrote all the bootcamps to minimize the amount of lecturing I give and to maximize the amount of time students spend programming and asking questions.
Useful Software Engineering Skills. With these bootcamps, I wanted to provide others with the exercises and support that I wish I had had as a freshman: exercises teaching basic skills that are useful not only in industry, but also in an academic setting. From my experiences in internships and classes, these are some of the most useful skills I have learned.
Start from Zero. Because HCS’s target audience is students with very little programming experience, I wrote the bootcamps so that anyone starting from zero programming experience could quickly get their environment set up and start using the command line. Naturally, there will be students that already have programming experience, so I supplemented the basic exercises with more involved exercises that students can do at their own pace.
Feedback System. I believe it’s very important to get feedback on your work, especially when programming for the first time. Therefore, I designed the bootcamps to use GitHub’s pull request feature for comments and feedback. Pull requests allow us to leave inline comments on code and provide a comment/discussion thread for general feedback. I also encouraged students to install the GH Diff Highlight Chrome Extension to colorize pull request diffs.
Here are examples of the feedback we provided on the Intro to Python bootcamp.
Reusable. I am a senior and will be graduating soon; as a result, I want the bootcamps to be reusable after I leave. Because of this, I made all the bootcamps open-source GitHub repositories so that they may be reused, updated, and forked as necessary in the future.
Overall, attendance was generally higher than I expected. Attendance for the setup and first two UNIX bootcamps was roughly 30-40 people, one of the highest attendance rates HCS has had in a while for a bootcamp. Naturally, as the semester progressed, students got busier with midterms and assignments, so attendance dropped to about 20 for git, and a dozen or so for the Python bootcamp. After each bootcamp, I posted a survey asking for feedback. I asked the question:
What did you think of the bootcamp? (What you liked, didn’t like, what was useful, wasn’t useful. What would you have done differently? etc.)
The feedback was generally positive. The types of feedback were generally along one or more of these categories (see the Testimonial section at the end for real responses):
In terms of numbers, the Git and Github Bootcamp had 28 forks, with 16 students successfully submitting a pull request. The Python Bootcamp had 16 forks, with 10 students successfully submitting a pull request, and 4 students finishing all the Python exercises.
From these numbers and testimonials, it seems that the flipped classroom model worked very well, and the various levels of exercises and walkthroughs catered to both advanced and beginner students.
Below are the unedited and anonymized testimonials from students:
This seems well done, albeit rather basic. Maybe mentioning “do one thing well” to explain why UNIX works the way it does.
I thought it was good since it taught me all the basics in one sitting
Like it.
I think this was great although taught a little quickly
Useful, but went a little slowly; Overall was run very well, learned some useful UNIX commands.
Very useful! I learned a lot about some of the basic commands available to us. I wish we’d dived into the scavenger hunt sooner so we could’ve had more time for it and the shell script exercise.
I liked how many functions could be linked to each other. That was really cool. Also how there are functions that allow tons of flexibility with the char ability.
Very helpful especially oneonone. Went through the slides a bit too fast
Awesome! While I’ve done a good amount of programming before, I haven’t had a chance to learn many of the covered UNIX commands… until now.
Thanks for an awesome class, Kenny!
smaller room preferably
Scavenger hunt was fun. I’m glad I sat next to people who were more familiar with this material.
Ir was really useful. I got a better sense of how to use the command line than ever before! Thanks.
It got a little hectic at the end…
I thought it went a little too fast for me. But I enjoy the premises of the program. I just wish it was not so much like a class, but a collaboration so that everyone feels involved.
I thought it was a really good topic, but it moved to quickly for me. It also would have been nice if there were more people to answer questions, because that was really helpful. Along with piazza, it would be nice to have a small recap session to go over the topics and solidify them.
This was a good set of exercises. I got distracted with unrelated things (c/p, sending the answers directly to answers.txt), so I lagged behind.
I liked it! I thought I knew shell scripting before, but now I feel more comfortable with channeling and piping and whatnot. :)
It was very informative!
I’ve learned a lot. It was very informative.
Great tutorial. Learned a lot about UNIX, especially piping.
could have been a little more organized. But people were very helpful.
Useful bootcamp, I didn’t know how to write shell scripts…
I liked having the recap of last week and generally enjoyed this exercise. I like the presence of a second exercise for those who finish early.
Could’ve used more instruction on the scraper parts  using man over and over again got annoying
Awesome overview of git. I knew a bit coming in, but setting up the alias for my “lg” command was great, so that I can see my requests in a more aesthetically pleasing fashion. Maybe a bit more on creating branches to make temporary changes and then merging your own branch back.
github is harder than i expected – it’s like a whole new world of stuff! so it was largely a struggle but i think i learned a lot
Liked how straightforward it was. I would maybe have thrown in a challenge exercise
It was very fun and easy to understand. Plus, I learned a whole bunch of stuff I never knew before (and a lot of things unrelated to git, but useful as a whole, like how to use vim). :)
It was fantastic! I don’t think I would have done anything differently.
Useful, already knew a decent amount about git. The pace was good, wouldn’t have changed anything.
It is difficult but manageable
Exercises are nice.
It was great! And useful!
I thought the bootcamp was useful. Github is a lot more manageable and less confusing after today’s bootcamp. I’m glad I came.
This one was really good and understandable!
I wasn’t able to come to the bootcamp on Wednesday, but I just completed it on my own  the slides and online directions were really helpful, and I could pretty much figure most things out on my own.
On a side note, I also completed the mail merger exercise from last week’s bootcamp, and the solution didn’t follow the instructions that we were given, so that confused me a little.
Pretty cool. Exercises were nice, good practice. I was already pretty familiar with Python, though, so not representative.
It was useful, especially the talk about CS classes and coursework!
This was a great lesson. The only problems I had were setting up git stuff, but that is mostly due to the fact that this was the first time I came up. There is not much to change other than perhaps moving a bit slower.
It was great! Learnt a lot about python. Especially liked the exercises that helped me get a little more used to python syntax.
I love to code and I love the exercises, as well as the debugger tests that let me see what I was doing wrong and which lines. Python is so applicable. To improve, I think we could get thai(Spice?) for the food next time to add variety.
I liked that it had fewer directed parts and relied more on us learning some of python’s capabilities on our own
I published my first Chrome Extension: Omnibox GDrive Search! The extension allows you to search your Google Drive for documents and then jump directly to them from the omnibox! See the code on GitHub.
In the omnibox, type gd and press TAB. Now you can enter your queries and jump directly to the file from the omnibox suggestions!
To use it, you must first grant the extension read-only access to your Google Drive metadata by following these instructions (see screenshots in the extension link):
chrome://extensions
By far, the most difficult part of the extension was understanding authorization with Google Drive. The extension is a bit buggy and lags a little because it must renew its authorization with Google Drive every 15 minutes or so. To work around this, the extension only requests authorization on the first use after expiration and redirects the user to the options page. Thus, if you’re using the extension for the first time in a while, it may take ~3 seconds before search results come back.
For my CS283: Computer Vision Final Project, I created an application to control Google Maps using your hand and a webcam.
It uses the Chrome API to access the webcam; frames are sent to a Tornado server that runs the hand tracking pipeline, annotates the frame, and sends back a displacement vector to update the view of the map. I am using OpenCV for its Haar Cascade libraries.
You can view the code on GitHub and set up an instance of the server locally!
You can also view the paper I wrote to describe the process I used in the pipeline. Below is a summary of the problem statement and the stages in the pipeline.
Given a poorly trained Haar Cascade Classifier (250 positive samples and 100 negative samples) to recognize hands, this project assembles a pipeline to improve the quality of the tracking. These steps include:
Flags.parse in the main method of the application. The library offers support for various wrapper types as well as collection types.
As an example of using the library, you declare a flag using the @FlagInfo annotation with the desired flag names and values.
Then you provide flag values at the command line like so:
All classes referenced from the main class with flags will be available as options.
In addition to learning how to use Java’s Reflection capabilities, this was also an exercise in learning how to use Maven to build and deploy my project. I am using a GitHub project as a Maven server.
See the README.md in the GitHub directory for more information on how to use and install the library.
One of the most interesting problems I saw in the course involves Markov chains and a simple and elegant solution using another interesting problem we saw earlier in the course–the coupon collector’s problem.
Suppose $n$ cards are placed in order on a table. Consider the following shuffling procedure: Pick a card at random from the deck and place it on top of the deck. What is the expected number of times we need to repeat the process to arrive at a “random” deck, for some suitable definition of “random”?
To solve this question, we’ll need to answer a seemingly unrelated question first.
A certain brand of cereal always distributes a toy in every cereal box. The toy in each box is chosen uniformly at random from a set of $n$ distinct toys. A toy collector wishes to collect all $n$ distinct toys. What is the expected number of cereal boxes the toy collector must buy to collect all $n$ distinct toys?
The key to understanding this problem is to break the task of collecting all $n$ distinct toys into different stages: what is the expected number of cereal boxes that the toy collector has to buy to get the $i$th new toy?
Let random variable $X_i$ be the number of boxes it takes for the toy collector to collect the $i$th new toy after the $(i-1)$th toy has already been collected. (Note: this does NOT mean assign numbers to toys and then collect the $i$th toy. Instead, this means that after $X_i$ boxes, the toy collector would have collected $i$ distinct toys, but with only $X_i - 1$ boxes, the toy collector would have only collected $i-1$ distinct toys.)
Clearly $E(X_1) = 1$, because the toy collector starts off with no toys. Now consider the $i$th toy. After the $(i-1)$th toy has been collected, there are $n - (i-1)$ possible toys that could be the new $i$th toy. We can interpret the process of waiting for the $i$th new toy as a geometric distribution, where each trial is buying another cereal box, “success” is getting any of the $n - (i-1)$ uncollected toys, and “failure” is getting any of the already collected $i - 1$ toys. From this point of view, we see that $$X_i - 1 \sim \textrm{Geom}(p)$$ where the probability of success $p$ is $$p = \frac{n - (i-1)}{n}.$$
Here, our definition of the geometric distribution does NOT include the success. Using the expectation of a geometric distribution, we have that the expected number of cereal boxes the toy collector must buy to get the $i$th new toy after collecting $i-1$ toys is $$E(X_i - 1) = \frac{1 - p}{p}$$ $$E(X_i) = \frac{1}{p} = \frac{n}{n - (i - 1)}.$$
Now let random variable $X$ be the number of cereal boxes the toy collector needs to buy to collect all $n$ distinct toys. Since we have separated the process into collecting the $i$th new toy, we have $$X = X_1 + X_2 + \cdots + X_n.$$
Using linearity of expectation, we can compute the expected value of $X$ by summing the individual expectations of $X_i$. Thus, we obtain the following result: $$E(X) = E(X_1 + X_2 + X_3 + \cdots + X_n)$$ $$E(X) = E(X_1) + E(X_2) + E(X_3) + \cdots + E(X_n)$$ $$E(X) = n\left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \right).$$
This is the harmonic series! The harmonic series diverges to infinity, and its $n$th partial sum grows approximately as $\gamma + \log n$, where $\gamma \approx 0.57722$ is Euler’s constant. Thus, we can approximate the expected number of cereal boxes with: $$E(X) \approx n (\gamma + \log n).$$
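The derivation above is easy to check numerically. Here is a minimal Python sketch (the function names are my own, for illustration) that computes the exact expectation via the harmonic sum and compares it to the $n(\gamma + \log n)$ approximation:

```python
import math

def expected_boxes(n):
    """Exact expected number of boxes: n times the n-th harmonic number."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def approx_boxes(n, gamma=0.57722):
    """Approximation n * (gamma + log n) from the asymptotics of H_n."""
    return n * (gamma + math.log(n))

# For n = 52 toys, the exact and approximate values agree closely.
print(round(expected_boxes(52)))  # → 236
print(round(approx_boxes(52)))    # → 235
```

For $n = 52$ the two values differ by less than one box, which is why the approximation is good enough for the shuffling estimate later in the post.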
Coming back to the random-to-top shuffling problem, we first need to define our notion of “random” for our deck. In order to do this, we use Markov chains.
For our Markov chain, let our states be all $n!$ permutations of the $n$-card deck, where two states are adjacent if and only if it is possible to reach one from the other through one step of this shuffle. From any state, we move to each of its $n-1$ neighbors with probability $\frac{1}{n}$, or stay at the same state with probability $\frac{1}{n}$. Since each of our $n!$ states has degree $n$ (including the self-loop), by symmetry every permutation is equally likely in the long run. Thus, the stationary distribution for our random-to-top shuffling Markov process is the uniform vector $$\vec{s} = \left(\frac{1}{n!}, …, \frac{1}{n!}\right).$$
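As a small sanity check on this symmetry argument, we can build the transition structure explicitly for a tiny deck ($n = 3$) and verify that every permutation receives total incoming probability mass 1, i.e. the chain is doubly stochastic, so the uniform distribution is stationary. A sketch (assuming index 0 is the top of the deck):

```python
from itertools import permutations

def move_to_top(deck, i):
    # One shuffle step: take the card at position i and put it on top.
    return (deck[i],) + deck[:i] + deck[i + 1:]

n = 3
states = list(permutations(range(n)))   # all n! = 6 orderings
incoming = {t: 0.0 for t in states}
for s in states:
    for i in range(n):                  # each position is picked w.p. 1/n
        incoming[move_to_top(s, i)] += 1.0 / n

# Every column of the transition matrix sums to 1 (doubly stochastic),
# so the uniform vector (1/n!, ..., 1/n!) is stationary.
assert all(abs(mass - 1.0) < 1e-9 for mass in incoming.values())
```

The same check passes for any small $n$; the loop just gets slower as $n!$ grows.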
Thus, to define our notion of a “random” deck, we would like that after implementing our shuffling algorithm, the resulting deck is sampled from our stationary distribution: that is, our resulting deck is equally likely to be any of the $n!$ permutations.
Now that we have established that our shuffling process can be modeled with a Markov chain that has a stationary distribution, we use the idea of “coupling” to arrive at our solution.
Let deck $A$ be our original deck, and let deck $B$ be uniformly randomly sampled from all $n!$ permutations. Since the stationary distribution for our shuffling process is the uniform distribution, then deck $B$ is sampled from the stationary distribution.
We use the fact that if we start our Markov process from a state sampled from the stationary distribution, then the resulting state will also be sampled from the stationary distribution. More formally:
Lemma. Let $\vec{s}$ be the stationary distribution of our Markov chain. Let $X_0$ be our starting state, and let it be sampled from the stationary distribution (i.e. $P(X_0 = i) = s_i$). Then the resulting state $X_1$ after running the Markov chain for one step will also be sampled from $\vec{s}$.
Now consider our “coupling” strategy: every time we move a card $C$ to the top of deck $A$, we locate card $C$ in deck $B$ and place it on top of the deck. Note that the physical process of how we chose card $C$ in the two decks is different: we choose a random position in deck $A$, whereas we located card $C$ in deck $B$. Although the process of how we chose card $C$ is different, from deck $B$’s perspective, $C$ is simply a card selected at random. Using our lemma, we have that deck $B$ still remains sampled from the stationary distribution after moving card $C$ to the top of deck $B$.
We note that after $t$ steps, all the cards that have been touched up to time $t$ will be in the same order on top of both decks. When all the cards of deck $A$ and deck $B$ are in the same order after some number of steps $T$, both decks are sampled from the stationary distribution (because $B$ remains sampled from the stationary distribution throughout our coupling strategy). Thus, after $T$ steps, deck $A$ will satisfy our notion of a “random” deck. We wish to compute $E(T)$.
How do we compute $E(T)$? We note that both decks will be the same once we have touched all the cards. Therefore, we wish to compute the expected number of random-to-top shuffles needed to touch all the cards. This is an instance of the coupon collector’s problem! Instead of touching all $n$ cards, we wish to collect all $n$ coupons. Thus, after approximately $n (\gamma + \log n)$ random-to-top shuffles, our original deck $A$ will be a “random” deck. For $n = 52$, we require $E(T) \approx 236$ shuffles to randomize our deck.
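The stopping time can also be estimated by simulation: count random-to-top steps until every card has been touched, which is exactly the coupon collector’s stopping time. A quick Monte Carlo sketch (the function name, trial count, and seed are my own arbitrary choices):

```python
import random

def shuffles_until_random(n, trials=2000, seed=1):
    """Average number of random-to-top shuffles until all n cards are touched."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        touched, steps = set(), 0
        while len(touched) < n:
            touched.add(rng.randrange(n))  # the card moved to the top this step
            steps += 1
        total += steps
    return total / trials

print(shuffles_until_random(52))  # ≈ 236 on average
```

With 2000 trials the sample mean lands within a card or two of the analytic value $52(\gamma + \log 52) \approx 236$.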
Early in the semester, we discussed several implementations of binary search, starting from a simplistic version and incrementally building up to a production-ready version. I thought the binary search discussion was an extremely eye-opening exercise; it was my first time seeing invariants used in proofs to establish properties of code.
Below is how I’ve written binary search since high school:
Here, I am using Doxygen-style comments for my specifications. In this version of binary search, I return the index of any occurrence of item x in array a, or return -1 if there is no such occurrence. While this implementation is acceptable for an array of ints, it is not particularly useful for other data types.
Using C++ templates, we can generalize this implementation to make it polymorphic for any type T, provided we supply a suitable comparison function compare, where compare(p,q) returns true if and only if p is less than q for some ordering of values of type T. Thus, here is our attempt #2 at binary search:
Now, in order to call binary search, we must provide a function object compare that defines how we compare two elements of type T. Below is an example of how we would invoke this version of binary search:
We overload operator() to allow IntComp objects to be invoked like functions, and we pass an instance of IntComp to binary_search2 whenever we perform a binary search on an array of ints.
Note one other difference between the two versions of binary search: in attempt #1, we had the line:
m = (lo + hi) / 2;
whereas in attempt #2, we replaced this line with:
m = l + (r - l) / 2;
For all these years, I’ve been writing binary search incorrectly! In the first version, we may run into integer overflow if lo + hi happens to be greater than the maximum value of an int! In the second version, we fix this subtle bug by first subtracting l from r, then halving the difference and adding the result to l to calculate the new middle index m. By subtracting first, we are guaranteed that r - l will not overflow (by the implicit precondition that r and l are valid indices into the array and r > l), and thus m will also be a valid index into the array.
We have generalized our binary search to work on an array containing any type. But, we have actually done more than this. In C++, iterators overload pointer syntax to represent collections of items. Using iterators, we can represent an entire range of items in a collection with only two iterators–one pointing to the beginning of the collection, and one pointing to the “position” after the last element in the collection. See the CS 207 blog entries here for more information on C++ iterators. In our example, however, we represent the array collection with a pointer to the first position and the number of items in the list. Because binary search requires random access into our collection, any collection represented by a random access iterator will be able to use the second version of our binary search!
Can we still do better? In our specification for binary search, note that we allowed the index of any occurrence of our search item x to be returned. This ambiguity makes it difficult to make any real use of the return value of binary search (except simply to check whether the item is in the collection). Instead of returning any index, what if we returned a lower bound position of the element x in our collection? By lower bound, we mean the first index into the array at which we could insert x and still keep the elements in sorted order.
For example, with the array {0, 1, 2, 5, 5, 5, 7, 9}, the lower bound of 0 would be 0, because we can insert 0 into index 0 and still keep our array sorted. The lower bound of 1 is also 0 by similar reasoning. The lower bound of 5 is 3, because 3 is the smallest index at which we can insert 5 and keep the array sorted. Similarly, the lower bound of 6 is 6. Note that the lower bound of 10 is 8, which is not a valid index into the array. This is okay because the return value only indicates the index at which one could insert an item and maintain the sorted property of the array.
To implement this, we can think of the array as a collection of boolean values where the entries are {false, false, ..., false, true, true, ..., true} (all the falses occur together at the beginning of the array). The boolean values correspond to whether our target element x is less than or equal to the value in that array position. Our goal, then, is to find the first true in the array, or return the last position (indicating that placing x at the end of the array would maintain the sorted property of our array). Building on the polymorphism we introduced in attempt #2, here is attempt #3 using the lower bound idea:
Nice, clean, and simple!
Note that this version uses only one comparison per iteration instead of two (as we did in attempts #1 and #2)! This lower bound idea not only tells us whether our element x is in the array, but where we should place it to keep the list sorted!
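The lower-bound idea is compact enough to sketch in a few lines of Python (a hedged translation of the C++ version, using plain < as the comparison; the standard library’s bisect.bisect_left computes the same thing):

```python
def lower_bound(a, x):
    """Return the first index at which x could be inserted into
    sorted list a while keeping it sorted."""
    l, r = 0, len(a)
    while l < r:
        m = l + (r - l) // 2   # overflow-safe midpoint, as in attempt #2
        if a[m] < x:
            l = m + 1          # everything at or before m is < x
        else:
            r = m              # a[m] >= x, so the answer is at or before m
    return l

a = [0, 1, 2, 5, 5, 5, 7, 9]
print(lower_bound(a, 5))    # → 3
print(lower_bound(a, 6))    # → 6
print(lower_bound(a, 10))   # → 8, one past the end
```

These match the worked examples above: 5 belongs at index 3, 6 at index 6, and 10 just past the end of the array.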
This code looks simple enough to verify the correctness by eyeballing it; but can we make this rigorous? Can we prove the correctness of this code? Yes! Here is the same piece of code but commented heavily with the proof of its own correctness.
To prove the correctness, we make heavy use of the post condition:
Thus, all elements at indices less than the return value R are less than x, and all other elements are greater than or equal to x. We use both of these if-and-only-if conditions in the two branches of the if conditional to guide us on how we should update l or r.
In both branches of the conditional, the new values of l and r are maintained so that l <= R <= r and still satisfy the post condition of the function. Thus, the statement l <= R <= r is a loop invariant of the while loop: it is always true on entering and leaving the loop. To ensure that the loop terminates, we require a decrementing function: a function that decreases on each iteration of the loop and is equal to zero when the loop terminates. In this case, the obvious choice for the decrementing function is d = r - l. We show in both branches that the new values of l and r satisfy r_new - l_new < r - l, so d decreases on each iteration. When d = 0, we have l = r, which is exactly when the loop terminates. Thus, our final line return l; is proven correct by the combination of our post conditions, pre conditions (the array is sorted), loop invariant, and decrementing function. By analyzing the invariants in the code, the code almost writes itself! Cool!
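The proof sketch can even be checked mechanically. Below is a Python illustration I added (not the original annotated C++): the same lower-bound loop, with the loop invariant and the decrementing function asserted on every iteration:

```python
def lower_bound_checked(a, x):
    """Lower-bound search that asserts its own loop invariant and
    decrementing function at runtime."""
    l, r = 0, len(a)
    d_prev = r - l                       # decrementing function d = r - l
    while l < r:
        # Invariant: everything left of l is < x, everything at or
        # right of r is >= x, so the answer R satisfies l <= R <= r.
        assert all(a[i] < x for i in range(l))
        assert all(a[i] >= x for i in range(r, len(a)))
        m = l + (r - l) // 2
        if a[m] < x:
            l = m + 1
        else:
            r = m
        assert r - l < d_prev            # d strictly decreases each iteration
        d_prev = r - l
    assert l == r                        # loop exits exactly when d = 0
    return l

assert lower_bound_checked([0, 1, 2, 5, 5, 5, 7, 9], 5) == 3
```

If any update ever violated the invariant or failed to shrink d, the assertions would fire, which is a cheap way to spot a broken binary search while testing.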
To view the code in its entirety (along with a couple of simple test harnesses for each version of binary search), check out the source here.
I also installed $\LaTeX$ integration with Octopress using the handy hints from here. Now I can write pretty inline equations like $e^{i\pi} + 1 = 0$ or centered equations like $$\int_{\Omega} \, d\omega = \int_{\partial \Omega} \, \omega.$$ Nice! Hopefully this will motivate me to write more math-related entries!
To keep track of tags, I installed a plugin to generate tag clouds (see the right sidebar). I also finally discovered how to make background images that are just noise using this background generator here. I like the simplicity of these backgrounds!