I participated in the second CodeSprint that was held in January 7 – 8 2012. I knew this contest from a link posted in a TopCoder forum thread. As far as I know, many companies will use the result of this contest to interview new employees/interns. Because I was in holiday after the final exams, I took part in this contest, maybe somehow by luck I could get interview from some companies ðŸ™‚

The contest had three types of problems: algorithmic, company, and real-world problems. The algorithmic ones are much like usual competitive programming problems. Company problems are set by some of the companies. Real-world problems are “real”, for example, one of them is creating a web service for resizing company logos.

During the 48-hour contest (yes, only 2 days), I managed to solve 6 algorithmic problems and 1 company problem. I ranked #38 overall and #2 in Indonesia (the first rank went to **dolphinigle**).

Here is an analysis on some of the problems.

### Algorithmic Problems

**Coin Tosses** (Accepted, 20/20 points)

I first tried solving this problem by transforming it into system of linear equations.

Assume that N = 3. Let f(x) be the expected number of tosses needed to make 3 consecutive heads, given that we have already x consecutive heads. Of course, the range of x is 0 <= x <= 3.

Now, let’s evaluate f(x). It is clear that f(3) = 0 (no additional tosses needed); this is the base case.

For 0 <= x < 3, there are two possibilities:

- The next toss results in head. We have x+1 consecutive heads now, so the expected number of tosses is 1 + f(x+1).
- The next toss results in tail. We don’t have any consecutive heads now, so the expected number of tosses is 1 + f(0).

Both cases happen with probability 1/2. Therefore, the formula for f(x), 0 <= x < 3, is:

or, more simply,

The complete formulas for N = 3 would be:

One may be tempted to solve the function using dynamic programming, but that does not work as the formulas are cyclic. For example, f(0) depends on f(1) but f(1) also depends on f(0).

Now, take a look at the equations once more. Actually it has been implicitly solved. We can work backward (back-substitution) like this:

So,

Having computed the values of f(3) and f(0), we can use them compute the value of f(2), and then f(1). We have solved the equations! This solution clearly runs in O(N) time complexity. We can precompute the results for all N and stores them in a 1000 x 1000 table. The total complexity is O(N^2 + T).

But there is a problem with this solution. Simply running this solution with large N and small M will generate a very large output that overflows double data type. Replacing double with, say, BigInteger, would probably result in TLE. So, I tried printing the results for small N and M, and then tried looking for pattern. The pattern was amazingly obvious. It turned out that the answer for this problem is simply .

This new solution can be coded in Java/Python with their built-in big integer data type, and would run fast enough for 100 test cases.

**Permutation** (Accepted, 14.67/25 points)

This problem is essentially an approximation to the Traveling Salesperson Problem. Just assume that the numbers are cities, and V[i][j] is the cost of going from city i to city j. A permutation of the numbers means a path that visits all cities exactly once.

The method I used was splitting the cities into blocks of 15 cities each (or may be less than 15 for the last block). Then, for each block, I did a DP with bitmask to get a Hamiltonian path considering only cities in that block, with the maximum cost.

The DP I used is two-dimensional: dp[u][mask] is the maximum cost of visiting all cities not contained in mask, given that the last visited city is u. The set mask is represented with a bitmask of n bits, where n is the number of cities in the block (15 or less).

So, the base case is

, for all u (we have visited all cities).

The recurrence is determined by choosing the next city v to visit, after visiting city u:

, for all v not in mask.

The maximum cost of a Hamiltonian path in this block is simply by choosing a starting city that leads to the maximum cost:

, for all u

The Hamiltonian paths for the other blocks are computed in the similar way, except that the starting city of a block must be the last city of the previous block.

**Count Strings** (Accepted, 50/50 points)

This was the hardest problem I solved in the contest. I had to learn regex parsing and converting it into finite automata, during the 48-hour contest.

First, I converted the input string into an NFA (Non-deterministic Finite Automata), and then transformed it into DFA (Deterministic Finite Automata). There is a nice tutorial and explanation on how to parse a regex into an NFA and then into DFA here.

To make it simple, a DFA is a directed graph, whose vertices are “states”, and edges are “transitions” between the states. The edges are labeled by ‘a’ or ‘b’. One vertex is marked as “starting” state and several vertices are marked as “finishing” state. A string of ‘a’ and ‘b’ is like a path in the DFA from the starting vertex. If the path ends in one of the finishing vertices, then the DFA is said to “match” or to “accept” the string.

For example, the input **((a|b)(a*)) 5** can be parsed into this DFA:

--a-->[1]--a / | ----+ | v / | [O]--+ [3]*--a--> | ^ \ | --b-->[2]--a

The starting vertex is vertex 0, while the finishing vertices are vertices 1, 2, and 3. Any string that leads to a path from vertex 0 to any of the vertices 1, 2, and 3 is accepted by the regex (try it with several such strings). Now, the problem is to count the number of strings which lead to paths of length 5 starting from vertex 0 that end in vertex 1, 2, or 3.

Let’s build the adjacency matrix of the above graph:

It is well-known that is actually the number of paths from i to j of length 1. More generally, is the number of paths from i to j of length n. Therefore, here are the steps to solve this problem:

- Construct a DFA from the input regex.
- Build M, the adjacency matrix of the resulting DFA.
- Raise M to the L-th power, using exponentiation by squaring.
- Sum all , where x is the starting vertex and y is a finishing vertex, for all y.

The hardest part is actually constructing the DFA. It took me several hours straight to learn the construct it from the above link.

**Picking Cards** (Accepted, 20/20 points)

First, sort the input in ascending order. For example, let the sorted input be 0, 0, 1, 1, 4. Consider the cards one-by-one. For each card, we will count how many ways to pick up to that card.

- The first card has c = 0. Because we haven’t picked any cards yet, it can be picked. There are only 1 way: {1}.
- The second card has c = 0. The smallest valid position for this card is the first position. So, there are 2 ways: {2, x}, {1, x}.
- The third card has c = 1. The smallest valid position for this card is the second position. So, there are 2 ways: {x, 3, x}, {x, x, 3}.
- The fourth card has c = 1. The smallest valid position for this card is the second position. So, there are 3 ways: {x, 4, x, x}, {x, x, 4, x}, {x, x, x, 4}.
- The last card has c = 4. The smallest valid position for this card is the fifth position. So, there are 1 way: {x, x, x, x, 5}.

In each step, the ‘x’s can be replaced with any valid sequence from the previous step. Therefore, the total number of ways is 1 x 2 x 2 x 3 x 1 = 12.

The smallest valid position for the -th card (1-indexed) is . The total number of valid positions for that card is then (or 0, if ). Therefore, the total number of ways is:

**Subsequence Weighting **(Accepted, 30/30 points)

Let’s try solving it with a simple dynamic programming solution. Let be the maximum weight of a subsequence that ends in . The recurrence would be

, where and .

Note that if we fill the dp table in ascending order of , and in decreasing order of for the same values of , and set the unfilled entries of the dp table to 0, we can reduce the constraints into the above dp into:

, where .

The value of “, where ” can be computed efficiently in O(log N) time complexity using a binary indexed tree. The solution to the original problem is then for all i. So, the total time complexity of this solution is O(N log N).

**Direct Connections **(Accepted, 35/35 points)

The problem essentially asks for the sum of cable length for each pair of cities, where the cable length for a pair of city i and city j is .

Consider the cities in ascending order of . Now, the equation turns into the sum of , for all i and for all previous cities j of i. Moreover, the absolute sign can be removed by transforming the equation into the sum of , for all i and for all previous cities to the left of city i, plus , for all i and for all previous cities to the right of city i.

Let’s solve for the sum of (the right one can be solved in similar way). If there are previous cities to the left of city , the equation becomes .

Hence, for each city (again, considered in ascending order of ), we need to know two pieces of information:

- The number of previous cities to the left of it ()
- The sum of positions of all previous cities to the left of it (

These two values can be computed using (again) two binary indexed trees in O(log N) each. One BIT stores the number of cities to the left of a certain position, and the other BIT stores the sum of positions of all cities to the left of a certain position. By position I meant the “rank” of a city when the cities are sorted in ascending order of .

**Polygon** (Not Attempted, 0/40 points)

**dolphinigle** posted his solution in a TopCoder forum thread.

**Crab Graphs** (Not Attempted, 0/40 points)

I honestly had no idea on this problem. Several guys discussed this problem on the same thread as Polygon; please check that.

### Company Problems

**Quora Nearby **(Accepted, 40/40 points)

At first I thought this problem must be solved using some sophisticated Kd-tree data structure, with some kind of nearest-neighbor search algorithm. But, when I decided to try a simple naive search (O(N) for topics and O(N^2) for questions), I got accepted!

**Fraud Prevention **(Wrong Answer, 15/30 points)

I submitted like 20+ solutions for this problem, but I couldn’t pass the last test case. I don’t know, may be I missed something in the problem. What I did is a simple O(N^2) search to look for all pairs of fraudulent orders (it didn’t get TLE, though). For each pair, I checked if they satisfy one of the fraudulent requirement.

===

That’s it. If anyone would add alternate solutions or anything, please add them in the comment section ðŸ™‚

Very helpful, can you share code for “Direct Connections”.

Hi deepak,

Here is my code that I submitted during the contest: http://ideone.com/c5Kpf

Hi! Great post ðŸ™‚

I’m still trying to get the coin tosses problem through. Even with your solution it gets rejected :S

Well… c’est la vie.

For the fraud detection problem I did basically the same as you, though I lowercased and trimmed everything, and replaced the full names for their respective abbreviations.

Then it took me like 15 submissions to realize that I wasn’t ordering the results, but after that I passed all the cases .

Then I did the Quora Classify problem. I couldn’t send it for the contest, but I finished it later.

I implemented a basic logistic regression algorithm. The only extra step was to mean-normalize and scale the features.

After doing that I got a perfect score on the test set you can download(200 training examples, 100 features each, with 200 test cases), but it wasn’t passing the small one on the problem description (5 training examples, 23 features each, 2 test cases).

Then I realized that there were features in the small test set that always had the same values. That made the standard deviation to be 0, and thus, since I was using the std to scale my features ((feature1 – mean(feature 1)/std(feature1))), made the algorithm fail, because some values were being divided by 0.

I could have implemented a principal component analysis to solve this, but I decided to just sum 1E-18 to the std of each feature.

Not the most elegant and correct solution, I know, but hey, machine learning is not about precision, it’s about approximation, and if you look at it with an eye closed, 1E-18 is as equal to 0 as it gets ðŸ˜‰

Then I submitted the program and passed the (only) 2 test cases.

Thanks again for sharing your solutions.

for permutation problem i used Simulated annealing and got 20/25 points.

Here is my solution to Groupon fraud Detection.

http://www.interviewpaper.com/fraud

Concerning Fraud Prevention: I made 20+ attempts as well, there were two solution based on sorting (O(NlogN)) and grouping (O(N), since LINQ does this with using a hash-based structure). Both were passing time limit, but failing most of tests.

So I suspect there is something with normalization conditions.

Btw, Quora Nearby also has an incomplete condition related to distance comparison: floating point numbers may have different precision dependently on platform, so distance equality must be defined more strictly.

Hey great post, never heard of BIT before, thanks a lot for sharing this.

However I am having trouble understanding Subsequence Weighting, can you please post the solution?

Thanks!

Hi Marko,

Here is the commented version of my solution: http://ideone.com/PMcfj

If you have problems understanding the code, feel free to ask ðŸ™‚

Thanks! That was a clever usage of BIT, I wouldn’t have though of that

Hi Fushar,

I had a question regarding BITs that I think you might be able to help me with. It concerns your discussion on the following thread:

http://apps.topcoder.com/forums/?module=Thread&threadID=661871&start=0&mc=7#1188813

So, in the card problem alluded to in the above thread, what if I was interested in making Queries of the form: Q i j –> return number of cards between i and j that have been placed face up? If I were to assume that every Query of the form Q i i would take log(N) (using the read function for a BIT), then the worst case scenario for Q i j would be N*log(N). Is there a better way to do this using BITs? Or should I look at other data structures to achieve the hallowed log(N) complexity.

Thanks for your time,

Seb

Hi Sebastian,

For range update + range query, I’m afraid that you must use segment tree data structure.

You can learn about it here: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor#Segment_Trees

Oh OK. Will rack my brain to understand this tut.. Thanks a lot. Your blog is seriously an oasis of knowledge. Cheers! Btw, if you enjoyed codesprint, you might want to look at http://www.interviewstreet.com/recruit/challenges/dashboard for some more kickass problems ðŸ™‚

For “Fraud Prevention” I also made 20+ submission… I finally got accepted by fixing a minor bug: when performing string normalization, I replaced a “short form” with “long form”, which was wrong – a substring “illinois” would become “ilillinois” in my WA code. I should do the replacement in a reverse manner.

Hi Fushar ,

can you please share the code for Count Strings ? ðŸ™‚

Hello xmanx,

Sure. Here it is: http://ideone.com/1Fqxw

thanks a lot fushar ðŸ™‚ for the fast reply !

Hi Fushar

I want to learn c++.I know the basic can you provide me some links( videos,books,anything) where i can learn more advance like you used in your programs.

Thanks

Hi Fushar,

Now Im struggling with the Quora problem nearby. In the beginning I also thought that I should implement kd-trees, but after that I read in this post that your solution passed with the naive implementation. Did you implemented it in C++ or Java? Because my trivial implementation is in Java and I got time limit.

Hi Fushar,

Can you pls share code for sequence weighting.