You are to write an interactive program that, given a sequence of Wikipedia excerpts (see example below), guesses the language of each, in turn. After each guess, your program is given the correct answer, so that it may learn to make better guesses the longer it plays.Each language is represented by a number L between 0 and 55. Each excerpt has exactly 100 symbols, represented as an array E of 100 integers between 1 and 65 535. These integers between 1 and 65 535 have been assigned arbitrarily, and do not correspond to any standard encoding.
You are to implement the procedure excerpt(E) where E is an array of 100 numbers representing a Wikipedia excerpt as described above. Your implementation must call language(L) once, where L is its guess of the language of the Wikipedia edition from which E was extracted. The grading server implements language(L), which scores your guess and returns the correct language. That is, the guess was correct if language(L) = L.
The grading server calls excerpt(E) 10 000 times, once for each excerpt in its input file. Your implementation’s accuracy is the fraction of excerpts for which excerpt(E) guessed the correct language.
You may use any method you wish to solve this problem. Rocchio’s method is an approach that will yield accuracy of approximately 0.4. Rocchio’s method computes the similarity of E to each language L seen so far, and chooses the language that is most similar. Similarity is defined as the total number of distinct symbols in E that appear anywhere amongst the previous excerpts from language L.
Note that the input data have been downloaded from real Wikipedia articles, and that there may be a few malformed characters or fragments of text. This is to be expected, and forms part of the task.
For illustration only, we show the textual representation of excerpts from 56 language-specific editions of Wikipedia.
The sample input file grader.in.1 contains 10 000 such examples. The 56 languages are those listed as “mother tongue” in the IOI 2010 registration data. The language for each excerpt is chosen at random from these 56 languages, and each excerpt is taken from the first paragraph of an article chosen at random from the corresponding Wikipedia edition. Each line of the file contains:
- The two-letter ISO code for the Wikipedia language edition;
- 100 numbers between 1 and 65 535, representing the first 100 symbols, in sequence, of the first paragraph of the article;
- a viewable representation (in UTF-8) of the 100 symbols that you can read in your text editor or Firefox web browser. This viewable representation is for your convenience only, and is not intended to be used as input for your program.
The official grader uses 10 000 different excerpts, selected in the same way from the same 56 Wikipedia editions. However, the grader assigns a different number between 0 and 55 to each language, and a different number between 1 and 65 535 to each symbol.
Subtask 1 [30 points]
Your submission must achieve accuracy of 0.3 or better on the grading server.
Subtask 2 [up to 80 points]
Your score will be 114(α-0.3), rounded to the nearest integer, where α is the accuracy of your submission.
- Use the RunC programming and test environment
- Implementation folder: /home/ioi2010-contestant/language/ (prototype: language.zip)
- To be implemented by contestant: lang.c or lang.cpp or lang.pas
- Contestant interface: lang.h or lang.pas
- Grader interface: grader.h or graderlib.pas
- Sample grader: grader.c or grader.cpp or grader.pasandgraderlib.pas
- Sample grader input: grader.in.1.
Note: Each line of input contains: a two-character language code; an excerpt represented as 100 numbers separated by spaces; the text representation of the excerpt.
- Expected output for sample grader input: If the implementation calls language as specified for each of the 10 000 examples, the sample grader will output OK alpha where alpha is the accuracy.
- Compile and run (command line): runc grader.c or runc grader.cpp or runc grader.pas
- Compile and run (gedit plugin): Control-R, while editing any implementation file.
- Submit (command line): submit grader.c or submit grader.cpp or submit grader.pas
- Submit (gedit plugin): Control-J, while editing any implementation or grader file.