Professors Chih-Jen Lin, Shou-De Lin, and Hsuan-Tien Lin of the NTU Department of Computer Science and Information Engineering (CSIE) led a team to two world championships at the ACM KDD Cup, the premier international competition in data mining and knowledge discovery. The members of the team, Algorithm @ National Taiwan University, are Cheng-Hao Tsai, Chun-Liang Li, Ting-Wei Lin, Shan-Wei Lin, Wei-Chen Chang, Kuang-Hao Huang, Chun-Pai Yang, Kuang-Yi Wu, Tsu-Ming Kuo, Yung Chuang, Shu-Hsin Yuan, Wei-Sheng Tan, Tu-Chun Yin, Tung Yu, Cheng-Kuan Wei, Yu-Chen Lu, Jui-Ping Wang, Yang-Shan Lin, Cheng-Hsia Chang, Hsiao-Yu Tung, and Yu-Chuan Su.
NTU has now placed in this prestigious event for six consecutive years and has won championships in each of the past four (first place in 2008, 2010, 2011, 2012, and 2013, with two track championships in both 2011 and 2013), once again setting a record for the competition. In August the team members presented a poster of their results at the ACM SIGKDD KDD2013 conference, and were later selected for the Championship round.
The ACM KDD Cup has been held since 1997 by the Association for Computing Machinery’s (ACM) Special Interest Group on Knowledge Discovery and Data Mining, alongside its annual ACM Conference on Knowledge Discovery and Data Mining. It is the world’s largest annual data mining competition. Each year’s theme addresses one of the most important issues of the day, combining a demanding technical challenge with substantial commercial value. Participants must combine theoretical development with actual programming: in the three to four months of the competition they have to develop intelligent data mining techniques and a working system that makes predictions on the large data set supplied by the competition sponsor. Each year the competition attracts hundreds of world-class teams from academia (such as the University of Illinois at Urbana-Champaign) and industry (such as IBM Research).
This year the KDD Cup topics were provided by the Microsoft Academic Search division. The first topic was “Author-Paper Recognition”. It involved a large data set of over 250,000 authors and 2.5 million academic papers, used to train a computer to recognize a paper’s author. The second topic was “Author disambiguation”, also based on Microsoft Academic Search. Because the service draws its data from many online sources, the same author can end up with many different IDs. The task was therefore to determine, from the data provided, which of these identities in fact belong to the same person.
While the data sets were not as massive as in previous years, the organizers intentionally left them largely uncleaned, with many errors and missing values remaining, which added to the difficulty of this year’s competition. At the outset of the competition the NTU team’s results were unremarkable, but through the steady improvements gained by the team members’ industrious efforts, the team was finally able to stand out from a fiercely competitive field of several hundred teams.
Building on prior teams’ successes, this year Professors Chih-Jen Lin, Shou-De Lin, and Hsuan-Tien Lin, with the support of the College of Electrical Engineering and Computer Science, the Department of Computer Science and Information Engineering (CSIE), and the Graduate Institute of Networking and Multimedia (GINM), launched a course entitled “Machine Learning Theory and Practice”, which gave students training and the opportunity to join teams participating in the KDD Cup. The course participants were divided into small groups, each free to apply its own creativity and take its own approach to analyzing the data and building models. Weekly in-class reports, the exchange of ideas and experimental results, and the novel methods that emerged from these discussions led to substantial improvements on existing techniques.
During the latter part of the competition, the groups integrated their individual models so that the combined predictions would perform as well as possible. The competition was extremely intense. The second topic, “Author disambiguation”, was completed first, and in the semi-final stage the final four teams worked through many sleepless nights. Once the integration of each group’s models was complete, NTU won the championship with a mean accuracy rate of 0.992, defeating the University of Illinois at Urbana-Champaign.
The first topic was completed a little later. From one week before the close of the competition, the real-time standings were withheld until the final results were announced. Up to that point the NTU team had been ranked outside the top ten, but the participants pulled together as a team and finished with a mean accuracy rate of 0.993, winning the second championship.
NTU has now placed for six consecutive years and won championships in five of them, an impressive record that will be hard to beat.