The winner of our last programming challenge is Jichun Wang. Jichun is a Sun Microsystems alumnus, now a senior software engineer at Synopsis. He uses Perl to call a Google RESTful API and parses the JSON response to get the result count for each Google search. You can find his program here on my SAS_ACADEMY group.
Because the API he used is deprecated, the counts it returns differ significantly from the ones you get by googling directly in a browser; however, he got the order correct!
| A-Hero | Count by API | Count by Browser |
| Iron Man | 18,700,000 | 266 M |
First of all, Megha Agarwal is the winner of our first programming challenge: Web Crawling (Level 0). Her program retrieves the laureate list for every year together with the hyperlinks. I will publish her code, together with my comments on this challenge, soon.
This time I would like to move one level deeper into the web by chasing the links we got from our first challenge. Start from the same domain, http://nobelprize.org/nobel_prizes/physics/laureates/. Take 1999 for example. If you click the year, you get to the next web page, where you find a brief intro to the achievement of ’t Hooft and Veltman: “for elucidating the quantum structure of electroweak interactions in physics”.
So here is the challenge: again, write a program in whichever programming language you feel most comfortable with to loop through the hyperlinks from 1901 to 2010 and extract the achievement of each year’s laureates from the next level.
Please send your code to me. The deadline is May 1, 2011, and the prize is again a $25 Fry’s gift card!
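To make the task concrete, here is a minimal Python sketch of the loop. It rests on two assumptions that are not confirmed by the site: that each year's page lives at the base URL followed by the year and a slash, and that the achievement appears on the page as a quoted phrase starting with "for". The names `extract_achievement` and `crawl_achievements` are made up for this illustration; treat it as a starting point, not a reference solution.

```python
import re
from html import unescape
from urllib.request import urlopen

BASE = "http://nobelprize.org/nobel_prizes/physics/laureates/"

# Assumption: the achievement is a quoted phrase that starts with "for",
# in either straight or curly double quotes.
QUOTED_FOR = re.compile(r'["\u201c](for\s[^"\u201d]+)["\u201d]')


def extract_achievement(html):
    """Return the first quoted 'for ...' phrase in the page, or None."""
    text = unescape(re.sub(r"<[^>]+>", " ", html))  # crude tag stripping
    text = " ".join(text.split())                   # normalize whitespace
    match = QUOTED_FOR.search(text)
    return match.group(1).strip() if match else None


def crawl_achievements(first=1901, last=2010):
    """Fetch each year's page and map year -> achievement (or a note)."""
    results = {}
    for year in range(first, last + 1):
        url = f"{BASE}{year}/"  # assumed URL pattern
        try:
            with urlopen(url, timeout=30) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError as error:  # covers URLError, HTTPError and timeouts
            results[year] = f"fetch failed: {error}"
            continue
        results[year] = extract_achievement(html) or "no achievement text found"
    return results
```

Calling `crawl_achievements(1999, 1999)` would fetch a single year, which is a cheap way to check the two assumptions against the real page before running the full loop.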
Yes, SAS can read web pages via HTTP requests. This means you can retrieve information and analyze data directly from the web. Isn’t it cool?
Alright, here is the game: Take the web page “All Nobel Prizes in Physics” (sorry for the idiosyncrasy of this choice) and focus on the list of years and the laureates for each year. How can you get simple descriptive statistics such as “in how many years was no Nobel Prize in Physics awarded” or “in how many years were there three laureates” from this list?
- A programmer’s answer: Write a program
- A SAS programmer’s answer: Write a SAS program
- The most unacceptable answer: Copy and paste and count by hand
- Answer a la copy and paste but a little smarter: Copy and paste into vi. Then use the command
and save the result as a csv file. The first three lines of this csv look like
2010,Andre Geim, Konstantin Novoselov
2009,Charles Kuen Kao, Willard S. Boyle, George E. Smith
2008,Yoichiro Nambu, Makoto Kobayashi, Toshihide Maskawa
As long as you get a CSV, the job is done, ’cause SAS can handle the rest for sure.
OK. Now seriously, here is the challenge. Write a program, in whichever programming language you feel most at home with, to read this web page, such that:
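For readers without SAS at hand, the counting itself is small. The sketch below uses Python instead, and assumes every line has the shape shown above: the year, a comma, then the laureates separated by commas. It also assumes that a year without a prize shows up as a year with nothing after it, and that no name contains a comma; both are guesses about the data, and `laureate_counts` is a name invented for this example.

```python
from collections import Counter


def laureate_counts(lines):
    """Map 'number of laureates' -> 'number of years' for lines like
    '2010,Andre Geim, Konstantin Novoselov'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Everything before the first comma is the year; the rest are names.
        _year, _, names = line.partition(",")
        laureates = [name.strip() for name in names.split(",") if name.strip()]
        counts[len(laureates)] += 1
    return counts


sample = [
    "2010,Andre Geim, Konstantin Novoselov",
    "2009,Charles Kuen Kao, Willard S. Boyle, George E. Smith",
    "2008,Yoichiro Nambu, Makoto Kobayashi, Toshihide Maskawa",
]
counts = laureate_counts(sample)
print("years with three laureates:", counts[3])
print("years with no prize:", counts[0])
```

With the full list loaded from the file, `counts[0]` and `counts[3]` answer the two questions posed earlier.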
- The input is the url.
- The output shall contain the link behind the year, the year, and the list of laureates (or a note saying there is none) for that year.
- The result shall be computer readable, for example, a SAS dataset or a CSV file:
"/nobel_prizes/physics/laureates/2010/",2010,"Andre Geim, Konstantin Novoselov"
"/nobel_prizes/physics/laureates/2009/",2009,"Charles Kuen Kao, Willard S. Boyle, George E. Smith"
"/nobel_prizes/physics/laureates/2008/",2008,"Yoichiro Nambu, Makoto Kobayashi, Toshihide Maskawa"
- Some research into the HTML source underneath this particular web page is necessary.
- I call this level 0 web crawling ’cause only a single web page is involved.
- We had better set a cutoff date; let’s say one month.
Have fun and happy coding!
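One possible shape for such a program, sketched in Python with only the standard library. The markup it expects is an assumption, since the real page source has to be inspected first: each year is taken to be a link whose href ends in `/nobel_prizes/physics/laureates/<year>/`, followed by the laureate names inside the same block element (`p`, `li`, `tr` or `div`). The class and function names, and the wording of the "No prize awarded" note, are invented for this illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.request import urlopen

YEAR_HREF = re.compile(r"/nobel_prizes/physics/laureates/(\d{4})/?$")
BLOCK_TAGS = {"p", "li", "tr", "div"}  # assumed: one year per block element


class YearListParser(HTMLParser):
    """Collect (href, year, laureates) for every year link on the page."""

    def __init__(self):
        super().__init__()
        self.rows = []              # finished (href, year, laureates) tuples
        self._row = None            # [href, year, text pieces] in progress
        self._in_year_link = False

    def _close_row(self):
        if self._row is not None:
            href, year, pieces = self._row
            text = " ".join("".join(pieces).split()).strip(" ,")
            self.rows.append((href, year, text or "No prize awarded"))
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            match = YEAR_HREF.search(href)
            if match:               # a year link starts a new row
                self._close_row()
                self._row = [href, int(match.group(1)), []]
                self._in_year_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_year_link = False
        elif tag in BLOCK_TAGS:     # end of the block ends the row
            self._close_row()

    def handle_data(self, data):
        # Keep text after the year link, but not the year text itself.
        if self._row is not None and not self._in_year_link:
            self._row[2].append(data)

    def close(self):
        super().close()
        self._close_row()


def parse_year_list(html):
    parser = YearListParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Quote strings and leave the year bare, as in the sample rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def crawl_level0(url):
    """The input is the URL; the output is CSV text."""
    with urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return rows_to_csv(parse_year_list(html))
```

Keeping the parsing (`parse_year_list`) separate from the fetching (`crawl_level0`) means the parser can be tried on a saved copy of the page while working out what the real markup looks like.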
Update (3/8/2011): If you would like to participate, you can send your code to me. You can check my credentials on my LinkedIn profile. Even though this challenge is just an initiative for now, some of my colleagues are very enthusiastic about making it a formal event. So stay tuned!
Update (3/10/2011): Thanks to Sophie, Melina and Leila, this challenge is now formal. The deadline is April 1, 2011, and the prize is a $25 Fry’s gift card!