Home

Fabulous PEAR Usage


Today at work I got tired of going through an arduous process to download raw CSV data files from our Metrica†. So I implemented a bit of screen scraping to get the job done.

There are a number of issues that make the automatic retrieval of information from this particular database difficult.

  • Firstly, there is only a web frontend available. You must select the type of report and subreport, then the market, the format of the report, and the date you wish to get data from.
  • You then click a button in this frame and a new window is launched. The server churns for a bit and finally you see an HTML page with a javascript-drop-down-menu thing and the data displayed within some <pre> tags.
  • There are two options for getting this data, and they are both inside the javascript drop-down. You can either download a CSV file of the results or get a URL you can send to someone, which will regenerate this particular report and show them the HTML page (with CSV links and stuff).
  • The URL is the key to getting this report rerun at any time. It contains a number of parameters, all of which you just selected in the frames interface. Therefore, you can construct your own URL as you please and pull down the HTML page with the results. As a result, a CSV file with a random filename is generated, and you can click the link in the JS menu to get it.
  • One problem, however! When sending the URL, the report must be regenerated. The URL first takes you to a page with nothing but a REFRESH meta-tag, which points to a new URL (all of the same parameters but a different script). I can only assume that the first URL kick-starts the report generation and then points you to the results. (This doesn't quite make sense, since the first page loads really fast and the second very slowly.)
  • Once you get the HTML page with the plaintext results, you can scrape the CSV URL out of the inlined javascript and download the CSV file.
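The two scraping steps above — following the REFRESH meta-tag and digging the CSV link out of the inline javascript — boil down to a couple of regular expressions. Here's a rough sketch; the markup patterns are hypothetical stand-ins, since I obviously can't post the real Metrica pages:

```php
<?php
// Pull the target URL out of a <meta http-equiv="refresh"> tag.
// Returns null if no refresh tag is found.
function parseMetaRefresh($html) {
    if (preg_match('/http-equiv=["\']?refresh["\']?[^>]*content=["\'][^;]*;\s*url=([^"\']+)["\']/i', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Dig the CSV link out of the inline javascript menu.
// Assumes the JS quotes the path, e.g. window.open('/reports/tmp/abc123.csv')
function scrapeCsvUrl($html) {
    if (preg_match('/[\'"]([^\'"]+\.csv)[\'"]/i', $html, $m)) {
        return $m[1];
    }
    return null;
}
```

Fetch the first URL, feed the body through parseMetaRefresh() to get the second URL, fetch that, and feed it through scrapeCsvUrl() for the final download.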

Originally I was using PHP's file() function, which yielded really weird and inconsistent results, so I switched to the really cool PEAR class Net_Curl, which greatly simplifies interaction with PHP's built-in cURL functions via a nice OOP API.
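For the curious, the Net_Curl usage looks roughly like this. This is a minimal sketch from memory (the URL is made up, and you should check the PEAR docs for the exact API on your installed version):

```php
<?php
require_once 'Net/Curl.php';

// Net_Curl wraps the curl_* functions so you don't have to juggle
// curl_setopt() calls by hand.
$curl = new Net_Curl('http://metrica.example.com/report.cgi?market=X&date=20040101');
$curl->timeout = 60;          // the report pages can be slow to generate

$response = $curl->execute(); // page body on success, PEAR_Error on failure
if (PEAR::isError($response)) {
    die('Fetch failed: ' . $response->getMessage());
}
echo $response;
```

Compare that with hand-rolling curl_init()/curl_setopt()/curl_exec()/curl_close() for every request.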

This seemed to fix the problem and I was able to consistently retrieve my CSV file ;) Then I packaged it up into a nice, neat little function, and now I can schedule daily downloads of the previous day's data (the data is roughly 3 hours behind real-time) before I come into work, and automatically pull the latest data twice daily to auto-generate some XLS files (I did this with another neato PEAR class called Spreadsheet_Excel).

php -f getOnlineCSV.php -- -r # Get the latest stats

php -f getOnlineCSV.php -- -y # Get yesterday's stats
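The scheduling itself is just cron. Something along these lines (the paths and times here are hypothetical):

```
# Grab yesterday's stats before I get to work
0 6 * * *    php -f /home/me/bin/getOnlineCSV.php -- -y
# Pull the latest stats twice daily
0 10,16 * * * php -f /home/me/bin/getOnlineCSV.php -- -r
```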

Currently those are the only parameters I've implemented, but I think I might add a -month, -day, -year set of params. This will be easy to implement with a PEAR class: Console_Getargs.
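Until Console_Getargs is wired in, here's a rough idea of how the flags might resolve to a report date. This is a hypothetical plain-PHP stand-in, not the Console_Getargs API; the function name and the Ymd date format are my own assumptions:

```php
<?php
// Hypothetical stand-in for the eventual Console_Getargs config:
// turn argv-style flags into the date to request, defaulting to today.
function resolveReportDate($args) {
    // -y: yesterday's stats; -r (or nothing): the latest stats
    if (in_array('-y', $args)) {
        return date('Ymd', strtotime('yesterday'));
    }
    // a future -month/-day/-year trio would slot in here
    return date('Ymd');
}
```

With Console_Getargs, this collapses into a declarative config array passed to Console_Getargs::factory(), which also gives you usage messages for free.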

Did I mention that Console_ProgressBar is also freaking cool? :)

† A complex system we use to poll, warehouse, and distribute data and performance statistics about our cell sites. Links about T-Mobile and Metrica: [1] [2] [3] [4]