Fabulous PEAR Usage
Today at work I got tired of going through an arduous process to download raw CSV data files from our Metrica † system, so I implemented a bit of screen scraping to get the job done.
A series of issues makes the automatic retrieval of information from this particular database difficult.
- Firstly, there is only a web frontend available. You must select the type of report and subreport, then the market, the format of the report, and the date you wish to get data from.
- The URL is the key to getting this report rerun at any time. It contains a number of parameters, all of which you just selected in the frames interface, so you can construct your own URL and pull down the HTML page with the results. As a result, a CSV file with a random filename is generated, and you can click the link in the JS menu to get it.
- One problem, however! When you send the URL, the report must be regenerated. The URL first takes you to a page with nothing but a REFRESH meta tag, which points to a new URL (all of the same parameters but a different script). I can only assume that the first URL kick-starts the report generation and then points you to the results. (This doesn’t quite make sense, since the first page loads really fast and the second page very slowly.)
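The two URL-related steps in the list above can be sketched roughly as follows. This is only an illustration: the host, script name, and parameter names are made up (the real Metrica URL isn't shown here), and the regular expression assumes the http-equiv attribute comes before content in the meta tag.

```php
<?php
// Build a report URL from a base address and a set of query parameters.
// The base URL and parameter names passed in are hypothetical placeholders.
function build_report_url(string $base, array $params): string
{
    // http_build_query URL-encodes each value and joins the pairs with "&".
    return $base . '?' . http_build_query($params, '', '&');
}

// Return the target of a <meta http-equiv="refresh"> tag, or null if the
// page has none. Assumes http-equiv appears before content in the tag.
function extract_refresh_url(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
             . 'content=["\']\s*\d+\s*;\s*url=([^"\'>]+)/i';
    if (preg_match($pattern, $html, $m)) {
        // Decode entities such as &amp; that may appear inside the attribute.
        return html_entity_decode(trim($m[1]));
    }
    return null;
}
```

The extracted target may be a relative URL, in which case it would still need to be resolved against the address of the first page before being requested.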
Originally I was using PHP’s file() function, which yielded really weird and inconsistent results, so I switched to the really cool PEAR class Net_Curl, which wraps PHP’s built-in cURL functions in a nice OOP API and greatly simplifies working with them.
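For comparison, here is a minimal sketch of the same kind of fetch written against PHP's built-in cURL functions directly rather than Net_Curl (whose exact API isn't reproduced here). The function name fetch_page and the 120-second timeout are assumptions chosen for illustration.

```php
<?php
// Fetch a URL with the built-in cURL extension and return the response body.
// fetch_page is a hypothetical helper name, not part of Net_Curl.
function fetch_page(string $url, int $timeoutSeconds = 120): string
{
    $ch = curl_init($url);
    // Return the body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects; this does NOT follow <meta> refresh tags,
    // which have to be parsed out of the HTML and requested separately.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Slow report pages need a generous timeout (the value is an assumption).
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    if ($body === false) {
        // Fail loudly instead of handing back partial or empty data.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}
```

Throwing on failure, rather than returning whatever arrived, is one way to avoid silently saving a truncated CSV file.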
This seemed to fix the problem, and I was able to consistently retrieve my CSV file ;) Then I packaged it up into a neat little function. Now I can schedule a daily download of the previous day's data before I come into work (the data is roughly 3 hours behind real time) and automatically pull the latest data twice daily to auto-generate some XLS files. (I did this with another neato PEAR class called Spreadsheet_Excel_Writer.)
php -f getOnlineCSV.php -- -r # Get the latest stats
php -f getOnlineCSV.php -- -y # Get yesterday's stats
Currently those are the only parameters I’ve implemented, but I think I might add a -month, -day, -year set of params. This will be easy to implement with a PEAR class: Console_Getargs.
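As a rough idea of how the two existing flags could map to a report date, here is a small sketch in plain PHP without Console_Getargs. The function name resolve_report_date and the Y-m-d date format are assumptions; the format Metrica actually expects isn't shown above.

```php
<?php
// Map the -r / -y flags to the date whose stats should be fetched.
// resolve_report_date and the Y-m-d format are hypothetical choices.
function resolve_report_date(array $args, DateTimeImmutable $now): string
{
    if (in_array('-y', $args, true)) {
        // Yesterday's stats: step back one calendar day.
        return $now->modify('-1 day')->format('Y-m-d');
    }
    if (in_array('-r', $args, true)) {
        // Latest stats: use today's date.
        return $now->format('Y-m-d');
    }
    throw new InvalidArgumentException('Expected -r (latest) or -y (yesterday)');
}

// Possible use from the command line, skipping the script name in $argv[0]:
// echo resolve_report_date(array_slice($argv, 1), new DateTimeImmutable('now'));
```

Passing the current time in as an argument keeps the date logic easy to check against fixed dates, such as a month boundary.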
Did I mention that Console_ProgressBar is also freaking cool? :)
† A complex system we use to poll, warehouse, and distribute data and performance statistics about our cell sites. Links about T-Mobile and Metrica: