One of the largest sources of complexity for power retailers, traders, or anyone involved in sourcing or delivering power is pricing. Unlike prices in the crude, refined products, and natural gas markets, power prices are highly volatile and update almost constantly. If you are responsible for delivering load on the grid and have not fully sourced or hedged your price exposure, these swings can literally put you out of business.
To obtain real-time and day-ahead power prices, you have two choices: find a data provider to do all the work, or scrape the prices yourself. Web scraping is a delicate art, full of gotchas and issues, because you are effectively building against an endpoint that was never designed to be a data endpoint. Changes to web interfaces occur with little or no forewarning, and even small adjustments to HTML structure, tag names, or ordering can put you in a world of hurt. As a warning: if you choose to scrape, be ready for the possibility of some real-time production support and fast deployments when a provider changes their UI. And they will.
With all that said, let’s take a look at how someone might scrape. For this example, we will be scraping the Day Ahead Market (DAM) Settlement Point Prices, which only need to be scraped once a day. The first thing is to understand how ERCOT presents the information. There are two kinds of views available: a web view, and csv and xml reports. We are staying away from the web views (see the warnings above) and concentrating on the csv and xml reports.
Let’s start with the url for the ERCOT MIS reports app for DAM Settlement Point Prices.
Here are the thoughts you should be having:
- Do I need all the parameters or just the reportTypeId? (Just the reportTypeId)
- Is the reportTypeId correlated to the DAM Settlement Point Prices? Will each type of report have its own integer-based reportTypeId? And do those stay the same? (Not sure about any of these)
So now we have a baseline. By looking at some of the other reports, you can quickly gather fairly strong evidence that each report type has a unique reportTypeId, and that it stays the same. If I know the reportTypeId of the type of reports I want to scrape, I can simply replace the url’s reportTypeId with the report id I want, and theoretically I should be able to pull any report I want. Oh, wait. What is this? When I mouse over the zip file for the xml, I get another url.
Here are the thoughts you should be having:
- Do I need all the parameters or just the docLookupId? (No) (By the way, the mimic_duns looks like something that, if you know the DUNS of an entity, might mock out their view)
- Is the docLookupId correlated to the particular day’s version and file type of the DAM Settlement Point Prices? (No) Will each report have its own integer-based docLookupId? (Yes) And do those follow any identifiable algorithm? (No) (This is not a great scenario.)
Right now, there are a couple of options when you cannot programmatically discern the particular report you are looking for. You can:
- Scrape it all and backfill data gaps programmatically.
- Parse the HTML and find the nodes with the file names, parse the names to find the dates you want between the dots, and then find the zip link in the same HTML row.
I didn’t like either of these, as both are prone to the HTML-change issue described earlier. So I started poking around in the source of the list page and found something interesting.
You see that base href url? http://mis.ercot.com:80/misapp/ Well, I did. And I clicked it. And thank goodness I did because my life just got a ton easier.
Looks like ICE has its own little version of the ERCOT MIS website. And if you dig into the links, you get the first web service (http://mis.ercot.com/misapp/servlets/IceMktRepListWS), which provides a list of the reportTypeId for each report. In this list, we find the ID for the report we are looking for.
<ns0:Report>
  <ns0:ReportTypeID>12331</ns0:ReportTypeID>
  <ns0:ReportName>DAM Settlement Point Prices</ns0:ReportName>
  <ns0:SecurityClassification>PUBLIC</ns0:SecurityClassification>
</ns0:Report>
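Turning that response into a lookup table takes only the standard library. Here is a minimal sketch; the sample below reuses the fragment above, and the namespace URI in it is made up for illustration, so the code matches on local element names rather than the prefix:

```python
import xml.etree.ElementTree as ET

# Sample fragment of an IceMktRepListWS response. The wrapping element
# and namespace URI are illustrative -- inspect a real response for the
# actual values.
SAMPLE = """<ns0:ReportList xmlns:ns0="http://example.com/ercot">
  <ns0:Report>
    <ns0:ReportTypeID>12331</ns0:ReportTypeID>
    <ns0:ReportName>DAM Settlement Point Prices</ns0:ReportName>
    <ns0:SecurityClassification>PUBLIC</ns0:SecurityClassification>
  </ns0:Report>
</ns0:ReportList>"""


def report_type_ids(xml_text):
    """Map ReportName -> ReportTypeID, ignoring namespace prefixes."""
    root = ET.fromstring(xml_text)
    ids = {}
    for elem in root.iter():
        # Match on the local name so a changed namespace URI is harmless.
        if elem.tag.endswith('}Report'):
            name = elem.find('{*}ReportName').text
            ids[name] = int(elem.find('{*}ReportTypeID').text)
    return ids


print(report_type_ids(SAMPLE)['DAM Settlement Point Prices'])  # 12331
```

The `{*}` namespace wildcard in `find()` needs Python 3.8 or newer; on older versions, pass the real namespace URI explicitly.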
Now, we can use these reportTypeIds in the other web service to give us a list of the docLookupIds with some supporting information. http://mis.ercot.com/misapp/servlets/IceMktDocListWS?reportTypeId=12331
<ns0:Document>
  <ns0:ExpiredDate>2015-07-11T23:59:59-05:00</ns0:ExpiredDate>
  <ns0:ILMStatus>EXT</ns0:ILMStatus>
  <ns0:SecurityStatus>P</ns0:SecurityStatus>
  <ns0:ContentSize>91581</ns0:ContentSize>
  <ns0:Extension>zip</ns0:Extension>
  <ns0:FileName/>
  <ns0:ReportTypeID>12331</ns0:ReportTypeID>
  <ns0:Prefix>cdr</ns0:Prefix>
  <ns0:FriendlyName>DAMSPNP4190_xml</ns0:FriendlyName>
  <ns0:ConstructedName>cdr.00012331.0000000000000000.20150610.131348757.DAMSPNP4190_xml.zip</ns0:ConstructedName>
  <ns0:DocID>483750748</ns0:DocID>
  <ns0:PublishDate>2015-06-10T13:13:47-05:00</ns0:PublishDate>
  <ns0:ReportName>DAM Settlement Point Prices</ns0:ReportName>
  <ns0:DUNS>0000000000000000</ns0:DUNS>
  <ns0:DocCount>0</ns0:DocCount>
</ns0:Document>
Nice. Two steps down, one to go. We can now parse the xml to identify the DocID we need by date (PublishDate) and file type (looking for ‘xml’ in the FriendlyName). And once we have the DocID, we can make the final call to retrieve the zip file using the previously known url:
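That filtering step might look like this sketch. The sample is a trimmed version of the document entry above, with an illustrative namespace URI, and the function matches the same way: date prefix of PublishDate, plus a substring check on FriendlyName:

```python
import xml.etree.ElementTree as ET

# Trimmed sample of an IceMktDocListWS response; namespace URI and
# wrapping element are illustrative.
SAMPLE = """<ns0:DocumentList xmlns:ns0="http://example.com/ercot">
  <ns0:Document>
    <ns0:Extension>zip</ns0:Extension>
    <ns0:FriendlyName>DAMSPNP4190_xml</ns0:FriendlyName>
    <ns0:DocID>483750748</ns0:DocID>
    <ns0:PublishDate>2015-06-10T13:13:47-05:00</ns0:PublishDate>
  </ns0:Document>
</ns0:DocumentList>"""


def find_doc_id(xml_text, date, fmt='xml'):
    """Return the DocID published on `date` (YYYY-MM-DD) whose
    FriendlyName mentions the requested file format, else None."""
    root = ET.fromstring(xml_text)
    for doc in root.iter():
        if not doc.tag.endswith('}Document'):
            continue
        published = doc.find('{*}PublishDate').text[:10]  # date part only
        friendly = doc.find('{*}FriendlyName').text
        if published == date and fmt in friendly:
            return doc.find('{*}DocID').text
    return None


print(find_doc_id(SAMPLE, '2015-06-10'))  # 483750748
```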
Using these patterns and having the list of documents, it would be very easy to scrape the site based on dates, looking to fill in data or just parse data on a daily basis. Let’s review the steps.
- Get all the report types and their corresponding reportTypeIds.
- Get all the docIDs for the reportTypeId by parsing the xml by date and/or file type.
- For each node, call the third url to pull in the zip file. Unzip. Parse. Save. Done.
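Stitched together, the pipeline is two url builds plus a download. A sketch of the plumbing, assuming the endpoints named earlier; the final zip url is whatever the mouse-over on the list page shows for your document, so it is passed in rather than hard-coded:

```python
import io
import urllib.request
import zipfile

# Web-service endpoints from the walkthrough above.
REPORT_LIST_URL = 'http://mis.ercot.com/misapp/servlets/IceMktRepListWS'
DOC_LIST_URL = ('http://mis.ercot.com/misapp/servlets/IceMktDocListWS'
                '?reportTypeId={}')


def doc_list_url(report_type_id):
    """Step 2: url listing every document for one report type."""
    return DOC_LIST_URL.format(report_type_id)


def fetch_zip_entries(download_url):
    """Step 3: pull a zip report and yield (name, bytes) per entry.
    `download_url` is the doc-retrieval url observed on mouse-over."""
    raw = urllib.request.urlopen(download_url).read()
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for name in archive.namelist():
            yield name, archive.read(name)
```

From there, parsing and saving the csv or xml payload is ordinary file handling.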
With great power comes great responsibility! Be nice to the ERCOT folks: limit your scraping to only the data you need, at the lowest frequency you need, and store the downloaded raw files in a document server (or an Amazon S3 bucket, or Azure Blob Storage) if you think you might need them again in the future (you will). Disk space is cheap. Also, write two separate processes: one for pulling the data and another for parsing it. That way, other processes can also consume the raw data.
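The "store the raw files" half of that split can be as simple as the sketch below. The directory and filename layout is my own choice, not anything ERCOT prescribes; it just leans on the date that sits between the dots of the ConstructedName shown earlier:

```python
from pathlib import Path


def store_raw(doc_id, constructed_name, payload, root='raw'):
    """Persist an untouched download so a separate parser job can
    re-read it at any time. The publish date is the fourth dotted
    field of ConstructedName, e.g. '20150610' in
    'cdr.00012331.0000000000000000.20150610.131348757.DAMSPNP4190_xml.zip'.
    """
    day = constructed_name.split('.')[3]
    folder = Path(root, day)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f'{doc_id}_{constructed_name}'
    target.write_bytes(payload)  # raw bytes, untouched
    return target
```

A parser process can then walk `raw/<date>/` on its own schedule without ever touching the ERCOT site again.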
Enjoy your scraping.