I need to do this but I don't even know what it's called

seulater
Posts: 445
Joined: Fri Apr 25, 2008 5:26 am

Post by seulater »

I want to create a stock watcher. If I open Wireshark and then put this into my browser:
http://query.yahooapis.com/v1/public/yq ... eswithkeys

It comes back as a nicely formatted XML page, so I figured I would just do a string compare for the fields and be done. Well, as it turns out, the data I am receiving is not what shows up on that page. Viewed in Wireshark, it's a bunch of gibberish. I am guessing it's some kind of encoding, but I don't have a clue what to even google for.

Anyone have any ideas?
v8dave
Posts: 333
Joined: Thu Dec 31, 2009 8:31 pm

Post by v8dave »

Looking at the source in a browser, it looks like regular XML, so you should be able to parse it.

What are you seeing with Wireshark?

Dave...
seulater
Posts: 445
Joined: Fri Apr 25, 2008 5:26 am

Post by seulater »

Hi Dave. More research on this shows that the data is gzipped. The response header is as follows.

Code:

HTTP/1.1 200 OK
X-YQL-Host: engine4.yql.ac4.yahoo.com
Access-Control-Allow-Origin: *
Cache-Control: no-cache
Content-Type: text/xml;charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Wed, 04 Jul 2012 14:46:33 GMT
Server: YTS/1.20.10
Age: 0
Transfer-Encoding: chunked
Connection: keep-alive
The interesting part here is the "Content-Encoding: gzip" line, because after this header information the rest of the data is just gibberish.
Here is the screenshot of the response.
Attachment: 7-4-2012 9-54-34 AM.jpg
v8dave
Posts: 333
Joined: Thu Dec 31, 2009 8:31 pm

Post by v8dave »

Ah, OK. It would appear that the browser is extracting the gzip data itself, which seems a little strange, as normally it would ask whether I want to save this type of file, so I guess this must be an internal browser method of handling data.

The gibberish is obviously the compressed data, so it looks like you need a library or similar to decompress it.
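As a rough illustration of that "library to unzip" step, here is a minimal Python sketch using only the standard library. It assumes the raw body was captured exactly as it appears on the wire: chunked (per the "Transfer-Encoding: chunked" header above) and then gzip-compressed. The helper names `dechunk` and `make_chunked`, and the sample XML, are made up for the example and are not the real feed format.

```python
import gzip


def dechunk(data: bytes) -> bytes:
    """Reassemble a body sent with 'Transfer-Encoding: chunked'."""
    out = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The chunk size is hexadecimal; ignore any ';extension' after it.
        size_field = data[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = line_end + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def make_chunked(data: bytes, size: int) -> bytes:
    """Demo helper: wrap bytes in chunked framing, like a server would."""
    parts = []
    for i in range(0, len(data), size):
        piece = data[i:i + size]
        parts.append(b"%x\r\n" % len(piece) + piece + b"\r\n")
    parts.append(b"0\r\n\r\n")
    return b"".join(parts)


# Made-up sample payload standing in for the server's XML.
xml = b"<quote><symbol>AAPL</symbol></quote>"
wire_body = make_chunked(gzip.compress(xml), 16)  # the "gibberish"

decoded = gzip.decompress(dechunk(wire_body))
print(decoded.decode("utf-8"))
```

Most HTTP client libraries undo the chunked framing for you, in which case only the `gzip.decompress` call is needed.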

Dave...
seulater
Posts: 445
Joined: Fri Apr 25, 2008 5:26 am

Post by seulater »

Yeah, that link in the first post uses YQL to get stock data. But to my knowledge there is no way to tell it not to gzip the response.
The only output options are XML and JSON, but both are gzipped.
seulater
Posts: 445
Joined: Fri Apr 25, 2008 5:26 am

Post by seulater »

FWIW, this is what I did, and now life is good again.

The GET message sent to the server looks like this:

Code:

GET /v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20(%22AAPL%22)&env=store://datatables.org/alltableswithkeys&Accept=text/plain HTTP/1.1
Host: query.yahooapis.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
If one simply changes "Accept-Encoding: gzip, deflate" to "Accept-Encoding: gzipdeflate", the server no longer responds with gzipped data, presumably because "gzipdeflate" is not an encoding it recognizes.
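A less accidental way to get the same effect is to ask for the "identity" encoding, which HTTP/1.1 defines as "no transformation". The sketch below only builds the raw request text in Python; the function name `build_get_request` and the shortened path are placeholders for the example, and it has not been tested against the YQL server, so treat the server's behaviour as an assumption.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Accept: application/xml",
        "Accept-Encoding: identity",  # request an uncompressed body
        "Connection: close",
        "",  # the blank line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Placeholder path; the real query string from the post would go here.
request = build_get_request("query.yahooapis.com", "/v1/public/yql")
print(request.decode("ascii"))
```

These bytes can be written to a plain TCP socket on port 80, which is roughly what a hand-rolled client in any language ends up doing.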
dciliske
Posts: 623
Joined: Mon Feb 06, 2012 9:37 am
Location: San Diego, CA

Post by dciliske »

Out of curiosity, what language/framework are you using to obtain and extract data from the XML feed? Are you trying to do this on a Netburner?

If you're not doing it on a Netburner, here's the entire package as a single Python module: http://goldb.org/ystockquote.html.

Also, if you ever get into more advanced scraping challenges, send me a PM; it's what I did previously and still do occasionally for personal projects.
Dan Ciliske
Project Engineer
Netburner, Inc
seulater
Posts: 445
Joined: Fri Apr 25, 2008 5:26 am

Post by seulater »

I am using BASIC, and no, it's not an NB platform.
dciliske
Posts: 623
Joined: Mon Feb 06, 2012 9:37 am
Location: San Diego, CA

Post by dciliske »

Dave and seulater,

Here's a rough overview of the portions of the HTTP protocol you're seeing. When sending a request, the HTTP client (in your initial case, a web browser) asks for a given file (usually implying a page) and also tells the server its capabilities for handling a response. The Host line is there so the server can tell which site the request is for, in the event that several hostnames are served from the same address. The User-Agent tells the server what application is making the request, usually along with the OS it is running on. The Accept line tells the server which MIME types the client can handle and what its preferences are. Note the q=x.y portions; these indicate a level of preference for the preceding MIME type, from 0 to 1, defaulting to 1 when omitted. Accept-Language gives the preferred languages (i.e., what the user can read). Web developers use it primarily to choose the default language for a multi-language website.
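To make the q= preference values concrete, here is a small Python sketch that splits an Accept header into MIME types and sorts them by preference. The function name `parse_accept` is invented for the example, and the parsing is deliberately simplified (it ignores parameters other than q), so it is an illustration rather than a full implementation of the header grammar.

```python
def parse_accept(header: str) -> list:
    """Return (mime_type, q) pairs, most preferred first."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        q = 1.0  # a type with no q parameter has full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# The Accept line from the Firefox request earlier in the thread.
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
for media_type, q in parse_accept(accept):
    print(media_type, q)
```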

Finally, we get to the Accept-Encoding header. Accept-Encoding tells the server which content encodings the client can handle. Browsers accept and transparently decompress gzip data because compressed pages load faster over the same bandwidth. Servers like to compress data because it lets them handle a larger number of connections with the same bandwidth. Therefore, if you say you can handle compressed data, the server is likely to give it to you.
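On the receiving side, a client that advertises several encodings has to check the Content-Encoding response header to know which one the server picked. A minimal Python sketch of that dispatch follows; `decode_content` is a hypothetical helper name, the sample XML is made up, and the fallback to raw deflate reflects the assumption that some servers send deflate data without the zlib wrapper.

```python
import gzip
import zlib


def decode_content(body: bytes, content_encoding: str = "") -> bytes:
    """Undo the encoding named in the Content-Encoding response header."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body  # nothing was applied
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            # Assumed fallback: raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")


# Made-up sample payload for demonstration.
xml = b"<quote><symbol>AAPL</symbol></quote>"
print(decode_content(gzip.compress(xml), "gzip").decode("utf-8"))
```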

While it's fairly dry and a long read, I'd recommend taking a look at the RFC on the HTTP protocol (HTTP/1.1 at least: http://www.ietf.org/rfc/rfc2616.txt). It explains a lot about how to make requests like this and what the responses mean.

One last note: while you're not actually Firefox when making these requests, some poorly coded websites will refuse to load unless you impersonate a major browser (i.e., send that browser's User-Agent string).
Dan Ciliske
Project Engineer
Netburner, Inc
dciliske
Posts: 623
Joined: Mon Feb 06, 2012 9:37 am
Location: San Diego, CA

Post by dciliske »

seulater wrote: I am using basic, and no its not a NB platform.
If by basic you mean Visual Basic and not VBA, you might want to take a look at the HTML Agility Pack for .NET. I've used it in every single web scraper I've done in .NET.
Dan Ciliske
Project Engineer
Netburner, Inc