Question:

I am using a YQL query (the standard example query, with GOOG, YHOO, MSFT and AAPL) to generate XML for all of the available fields. I want to use a Ruby script to scrape the XML output from the YQL site once it is generated, so that I can run it repeatedly for different stocks and store the data somewhere. My script is not finished yet, but what I have does not seem to do anything. Here is the code:

yahoo_finance_scrape.rb

require 'rubygems'
require 'nokogiri'
require 'restclient'

PAGE_URL = "http://developer.yahoo.com/yql/console/"

yql_query = 'use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") '

if page = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
  puts "YQL query: #{yql_query}, is valid"
  xml_output = Nokogiri::HTML(page)
  lines = xml_output.css('#container #layout-doc #yui-gen3000008 #yui-gen3000009 #yui_3_11_0_3_1393417778356_354
#yui-gen3000015 #yui-gen3000016 div#yui_3_11_0_2_1393417778356_10 #centerBottomView
#outputContainer div#output #outputTabContent #formattedView #viewContent #prexml')
  lines.each do |line|
    puts line.css('span').map{|span| span.text}.join(' ')
    sleep 0.03
  end
end

When I run the program, it only prints:

"YQL query: use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") , is valid"

And then it just stops. I am using that GitHub URL because yahoo.finance.quotes was not working, and someone else on Stack Overflow suggested using it.

If you want to check the css tags, just go to http://developer.yahoo.com/yql/console/ and enter my query and do an inspect element on it. I would post it here, but I don't know how.

Answer:

The output you see is just the content of your yql_query variable, so it does not tell you much.

You probably should not put the "use xxxx as quotes" statement into your code as a string. Check what "someone else" actually had in mind.

The RestClient.post() method returns a response object. With any HTTP operation, always check response.code; otherwise you will not know about errors.

response = RestClient.post(...)
puts "HTTP Response code: #{response.code}"
if response.code == 200
    page = response.to_str
    ...
end
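To make the status check concrete, here is a minimal sketch that runs without network access. FakeResponse and body_if_ok are hypothetical names made up for this illustration; the stand-in only mimics the two methods used above (code and to_str). With the real gem you would pass in the result of RestClient.post instead. Also note that, as far as I know, rest-client raises an exception for most 4xx/5xx responses rather than returning them, so in real code a rescue around the call may be needed as well.

```ruby
# Hypothetical stand-in for a RestClient response, so the sketch runs
# offline; it only mimics the #code and #to_str methods used here.
FakeResponse = Struct.new(:code, :body) do
  def to_str
    body
  end
end

# Hypothetical helper: returns the body for a 200 response, nil otherwise.
def body_if_ok(response)
  puts "HTTP Response code: #{response.code}"
  return nil unless response.code == 200

  response.to_str
end

page = body_if_ok(FakeResponse.new(200, '<html></html>'))
puts page.nil? ? 'request failed' : "got #{page.length} characters"
```

Returning nil on failure keeps the calling code to a single check before the page is handed to Nokogiri.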

According to the Nokogiri website, the xml_output.css() method filters the document like a CSS selector. For example, "#container #layout-doc" means "elements with the id 'layout-doc' inside elements with the id 'container'", and so on down the chain. Is this really what you intend to do? If so, the last "#prexml" on its own should be enough and is much less error-prone, since ids should normally be unique.
