Question:

I hope this isn't a duplicate question. I've spent quite a bit of time looking for a working solution, but I'm having no luck. What I'm trying to do is loop through each XML node and pull out a specific child node. To do this I'm using Ruby, Nokogiri and XPath.

I have a simple XML file, sitemap.xml, that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<url>
  <loc>http://www.stackoverflow.com/questions/ask1/</loc>
</url>
<url>
  <loc>http://www.stackoverflow.com/questions/ask2/</loc>
</url>
<url>
  <loc>http://www.stackoverflow.com/questions/ask3/</loc>
</url>

So I'm trying to extract each <loc>. This is my code:

require 'nokogiri'

siteMap = 'sitemap.xml'
sm = File.open(siteMap)
docSM = Nokogiri::XML(sm)
siteMapLinks = docSM.xpath("/url/loc").inner_text
print siteMapLinks.to_s + "\n"

The output >

http://www.stackoverflow.com/questions/ask1/

As you can see, it does not output all of the nodes. I have tried putting the code in a for loop, but all that does is repeat the same node. Any idea how to get my desired output?

Desired output >

http://www.stackoverflow.com/questions/ask1/

http://www.stackoverflow.com/questions/ask2/

http://www.stackoverflow.com/questions/ask3/

Answer:

That was close, but it misses a few small details. Your file has no single root element, so Nokogiri stops parsing once the first top-level tag is closed. If you want it to parse all the URLs, you need an enclosing tag, as in:

<?xml version="1.0" encoding="UTF-8"?>
<urls>
  <url>
    <loc>http://www.stackoverflow.com/questions/ask1/</loc>
  </url>
  <url>
    <loc>http://www.stackoverflow.com/questions/ask2/</loc>
  </url>
  <url>
    <loc>http://www.stackoverflow.com/questions/ask3/</loc>
  </url>
</urls>

Now you can query your document with

docSM.xpath("//url/loc").each do |node|
  puts node.inner_text
end

If you do

docSM.xpath("//url/loc").inner_text

as you suggested, you will get a single string with all the text concatenated and no separator in between.

Answer:

Your file is not a well-formed XML document because it contains more than one root node. If you inspect the contents of your docSM variable, you should see that Nokogiri has only parsed the first <url>, because it's the first root node.

You need to wrap all of the <url>s in a higher-level node to create a valid document, i.e.:

<urls>
  <url>...</url>
  <url>...</url>
</urls>
Answer:

Your XML isn't valid. You can test that by calling the errors method on your document:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<url>
  <loc>http://www.stackoverflow.com/questions/ask1/</loc>
</url>
<url>
  <loc>http://www.stackoverflow.com/questions/ask2/</loc>
</url>
EOT

doc.errors # => [#<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]