XPath location failure

An XPath copied directly from Google Chrome fails to extract the corresponding content in Python.

The XPath that Chrome copies contains extra tbody elements!!

  • Problem:
    • When scraping a page, an XPath obtained directly from Chrome fails to return the corresponding content in the crawler (the same issue shows up with Scrapy or plain lxml).
  • Cause:

    • Browsers apply error correction to non-standard HTML documents, while lxml only sees the page source — note: the raw source, not the DOM shown in the developer tools.
    • The last table in the source does not contain a tbody. The browser inserts one automatically, but lxml does not, so the XPath matches nothing.
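    This difference is easy to verify with a minimal snippet (the table content here is a made-up example, not from the target site):

    ```python
    # lxml parses the raw source as-is and, unlike Chrome, does NOT insert
    # a tbody element into tables that lack one.
    from lxml import etree

    snippet = '<html><body><table><tr><td>Beijing</td></tr></table></body></html>'
    root = etree.HTML(snippet)

    # Chrome's DOM would show table > tbody > tr, so a copied XPath contains /tbody/,
    # but in the lxml tree there is no tbody element at all:
    print(root.xpath('//table/tbody/tr/td/text()'))  # []
    print(root.xpath('//table/tr/td/text()'))        # ['Beijing']
    ```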
  • Example of the problem

    # -*- coding: utf-8 -*-
    from lxml import etree
    import requests

    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2014/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    selector = etree.HTML(html.text)
    # Chrome-copied XPath: the tbody steps do not exist in the raw source, so this returns []
    content = selector.xpath('//html/body/table[2]/tbody/tr[1]/td/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr/td/a/text()')
    for each in content:
        print(each)
  • Other problems in the code!

    The XPath is far too long, which makes it error-prone.
    It can be rewritten as:

    from lxml import etree
    import requests

    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2014/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    selector = etree.HTML(html.text)
    # Anchor on a stable class attribute instead of a long absolute path
    nodes = selector.xpath('//tr[@class="provincetr"]/node()')

    for each in nodes:
        print(each.xpath('string()'))
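    A lighter-weight alternative (my own suggestion, not from the original post): since the only mismatch is the auto-inserted tbody, stripping the /tbody steps from the Chrome-copied XPath also makes it match the raw source:

    ```python
    # Strip the tbody steps that Chrome inserted into the copied XPath,
    # leaving a path that matches the raw page source as lxml sees it.
    chrome_xpath = ('//html/body/table[2]/tbody/tr[1]/td/table/tbody/'
                    'tr[2]/td/table/tbody/tr/td/table/tbody/tr/td/a/text()')
    lxml_xpath = chrome_xpath.replace('/tbody', '')
    print(lxml_xpath)
    # //html/body/table[2]/tr[1]/td/table/tr[2]/td/table/tr/td/table/tr/td/a/text()
    ```

    This keeps the absolute path, though, so the relative `//tr[@class="provincetr"]` form above is still the more robust fix.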