
Discussion Board


Thread: pyparsing

  1. #1
    Registered User
    Join Date
    Feb 2008
    Posts
    1

    pyparsing

    I'm trying to use pyparsing to grab CNN's top 5 headlines. Does anyone know of a way to do this?

    Also, is it possible to grab only the URLs that appear between these two markers?

    startnews = '<div class="cnnSubHead">Latest News</div>'
    endnews = '/24hours/'

    Right now my code grabs every link text on the whole page instead of just that section:
    -------------------------------------------------------------

    Code:
    
    from pyparsing import Word, Suppress, CharsNotIn # import what we need
    import urllib
    
    startnews = '<div class="cnnSubHead">Latest News</div>'
    endnews = '/24hours/'
    
    filter1 = Suppress('>') # filter out stuff we dont want to show up
    filter2 = Suppress('</a>')
    
    pattern = filter1 + CharsNotIn('<').setResultsName('newslisting') + filter2 # setup search pattern
    
    cnnurl = 'http://www.cnn.com/' # url to search
    
    cnnconnect = urllib.urlopen(cnnurl) # connect to url
    
    readpage = cnnconnect.read() # read html src into a string
    
    cnnconnect.close() # close connection to resource
    
    for theloop,startnews,endnews in pattern.scanString(readpage): # loop through resource
    
        print '[+]', theloop.newslisting # display results
    ----------END CODE----------------
    scripteaze

  2. #2
    Regular Contributor
    Join Date
    Jan 2004
    Location
    Helsinki
    Posts
    376

    Re: pyparsing

    Since this is a pyparsing-specific question, it might get more attention in a pyparsing-related forum.
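
    For what it's worth, here is a minimal sketch of one way to limit the scan to the part of the page between the two markers. It assumes Python 3 and a pyparsing release that still provides the camelCase names (makeHTMLTags, SkipTo, scanString). The helper name links_between and its limit parameter are made up for illustration, and the real CNN markup may not match the markers from the first post. Note also that the loop in the original code reuses startnews and endnews as loop variables, so the marker strings are overwritten and never used to restrict anything.

```python
from pyparsing import SkipTo, makeHTMLTags


def links_between(html, start_marker, end_marker, limit=5):
    """Return up to `limit` (headline, href) pairs for <a> tags between the markers.

    Hypothetical helper: slices the page with plain string searches first,
    then lets pyparsing scan only that slice.
    """
    start = html.find(start_marker)
    if start == -1:
        return []  # start marker not on the page
    start += len(start_marker)

    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)  # no end marker: scan to the end of the page
    section = html[start:end]

    # makeHTMLTags builds expressions for <a ...> and </a>; attributes such
    # as href become named results on the opening tag.
    a_open, a_close = makeHTMLTags("a")
    link = a_open + SkipTo(a_close)("headline") + a_close

    results = []
    for tokens, _match_start, _match_end in link.scanString(section):
        if len(results) >= limit:
            break
        results.append((tokens.headline.strip(), tokens.href))
    return results
```

    To try it against a live page, the HTML would first have to be fetched and decoded to a string (in Python 3, urllib.request.urlopen replaces the old urllib.urlopen), then passed in together with startnews and endnews from the first post.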
    Mikko Ohtamaa

    http://mfabrik.com
    http://blog.mfabrik.com
