Problem description:

I have been working on this assignment for a while now. The regex itself is not particularly difficult, but I don't quite follow how to produce the output they want.

Your program should:

  • Read the html of a webpage (which has been stored as textfile);
  • Extract all the domains referred to and list all the full http addresses related to these domains;
  • Extract all the resource types referred to and list all the full http addresses related to these resource types.

Please solve the task using regular expressions and re functions/methods. I suggest using ‘finditer’ and ‘groups’ (there might be other possibilities). Please do not use string functions where re is better suited.

The output is supposed to look like this:

www.fairfaxmedia.co.nz

http://www.fairfaxmedia.co.nz

www.essentialmums.co.nz

http://www.essentialmums.co.nz/

http://www.essentialmums.co.nz/

http://www.essentialmums.co.nz/

www.nzfishingnews.co.nz

http://www.nzfishingnews.co.nz/

www.nzlifeandleisure.co.nz

http://www.nzlifeandleisure.co.nz/

www.weatherzone.co.nz

http://www.weatherzone.co.nz/

www.azdirect.co.nz

http://www.azdirect.co.nz/

i.stuff.co.nz

http://i.stuff.co.nz/

ico

http://static.stuff.co.nz/781/3251781.ico

zip

http://static2.stuff.co.nz/1392867595/static/jwplayer/skin/Modieus.zip

mp4

http://file2.stuff.co.nz/1394587586/272/9819272.mp4

What I really need help with is how to filter and group the matches so that the output looks like that.

Answer:
  1. Create a list of (keyword, url) tuples, where the keyword is the domain or the resource type.
  2. Sort the list by keyword.
  3. Group the sorted list by keyword using itertools.groupby.
  4. For each keyword, print the keyword and then all of its URLs (indented).
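
The four steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the `html` string is a made-up stand-in for the stored text file, the two patterns assume addresses sit in double-quoted `href` attributes, and the names `DOMAIN_RE`, `RESOURCE_RE`, `keyword_url_pairs` and `grouped_lines` are hypothetical. The real page may need looser patterns (for example `src` attributes or single quotes).

```python
import re
from itertools import groupby

# Hypothetical sample standing in for the stored HTML text file.
html = '''
<a href="http://www.essentialmums.co.nz/">Essential Mums</a>
<a href="http://www.fairfaxmedia.co.nz">Fairfax</a>
<a href="http://www.essentialmums.co.nz/">Essential Mums again</a>
<link rel="icon" href="http://static.stuff.co.nz/781/3251781.ico">
'''

# Assumed patterns. In both, group 1 is the full address and group 2 the keyword.
DOMAIN_RE = re.compile(r'href="(http://([^/"]+)[^"]*)"')          # keyword = domain
RESOURCE_RE = re.compile(r'href="(http://[^/"]+/[^"]*\.(\w+))"')  # keyword = extension


def keyword_url_pairs(pattern, text):
    """Step 1: build a list of (keyword, url) tuples with finditer and groups."""
    pairs = []
    for match in pattern.finditer(text):
        url, keyword = match.groups()
        pairs.append((keyword, url))
    return pairs


def grouped_lines(pairs):
    """Steps 2-4: sort by keyword, group with groupby, format the output lines."""
    def by_keyword(pair):
        return pair[0]

    lines = []
    # groupby only merges adjacent items, so the list must be sorted first.
    for keyword, group in groupby(sorted(pairs, key=by_keyword), key=by_keyword):
        lines.append(keyword)
        for _, url in group:
            lines.append("    " + url)
    return lines


for pattern in (DOMAIN_RE, RESOURCE_RE):
    print("\n".join(grouped_lines(keyword_url_pairs(pattern, html))))
```

Because `sorted` is stable, the URLs under each keyword stay in the order they appeared in the page. Note that this sketch lists every address under its domain, including resource files; if the assignment wants only the bare site addresses in the domain section, the domain pattern would have to be tightened accordingly.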