Wednesday, May 2, 2012

Crawl....

Completed the following:

  • Given a starting URL and a URL filter pattern (which keeps the crawl inside the website's domain), a multithreaded crawler explores the target website. 
  • For each page it visits:
    • It parses the price.
    • It stores the result in the database through a connection from the database connection pool; the pool maintains a set of connections that are ready to use.
    • It adds the page's outlinks to the queue of URLs to be visited.
  • Set up GitHub in the cloud.
    • The code is already on the server. 
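The loop above (fetch a page, parse the price, store it through a pooled connection, enqueue in-domain outlinks) can be sketched roughly like this. Everything here is hypothetical: the `PAGES` dict stands in for real HTTP fetching and HTML parsing, the SQLite pool stands in for whatever database is actually used, and all names are made up for illustration.

```python
import queue
import re
import sqlite3
import threading

# Hypothetical in-memory "web": URL -> (price or None, outlinks).
# A real crawler would fetch and parse each page instead.
PAGES = {
    "http://example.com/":  (None,    ["http://example.com/a", "http://other.com/x"]),
    "http://example.com/a": ("19.99", ["http://example.com/b"]),
    "http://example.com/b": ("5.00",  []),
}

URL_FILTER = re.compile(r"^http://example\.com/")  # keep the crawl in-domain

def make_pool(db_path, size):
    """A minimal connection pool: a queue of ready-to-use connections."""
    pool = queue.Queue()
    for _ in range(size):
        pool.put(sqlite3.connect(db_path, check_same_thread=False))
    return pool

def worker(url_queue, pool, seen, lock):
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: time to stop
            url_queue.task_done()
            break
        price, outlinks = PAGES.get(url, (None, []))
        if price is not None:
            conn = pool.get()        # borrow a connection from the pool
            try:
                conn.execute("INSERT INTO prices VALUES (?, ?)", (url, price))
                conn.commit()
            finally:
                pool.put(conn)       # return it for other threads to reuse
        for link in outlinks:
            with lock:               # visit each in-domain URL only once
                if URL_FILTER.match(link) and link not in seen:
                    seen.add(link)
                    url_queue.put(link)
        url_queue.task_done()

def crawl(start_url, db_path="crawl.db", n_threads=4):
    pool = make_pool(db_path, n_threads)
    conn = pool.get()
    conn.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price TEXT)")
    conn.commit()
    pool.put(conn)

    url_queue = queue.Queue()
    seen = {start_url}
    lock = threading.Lock()
    url_queue.put(start_url)

    threads = [threading.Thread(target=worker, args=(url_queue, pool, seen, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    url_queue.join()                 # wait until every queued URL is processed
    for _ in threads:
        url_queue.put(None)          # one sentinel per worker so all of them exit
    for t in threads:
        t.join()
```

The outlinks are enqueued before `task_done()` is called, so `Queue.join()` cannot return while any discovered URL is still unprocessed.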
To Do:

  • There is a bug when the crawler finishes, which seems to come from the worker threads not shutting down correctly.
  • Currently, the crawler handles only one website. We need to write a simple for loop to make it crawl a set of website domains automatically.
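For the shutdown bug, one common fix is the sentinel pattern: after all real work is queued, push one `None` per worker, and have each worker exit when it sees one, so `join()` never hangs. A minimal self-contained demonstration (the doubling task is just a placeholder for real crawl work):

```python
import queue
import threading

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:        # sentinel: exit cleanly instead of hanging on get()
            break
        results.put(item * 2)

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)             # one sentinel per worker
for t in threads:
    t.join()                    # returns promptly; no thread is left blocked

out = sorted(results.get() for _ in range(10))
print(out)                      # -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Once a single crawl shuts down cleanly like this, the second to-do item is just a plain for loop over a list of start URLs, calling the crawl entry point once per domain.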
