Wednesday, May 2, 2012

Crawl....

Completed the following:

  • Given a starting URL and a URL filter pattern (which keeps the crawl inside the website's domain), a multithreaded crawler explores the target website. 
  • For each page it visits:
    • It parses the price.
    • It stores the result in the database through a connection from the database connection pool; the pool maintains a set of connections that are ready to use.
    • It adds the page's outlinks to the queue of URLs to be visited.
  • Set up GitHub in the cloud.
    • The code is already on the server. 
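The loop above (fetch a page, parse the price, store it through a pooled connection, enqueue in-domain outlinks) can be sketched roughly like this. Everything here is hypothetical: the `PAGES` dict stands in for real HTTP fetching and HTML parsing, the SQLite pool stands in for whatever database is actually used, and all names are made up for illustration.

```python
import queue
import re
import sqlite3
import threading

# Hypothetical in-memory "web": URL -> (price or None, outlinks).
# A real crawler would fetch and parse each page instead.
PAGES = {
    "http://example.com/":  (None,    ["http://example.com/a", "http://other.com/x"]),
    "http://example.com/a": ("19.99", ["http://example.com/b"]),
    "http://example.com/b": ("5.00",  []),
}

URL_FILTER = re.compile(r"^http://example\.com/")  # keep the crawl in-domain

def make_pool(db_path, size):
    """A minimal connection pool: a queue of ready-to-use connections."""
    pool = queue.Queue()
    for _ in range(size):
        pool.put(sqlite3.connect(db_path, check_same_thread=False))
    return pool

def worker(url_queue, pool, seen, lock):
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: time to stop
            url_queue.task_done()
            break
        price, outlinks = PAGES.get(url, (None, []))
        if price is not None:
            conn = pool.get()        # borrow a connection from the pool
            try:
                conn.execute("INSERT INTO prices VALUES (?, ?)", (url, price))
                conn.commit()
            finally:
                pool.put(conn)       # return it for other threads to reuse
        for link in outlinks:
            with lock:               # visit each in-domain URL only once
                if URL_FILTER.match(link) and link not in seen:
                    seen.add(link)
                    url_queue.put(link)
        url_queue.task_done()

def crawl(start_url, db_path="crawl.db", n_threads=4):
    pool = make_pool(db_path, n_threads)
    conn = pool.get()
    conn.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price TEXT)")
    conn.commit()
    pool.put(conn)

    url_queue = queue.Queue()
    seen = {start_url}
    lock = threading.Lock()
    url_queue.put(start_url)

    threads = [threading.Thread(target=worker, args=(url_queue, pool, seen, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    url_queue.join()                 # wait until every queued URL is processed
    for _ in threads:
        url_queue.put(None)          # one sentinel per worker so all of them exit
    for t in threads:
        t.join()
```

The outlinks are enqueued before `task_done()` is called, so `Queue.join()` cannot return while any discovered URL is still unprocessed.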
To Do:

  • There is a bug when the crawler finishes, which seems to come from the worker threads not shutting down correctly.
  • Currently, the crawler handles only one website. We need to write a simple for loop to make it crawl a set of website domains automatically.
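For the shutdown bug, one common fix is the sentinel pattern: after all real work is queued, push one `None` per worker, and have each worker exit when it sees one, so `join()` never hangs. A minimal self-contained demonstration (the doubling task is just a placeholder for real crawl work):

```python
import queue
import threading

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:        # sentinel: exit cleanly instead of hanging on get()
            break
        results.put(item * 2)

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)             # one sentinel per worker
for t in threads:
    t.join()                    # returns promptly; no thread is left blocked

out = sorted(results.get() for _ in range(10))
print(out)                      # -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Once a single crawl shuts down cleanly like this, the second to-do item is just a plain for loop over a list of start URLs, calling the crawl entry point once per domain.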
