Crawl....
Completed the following:
- Given a starting URL and a URL filter pattern (which keeps the crawl inside the website's domain), a multithreaded crawler explores a target website (see the sketches after this list).
- For each page it visits:
  - It parses the price.
  - It stores the result in the database using a connection from the database connection pool; the pool maintains a set of connections that are ready to use.
  - It adds the page's outlinks to the queue of URLs still to be visited.
- Set up GitHub in the cloud.
- The code is already on the server.
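The post doesn't include the crawler code, so here is a minimal sketch of the loop described above, using only Python's standard library. The start URL, filter pattern, price regex, and link regex are all hypothetical stand-ins, not the real implementation:

```python
import queue
import re
import threading
import urllib.request

# All names below are hypothetical stand-ins; the post does not show the real code.
START_URL = "https://example-shop.com/"
URL_FILTER = re.compile(r"^https://example-shop\.com/")  # keeps the crawl in-domain
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")         # naive price parser
LINK_RE = re.compile(r'href="(https?://[^"]+)"')         # naive link extractor
NUM_WORKERS = 4

url_queue: "queue.Queue[str]" = queue.Queue()
seen = {START_URL}
seen_lock = threading.Lock()

def crawl_page(url: str) -> None:
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    # 1. Parse the price.
    match = PRICE_RE.search(html)
    if match:
        print(url, "->", match.group(1))  # the real crawler writes to the database here
    # 2. Queue in-domain outlinks that have not been seen yet.
    for link in LINK_RE.findall(html):
        if URL_FILTER.match(link):
            with seen_lock:
                if link not in seen:
                    seen.add(link)
                    url_queue.put(link)

def worker() -> None:
    while True:
        url = url_queue.get()
        try:
            crawl_page(url)
        except Exception as exc:
            print("failed:", url, exc)
        finally:
            url_queue.task_done()

url_queue.put(START_URL)
for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()
url_queue.join()  # returns once every queued URL has been processed
```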
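For the connection pool, one minimal sketch is below: a fixed set of connections is opened up front and handed out from a thread-safe queue, so each worker borrows one, writes, and returns it. The post doesn't say which database is used; sqlite3 and the `prices` table are assumptions just to keep the example self-contained:

```python
import queue
import sqlite3
import threading
from contextlib import contextmanager

POOL_SIZE = 4

# Create the (assumed) table once, then pre-open a fixed set of connections.
setup = sqlite3.connect("prices.db")
setup.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price REAL)")
setup.commit()
setup.close()

pool: "queue.Queue[sqlite3.Connection]" = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(sqlite3.connect("prices.db", check_same_thread=False))

write_lock = threading.Lock()  # SQLite allows only one writer at a time

@contextmanager
def pooled_connection():
    conn = pool.get()   # blocks until some connection is free
    try:
        yield conn
    finally:
        pool.put(conn)  # return it so another thread can reuse it

def store_price(url: str, price: float) -> None:
    with pooled_connection() as conn, write_lock:
        conn.execute("INSERT INTO prices VALUES (?, ?)", (url, price))
        conn.commit()
```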
To Do:
- There is a bug when the crawler finishes, which seems to come from multiple workers not stopping cleanly (see the shutdown sketch below).
- Currently, the crawler handles only one website. We need to write a simple for loop so it automatically crawls a set of website domains (also shown in the sketch below).
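The actual cause of the shutdown bug isn't shown, but a common fix when workers never stop is to send one sentinel ("poison pill") per worker once the queue drains, then join the threads. The sketch below assumes the threaded design from the list above and also shows the simple for loop over several domains; every name here is hypothetical:

```python
import queue
import threading

STOP = object()  # sentinel ("poison pill") telling a worker to exit

def worker(url_queue: queue.Queue) -> None:
    while True:
        url = url_queue.get()
        if url is STOP:
            url_queue.task_done()
            break                   # exit instead of blocking on get() forever
        try:
            print("crawling", url)  # stand-in for the real per-page work
        finally:
            url_queue.task_done()

def crawl_domain(start_url: str, num_workers: int = 4) -> None:
    url_queue: queue.Queue = queue.Queue()
    url_queue.put(start_url)
    threads = [threading.Thread(target=worker, args=(url_queue,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    url_queue.join()            # all queued URLs have been processed
    for _ in threads:
        url_queue.put(STOP)     # one sentinel per worker
    for t in threads:
        t.join()                # now every worker has really stopped

# The "simple for loop" from the to-do list: crawl a set of domains in turn.
for start in ["https://shop-a.example/", "https://shop-b.example/"]:
    crawl_domain(start)
```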