Java Web Crawler using JSoup

A web crawler is a much-needed tool for many business applications. A few paid programs already accomplish this task, but I wanted to create a web crawler that is expandable. Currently it traverses a website and gathers the links it finds. Future updates will provide capabilities such as:

  • Checking broken image links
  • Gathering useful information from each page
  • Checking broken links
  • Checking for repetitive information
Each one of these is crucial for SEO, and now there's going to be an automated way to check each of them.

To begin, download the website-crawler from https://github.com/dinocajic/java-website-crawler

Create a project in your IDE and copy the files from the src folder into it. Open the CrawlSite.java file and change the websiteAddress property to the site you want to crawl. That's it. Compile and run.

To run through the program: Main.java instantiates CrawlSite. In CrawlSite.java, the output.txt file is created so that the links can be stored in it. The URL is passed to the storeLinksFromPage() method and the fun begins. If you want to traverse only a certain number of links, you can set the stopAfter property to a number that you're comfortable with (e.g. 500). Each link is written to the output.txt file and also added to a linked list of visited pages so that the crawler doesn't have to visit it again.
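
The bookkeeping described above can be sketched roughly as follows. This is a minimal sketch, not the code from the repository: the class name CrawlLog and the method record() are hypothetical, and only stopAfter, the visited linked list, and output.txt come from the description above. It assumes the limit counts recorded links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper (not from the repository): remembers visited links
// and writes each newly seen link to the supplied Writer.
class CrawlLog {
    private final List<String> visited = new LinkedList<>();
    private final Writer output;
    private final int stopAfter; // assumed meaning: maximum number of links to record

    CrawlLog(Writer output, int stopAfter) {
        this.output = output;
        this.stopAfter = stopAfter;
    }

    // Records the link if it is new and the limit has not been reached.
    // Returns true when the link was recorded, false when it was skipped.
    boolean record(String url) {
        if (visited.size() >= stopAfter || visited.contains(url)) {
            return false;
        }
        visited.add(url);
        try {
            output.write(url + System.lineSeparator());
            output.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    List<String> visited() {
        return visited;
    }
}
```

To write to a file, you could pass new FileWriter("output.txt") as the Writer. Note that contains() on a linked list scans every entry, so for large sites a HashSet would make the "already visited" check faster.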

All of the elements with the "a" tag are grabbed, and the link is extracted from each element's "href" attribute. The method goes through each of the links. Once it makes sure that the link goes to another page, the link is inserted into the visited pages linked list (for future use). Also, if the page hasn't been visited, the method recursively calls itself to begin the process of getting the links from within the new page.
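
The recursive traversal described above might look something like this. It is a sketch under stated assumptions rather than the repository's implementation: the class name LinkTraversalSketch, the siteRoot check, and the linksOnPage function are all hypothetical. To keep the example runnable without a network connection, link extraction is passed in as a function and the demo uses an in-memory map of pages. With jsoup, that function could be built from Jsoup.connect(url).get().select("a[href]") and Element.absUrl("href").

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the recursive crawl; not the repository's code.
class LinkTraversalSketch {
    private final List<String> visited = new LinkedList<>();
    private final String siteRoot; // assumption: only links under this prefix are followed
    private final int stopAfter;
    // Returns the absolute URLs of the <a href> links found on a page.
    private final Function<String, List<String>> linksOnPage;

    LinkTraversalSketch(String siteRoot, int stopAfter,
                        Function<String, List<String>> linksOnPage) {
        this.siteRoot = siteRoot;
        this.stopAfter = stopAfter;
        this.linksOnPage = linksOnPage;
    }

    void storeLinksFromPage(String url) {
        // Stop at the limit, and never process the same page twice.
        if (visited.size() >= stopAfter || visited.contains(url)) {
            return;
        }
        visited.add(url);
        for (String link : linksOnPage.apply(url)) {
            // Follow only links that stay on the same site.
            if (link.startsWith(siteRoot)) {
                storeLinksFromPage(link);
            }
        }
    }

    List<String> visited() {
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "website" standing in for real HTTP requests.
        Map<String, List<String>> pages = Map.of(
            "https://example.com/", List.of("https://example.com/a", "https://other.org/"),
            "https://example.com/a", List.of("https://example.com/", "https://example.com/b"),
            "https://example.com/b", List.of());
        LinkTraversalSketch crawler = new LinkTraversalSketch(
            "https://example.com/", 500, url -> pages.getOrDefault(url, List.of()));
        crawler.storeLinksFromPage("https://example.com/");
        System.out.println(crawler.visited());
    }
}
```

Because each new page adds a stack frame, a very deep site could exhaust the call stack. Replacing the recursion with an explicit queue of pending links would avoid that while visiting the same pages.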
