
Java Web Crawler using JSoup

A web crawler is a much-needed tool for many business applications. A few paid programs accomplish this task, but I wanted to create a web crawler that is expandable. Currently it traverses a website and gathers the links it finds. Future updates will add capabilities such as:

  • Checking broken image links
  • Gathering useful information from each page
  • Checking broken links
  • Checking for repetitive information
Each of these matters for SEO, and the crawler will provide an automated way to check all of them.

To begin, download the website-crawler from https://github.com/dinocajic/java-website-crawler

Create a project in your IDE and copy in the files from the src folder. Open CrawlSite.java and set the websiteAddress property to the site you want to crawl. That's it: compile and run.

Here is a walk through the program. Main.java instantiates CrawlSite. In CrawlSite.java, the output.txt file is created so that the links can be stored in it. The URL is then passed to the storeLinksFromPage() method and the fun begins. If you want to traverse only a certain number of links, set the stopAfter property to a number that you're comfortable with (e.g. 500). Each link is written to output.txt and also added to a linked list of visited pages so that the crawler doesn't have to visit it again.
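The bookkeeping described above could look roughly like the sketch below. It is not the code from the repository: the class name `VisitLog` and its `record` method are hypothetical, a `LinkedHashSet` stands in for the visited-pages linked list, and only `stopAfter` and `output.txt` come from the post.

```java
import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the repository's CrawlSite class.
class VisitLog {
    // Remembers every recorded URL in insertion order.
    private final Set<String> visited = new LinkedHashSet<>();
    private final PrintWriter out;   // e.g. new PrintWriter("output.txt")
    private final int stopAfter;     // maximum number of links to record

    VisitLog(PrintWriter out, int stopAfter) {
        this.out = out;
        this.stopAfter = stopAfter;
    }

    /**
     * Records a link once. Returns false if the limit has been reached
     * or the link was already recorded, true otherwise.
     */
    boolean record(String url) {
        if (visited.size() >= stopAfter || !visited.add(url)) {
            return false;
        }
        out.println(url);   // one link per line
        out.flush();
        return true;
    }

    int count() {
        return visited.size();
    }
}
```

Passing `new PrintWriter("output.txt")` as the writer would produce the file output described above. A set is used here because its "already visited?" check does not need to scan the whole collection the way a linked list does.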

All of the elements with the "a" tag are grabbed, and the link is extracted from each one's "href" attribute. The method then goes through each of the links. Once it has made sure that a link goes to another page, the link is added to the visited pages linked list (for future use). If that page hasn't been visited yet, the method calls itself recursively to begin gathering the links from within the new page.
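The recursive traversal can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's storeLinksFromPage(): the class `CrawlSketch`, the `linkSource` function, and the same-host check are all hypothetical. The `linkSource` function stands in for the page fetch so the sketch runs without network access. With JSoup it would presumably wrap something like `Jsoup.connect(url).get().select("a[href]")` and read each element's `abs:href` attribute.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch for illustration; not the repository's CrawlSite class.
class CrawlSketch {
    private final Set<String> visited = new LinkedHashSet<>();
    private final Function<String, List<String>> linkSource; // stand-in for the JSoup fetch
    private final String host;    // only links on this host are followed
    private final int stopAfter;  // maximum number of pages to visit

    CrawlSketch(String startUrl, int stopAfter, Function<String, List<String>> linkSource) {
        this.host = URI.create(startUrl).getHost();
        this.stopAfter = stopAfter;
        this.linkSource = linkSource;
    }

    /** Visits url, then recurses into every unvisited same-host link found on it. */
    void crawl(String url) {
        if (visited.size() >= stopAfter || !visited.add(url)) {
            return; // limit reached or page already seen
        }
        for (String link : linkSource.apply(url)) {
            if (isSameSite(link)) {
                crawl(link);
            }
        }
    }

    private boolean isSameSite(String link) {
        try {
            return host != null && host.equalsIgnoreCase(URI.create(link).getHost());
        } catch (IllegalArgumentException e) {
            return false; // malformed href, skip it
        }
    }

    List<String> visitedPages() {
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" (assumed data) in place of real pages.
        Map<String, List<String>> site = Map.of(
            "https://example.com/", List.of("https://example.com/a", "https://other.org/x"),
            "https://example.com/a", List.of("https://example.com/", "https://example.com/b"),
            "https://example.com/b", List.of());
        CrawlSketch crawler = new CrawlSketch("https://example.com/", 500,
            url -> site.getOrDefault(url, List.of()));
        crawler.crawl("https://example.com/");
        System.out.println(crawler.visitedPages());
    }
}
```

Because the visited check happens at the top of `crawl`, pages that link back to each other do not cause infinite recursion. On a very large site the recursion depth can still grow, which is one more reason to keep a limit such as stopAfter in place.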

