jsoup : Send search query to Google

This example shows you how to use jsoup to send a search query to Google.

	Document doc = Jsoup
		.connect("https://www.google.com/search?q=mario")
		.userAgent("Mozilla/5.0")
		.timeout(5000).get();
Unusual traffic from your computer network
Don’t use this example to spam Google. If you send too many requests, Google responds with the message above; read Google's answer about this message for details.
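
The search URL in the snippet above hard-codes a one-word query. For queries that contain spaces or characters such as `&`, the value probably needs to be URL-encoded before it is appended. The following is a minimal sketch of that step using only the JDK; `SearchUrlBuilder` and `buildSearchUrl` are hypothetical names that are not part of jsoup, and the `URLEncoder.encode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlBuilder {

    // Hypothetical helper (not a jsoup API): builds the Google search URL
    // for a query, encoding characters that are not safe in a query string.
    public static String buildSearchUrl(String query) {
        // URLEncoder.encode(String, Charset) requires Java 10 or later
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "https://www.google.com/search?q=" + encoded;
    }

    public static void main(String[] args) {
        // prints https://www.google.com/search?q=mario
        System.out.println(buildSearchUrl("mario"));
        // prints https://www.google.com/search?q=super+mario+%26+luigi
        System.out.println(buildSearchUrl("super mario & luigi"));
    }
}
```

The resulting string can then be passed to `Jsoup.connect(...)` in place of the hard-coded URL.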

1. jsoup example

This example sends a “mario” search query to Google, parses the search results, and extracts the domain name from each result link.

FunnyCrawler.java
package com.mkyong;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FunnyCrawler {

  private static Pattern patternDomainName;
  private Matcher matcher;
  private static final String DOMAIN_NAME_PATTERN 
	= "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";

  static {
	patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
  }

  public static void main(String[] args) {
	FunnyCrawler obj = new FunnyCrawler();
	Set<String> result = obj.getDataFromGoogle("mario");
	for(String temp : result){
		System.out.println(temp);
	}
	// number of unique domain names found
	System.out.println(result.size());
  }

  // returns the first domain name found in the url, or "" if there is none
  public String getDomainName(String url){
	String domainName = "";
	matcher = patternDomainName.matcher(url);
	if (matcher.find()) {
		domainName = matcher.group(0).toLowerCase().trim();
	}
	return domainName;
  }

  private Set<String> getDataFromGoogle(String query) {
	Set<String> result = new HashSet<String>();	
	String request = "https://www.google.com/search?q=" + query + "&num=20";
	System.out.println("Sending request..." + request);
	try {
		// need http protocol, set this as a Google bot agent :)
		Document doc = Jsoup
			.connect(request)
			.userAgent(
			  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
			.timeout(5000).get();
		// get all links
		Elements links = doc.select("a[href]");
		for (Element link : links) {
			String temp = link.attr("href");		
			// result links are wrapped as /url?q=<target url>
			if(temp.startsWith("/url?q=")){
				//use regex to get domain name
				result.add(getDomainName(temp));
			}
		}
	} catch (IOException e) {
		e.printStackTrace();
	}
	return result;
  }
}

Output

Sending request...https://www.google.com/search?q=mario&num=20
www.imdb.com
www.mariobatali.com
www.freemario.org
www.mariogames.be
mario.wikia.com
stabyourself.net
webcache.googleusercontent.com
www.youtube.com
www.huffingtonpost.com
www.mariowiki.com
mario.lancashire.gov.uk
amirulhafiz.deviantart.com
www.mariohugo.com
mariofoods.com
mario.nintendo.com
www.mario2u.com
www.botta.ch
en.wikipedia.org
www.mariotestino.com
www.hubmario.com
www.mariolemieux.org
pouetpu.pbworks.com
23
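
The crawler above pulls the domain name out of each `/url?q=...` link with a regex. As an alternative, the wrapped target URL could be parsed with `java.net.URI` instead. The sketch below is not the method used in the example; `RedirectHostExtractor` and `extractHost` are hypothetical names, it assumes the target URL ends at the first `&` in the link, and `URLDecoder.decode(String, Charset)` assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class RedirectHostExtractor {

    private static final String PREFIX = "/url?q=";

    // Hypothetical helper: returns the lower-cased host of the target URL
    // wrapped in a "/url?q=..." href, or "" if it cannot be determined.
    public static String extractHost(String href) {
        if (href == null || !href.startsWith(PREFIX)) {
            return "";
        }
        // Assumption: the target URL runs up to the first '&' of the href
        String target = href.substring(PREFIX.length());
        int amp = target.indexOf('&');
        if (amp >= 0) {
            target = target.substring(0, amp);
        }
        try {
            // The target may be percent-encoded, so decode it before parsing
            String decoded = URLDecoder.decode(target, StandardCharsets.UTF_8);
            String host = new URI(decoded).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException | IllegalArgumentException e) {
            // Malformed URL or malformed percent-encoding
            return "";
        }
    }

    public static void main(String[] args) {
        // prints www.imdb.com
        System.out.println(extractHost("/url?q=http://www.imdb.com/title/tt0108255/&sa=U"));
        // prints en.wikipedia.org
        System.out.println(extractHost("/url?q=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMario&sa=U"));
    }
}
```

Compared with the regex, this approach only accepts links that contain a well-formed URL, so it may return an empty string where the regex would still find a match.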
