Beginning
When you try to implement a machine learning model or predict future stock prices, have you ever felt you didn't have enough data for your experiment? Let's learn how to gather the information you want from the internet by scraping.
In this article, we will get some images from a Japanese website. If you use images obtained by scraping for your business or website, your actions might be illegal. You should be careful when using web scraping, and you must take full responsibility for your actions.
Get an image from a page
Environment
・Python 3.6.0
Steps
- Access a URL with Python
- Get the HTML of the page
- Extract the src attributes from the HTML
- Save the image
※You should not access the same page many times in a row; otherwise, you will put excessive load on the server.
1. Access a URL with Python
We use urllib.request, which is a standard library module for accessing URLs.
import urllib.request
html = urllib.request.urlopen('https://beauty.hotpepper.jp/slnH000218616/style/L002266500.html?cstt=1')
print(html)
This code only accesses the URL; it does not show you the HTML. It just returns a response object that people cannot read directly.
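For reference, the response object has a read() method that returns the raw bytes of the page. Here is a minimal sketch of decoding them into a string you can actually read (assuming the page is UTF-8 encoded):

import urllib.request

response = urllib.request.urlopen('https://beauty.hotpepper.jp/slnH000218616/style/L002266500.html?cstt=1')
raw_bytes = response.read()             # the body of the response, as bytes
html_text = raw_bytes.decode('utf-8')   # assumption: the page is UTF-8 encoded
print(html_text[:500])                  # show only the first 500 characters

In the next step, we let bs4 parse this HTML for us instead of working with the raw string.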
2. Get the HTML of the page
bs4 (Beautiful Soup) is a library that parses the HTML into an object you can read and search. If you don't have bs4, run pip install beautifulsoup4.
import urllib.request
import bs4
html = urllib.request.urlopen('https://beauty.hotpepper.jp/slnH000218616/style/L002266500.html?cstt=1')
soup = bs4.BeautifulSoup(html, 'html.parser')
print(soup)
You can get the HTML with the code above. bs4 is currently more popular than pyQuery.
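As a quick sanity check that the parse worked, you can inspect parts of the parsed tree. This small sketch uses standard bs4 attributes:

print(soup.title)               # the page's <title> tag
print(soup.prettify()[:500])    # re-indented HTML, first 500 characters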
3. Extract the src attributes from the HTML
You need to understand the structure of the page's HTML to know which elements to extract.
photos = soup.select("div.pr img.bdImgGray")
paths = list(map(lambda path: path.get('src'), photos))
print(paths)
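The string passed to select is a CSS selector: it matches img tags with the class bdImgGray inside div tags with the class pr. This selector is specific to this site's markup and may break if the site changes. As a side note, the same extraction can be written as a list comprehension, which is often considered more idiomatic than map with a lambda:

paths = [img.get('src') for img in photos]  # equivalent to the map/lambda version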
If you are interested in other ways to extract data from HTML, you can read this article, although it is written in Japanese.
4. Save the image
You can change the file name to whatever you want.
import urllib.request
fileName = 'bijo.jpg'
urllib.request.urlretrieve(paths[0], fileName)
You can get the picture of a girl.
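As a side note, the Python documentation describes urlretrieve as a legacy interface that might become deprecated, so here is a sketch of the same download using urlopen and a plain file write (reusing paths and fileName from above):

# Fetch the image bytes and write them to a local file
with urllib.request.urlopen(paths[0]) as response:
    with open(fileName, 'wb') as f:
        f.write(response.read())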
The code so far
Here is the code so far.
import urllib.request
import bs4

# Fetch the page and parse it
html = urllib.request.urlopen('https://beauty.hotpepper.jp/slnH000218616/style/L002266500.html?cstt=1')
soup = bs4.BeautifulSoup(html, 'html.parser')

# Collect the src attribute of every matching image
photos = soup.select("div.pr img.bdImgGray")
paths = list(map(lambda path: path.get('src'), photos))

# Download the first image
fileName = 'bijo.jpg'
urllib.request.urlretrieve(paths[0], fileName)
Access multiple URLs and get an image from each page automatically
Steps
- Check the structure of the website you want to scrape
- Get the HTML
- Save the URL list from the HTML to a text file
- Loop step 3 over multiple pages
- Extract the info you want from the text file you created in step 4
1. Check the structure of the website you want to scrape
Suppose you want to get the images under this category:
https://beauty.hotpepper.jp/catalog/ladys/lenHL03/
If you browse a few pages, you might notice a pattern in the URL structure:
https://beauty.hotpepper.jp/CSP/bt/hairCatalogSearch/ladys/condtion/?lengthCd=HL03&pn=2
・lengthCd=HL03 is the category of the hair style
・pn=2 is the page number (pagination)
・pn=100 is the last page
So, if you change the pn=n part of the URL, you can get the images on each page.
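Based on that pattern, here is a small sketch of how you could generate the URL for every page (the query parameters are taken from the URL above):

# Template for the paginated catalog URL; {} is filled with the page number
base = 'https://beauty.hotpepper.jp/CSP/bt/hairCatalogSearch/ladys/condtion/?lengthCd=HL03&pn={}'
page_urls = [base.format(n) for n in range(1, 101)]  # pages 1 to 100
print(page_urls[0])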
2. Get the HTML
You can get the HTML the same way we learned above.
import urllib.request
import bs4
html = urllib.request.urlopen('https://beauty.hotpepper.jp/CSP/bt/hairCatalogSearch/ladys/condtion/?lengthCd=HL03&pn=1')
soup = bs4.BeautifulSoup(html, 'html.parser')
print(soup)
3. Save the URL list from the HTML to a text file
Save each URL into a text file named hotpepper_page_urls.txt.
from itertools import chain

base_url = 'https://beauty.hotpepper.jp'  # used to complete relative paths
columns = soup.find_all('li', class_='dibBL')
atags = map(lambda column: column.find_all('a', class_='pr'), columns)
with open('hotpepper_page_urls.txt', 'w') as f:
    for atag in chain.from_iterable(atags):
        path = atag.get('href')
        if not path.startswith('https'):  # Relative path
            path = '{}{}'.format(base_url, path)
        f.write('{}\n'.format(path))
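As an alternative to the string concatenation above, the standard library's urllib.parse.urljoin resolves relative paths and leaves absolute URLs untouched, so the startswith check could be replaced with something like this sketch:

from urllib.parse import urljoin

path = urljoin(base_url, atag.get('href'))  # handles both relative and absolute hrefs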
4. Loop step 3 over multiple pages
I recommend using the sleep function to reduce the load on the server.
import bs4
import urllib.request
from itertools import chain
import time
import random

base_url = 'https://beauty.hotpepper.jp'
for page in range(1, 4):
    html = urllib.request.urlopen('https://beauty.hotpepper.jp/CSP/bt/hairCatalogSearch/ladys/condtion/?lengthCd=HL03&pn='+str(page))
    soup = bs4.BeautifulSoup(html, 'html.parser')
    columns = soup.find_all('li', class_='dibBL')
    atags = map(lambda column: column.find_all('a', class_='pr'), columns)
    # Open in append mode ('a'); 'w' would overwrite the previous page's URLs
    with open('hotpepper_page_urls.txt', 'a') as f:
        for atag in chain.from_iterable(atags):
            path = atag.get('href')
            if not path.startswith('https'):  # Relative path
                path = '{}{}'.format(base_url, path)
            f.write('{}\n'.format(path))
    time.sleep(1 + random.randint(0, 2))  # to reduce server load
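Because the file is opened in append mode, running this loop more than once will accumulate duplicate URLs. Here is a small sketch for reading the list back with duplicates removed while preserving the original order:

# dict.fromkeys keeps the first occurrence of each URL, in order
with open('hotpepper_page_urls.txt') as f:
    urls = list(dict.fromkeys(line.strip() for line in f))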
5. Extract the info you want from the text file you created in step 4
You should change dirpath, which is the directory where the pictures are saved.
# This continues from step 4, so urllib.request, bs4, time, and random are already imported
with open('hotpepper_page_urls.txt') as f:
    dirpath = '/Users/{name}/Desktop'
    num = 1
    for url in f:
        html = urllib.request.urlopen(url.strip())
        soup = bs4.BeautifulSoup(html, 'html.parser')
        photos = soup.select("div.pr img.bdImgGray")
        paths = map(lambda path: path.get('src'), photos)
        for path in paths:
            filepath = '{}/{}'.format(dirpath, str(num)+'.jpg')
            urllib.request.urlretrieve(path, filepath)
            num = num + 1
        time.sleep(1 + random.randint(0, 2))
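In practice, some image URLs fail to download (for example, removed images that return an HTTP error), and a single failure would stop the whole loop. Here is a sketch of the inner loop with error handling that skips bad URLs:

import urllib.error

for path in paths:
    filepath = '{}/{}'.format(dirpath, str(num)+'.jpg')
    try:
        urllib.request.urlretrieve(path, filepath)
        num = num + 1
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print('skipped {}: {}'.format(path, e))  # log the failure and move on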
Finally
You can get a bunch of images of beautiful women like this.
Here is the complete code.
# -*- coding:utf-8 -*-
import os
import bs4
import time
import random
import urllib.request
from itertools import chain

base_url = 'https://beauty.hotpepper.jp'

def fetch_page_urls(type_num, page):
    page_path = '/CSP/bt/hairCatalogSearch/ladys/condtion/?lengthCd=HL0'+str(type_num)+'&pn='+str(page)
    html = urllib.request.urlopen('{}{}'.format(base_url, page_path))
    soup = bs4.BeautifulSoup(html, 'html.parser')
    columns = soup.find_all('li', class_='dibBL')
    atags = map(lambda column: column.find_all('a', class_='pr'), columns)
    with open('hotpepper_page_urls_type'+str(type_num)+'.txt', 'w') as f:
        for atag in chain.from_iterable(atags):
            path = atag.get('href')
            if not path.startswith('https'):  # Relative path
                path = '{}{}'.format(base_url, path)
            if path[-1] == '/':  # Normalize
                path = path[:-1]
            f.write('{}\n'.format(path))

def fetch_photos(type_num, page):
    with open('hotpepper_page_urls_type'+str(type_num)+'.txt') as f:
        # Directory for saving the images
        dirpath = 'photos/hotpepper/lenHL0'+str(type_num)+'/'+str(page)
        # Skip this page if the last image already exists
        if os.path.isfile('{}/{}'.format(dirpath, '40.jpg')):
            print('already got an image')
            return False
        # Create the directory if it does not exist
        if not os.path.isdir(dirpath):
            os.makedirs(dirpath)
        num = 1
        # Get the URL for each image
        for url in f:
            html = urllib.request.urlopen(url.strip())
            soup = bs4.BeautifulSoup(html, 'html.parser')
            photos = soup.select("div.pr img.bdImgGray")
            # photos = soup.find_all('div', class_='fl')
            paths = map(lambda path: path.get('src'), photos)
            # Save each image to the directory
            for path in paths:
                filepath = '{}/{}'.format(dirpath, str(num)+'.jpg')
                # Download the image file
                urllib.request.urlretrieve(path, filepath)
                num = num + 1
            # Add random waiting time (1 - 3 sec)
            time.sleep(1 + random.randint(0, 2))

if __name__ == '__main__':
    for type_num in range(5, 6):
        for page in range(1, 101):
            print('type='+str(type_num)+'&page='+str(page))
            fetch_page_urls(type_num, page)
            fetch_photos(type_num, page)
Note
Some websites don't allow web scraping, so you should check whether the site permits it before you start. Also, if you access a website too many times, the site's owner might ban you from accessing it.
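One common way to check is the site's robots.txt file, which Python's standard library can parse. Here is a minimal sketch (the path being checked is just the category URL from this article):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://beauty.hotpepper.jp/robots.txt')
rp.read()

# True if the rules allow a generic crawler ('*') to fetch this path
print(rp.can_fetch('*', 'https://beauty.hotpepper.jp/catalog/ladys/lenHL03/'))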