Creating a Scrapy Project

In this post we will create a project with Scrapy and then add a spider to it.

To learn what Scrapy is, how to install it, and how to use it, you can read this post.

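Creating the Project

Let's create a folder named scrapy_proje. After activating the virtual environment (venv), start a project inside the scrapy_proje folder with the command below. The project is named src because I want files such as settings.py to live in a src folder, and the trailing dot tells Scrapy to create the project in the current directory.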

scrapy_proje $ scrapy startproject src .

- Let's look at the project's file and folder structure:

$ tree -I venv .
.
├── requirements.txt
├── scrapy.cfg
└── src
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 8 files

- Running scrapy genspider OrumcekAdam TARGET_SITE adds a spider to the project. For example, scrapy genspider OrumcekAdam toscrape.com creates a spider named OrumcekAdam that targets toscrape.com.

$ scrapy genspider OrumcekAdam toscrape.com

- The command prints a message like this:

Created spider 'OrumcekAdam' using template 'basic' in module:
  src.spiders.OrumcekAdam

- The newly created spider file shows up in the tree:

$ tree -I venv .
.
├── requirements.txt
├── scrapy.cfg
└── src
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── OrumcekAdam.py # new
        ├── __init__.py
        └── __pycache__

4 directories, 9 files

- The contents of OrumcekAdam.py are as follows:

import scrapy


class OrumcekadamSpider(scrapy.Spider):
    name = "OrumcekAdam"
    # List of domains the spider is allowed to crawl.
    allowed_domains = ["toscrape.com"]
    # URLs we will pull data from.
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        pass

- Now let's edit this file as follows:

import scrapy


class OrumcekadamSpider(scrapy.Spider):
    name = 'OrumcekAdam'
    allowed_domains = ['toscrape.com']
    # Point start_urls at the 'quotes' subdomain.
    start_urls = ['http://quotes.toscrape.com/random']

    def parse(self, response):
        # The /random page shows a single quote, so one item is yielded.
        yield {
            'author_name': response.css('small.author::text').get(),
            'text': response.css('span.text::text').get(),
            # .get() returns only the first matching tag.
            'tags': response.css('a.tag::text').get(),
        }

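Ways to Run the Spider

Running scrapy crawl OrumcekAdam -o OrumcekAdam.csv writes the output in CSV format.
Running scrapy crawl OrumcekAdam -o OrumcekAdam.json writes the output in JSON format.
Scrapy infers the format from the file extension, so the -t csv and -t json options used with older versions (deprecated in recent Scrapy releases) are not needed.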
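The same export can also be configured once in the project settings instead of on the command line. The fragment below is a minimal sketch for src/settings.py, assuming a Scrapy version that supports the FEEDS setting. The file names are the ones used in the commands above; the encoding, indent, and overwrite options are choices made for this illustration, not something the project requires.

```python
# src/settings.py (fragment)
# Each key is an output path; each value describes how that feed is written.
FEEDS = {
    "OrumcekAdam.json": {
        "format": "json",
        "encoding": "utf8",  # write non-ASCII characters as-is instead of \uXXXX escapes
        "indent": 4,         # pretty-print the JSON output
        "overwrite": True,   # replace the file on each run instead of appending
    },
    "OrumcekAdam.csv": {
        "format": "csv",
    },
}
```

With this in place, scrapy crawl OrumcekAdam should write both files without any -o option.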

Sample Outputs

Contents of OrumcekAdam.csv

author_name,text,tags
Albert Einstein,"“If you can't explain it to a six year old, you don't understand it yourself.”",simplicity
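To check an exported CSV file from Python, the standard csv module is enough. The sketch below parses the sample row shown above from an in-memory string rather than from disk. Note that the quote text contains a comma, which is why that field is wrapped in double quotes in the file.

```python
import csv
import io

# Sample content copied from the OrumcekAdam.csv output above.
sample_csv = (
    "author_name,text,tags\n"
    'Albert Einstein,"“If you can\'t explain it to a six year old, '
    'you don\'t understand it yourself.”",simplicity\n'
)

# DictReader uses the header row as the keys of each record.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(row["author_name"], "|", row["tags"])
    print(row["text"])
```

To read the real file instead, replace the StringIO object with open("OrumcekAdam.csv", newline="", encoding="utf-8").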

Contents of OrumcekAdam.json

[
    {
        "author_name": "Friedrich Nietzsche",
        "text": "“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”",
        "tags": "friendship"
    }
]

Because the target site responds with a random quote, every run produces a different output.

Note: To run the spider you wrote independently of the project:

scrapy runspider OrumcekAdam.py -o FILE_NAME.json

Now let's change the target URL and see how to scrape multiple items.

import scrapy


class OrumcekadamSpider(scrapy.Spider):
    name = 'OrumcekAdam'
    allowed_domains = ['toscrape.com']
    # The target URL changed.
    # start_urls = ['http://quotes.toscrape.com/random']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Each quote on the page lives in its own div.quote element.
        for quote in response.css('div.quote'):
            m_dict = {
                'author_name': quote.css('small.author::text').get(),
                'text': quote.css('span.text::text').get(),
                # A quote can have several tags, so collect all of them.
                'tags': quote.css('a.tag::text').getall(),
            }
            yield m_dict

We removed the /random path from the start_urls address. If you visit the page, you will see that it displays 10 quotes.
On the target site each quote sits inside its own div, so a for loop can iterate over all of the quotes.

Writing the output to a JSON file gives a result like the following.

[
{"author_name": "Albert Einstein", "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"author_name": "J.K. Rowling", "text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "tags": ["abilities", "choices"]},
{"author_name": "Albert Einstein", "text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"author_name": "Jane Austen", "text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "tags": ["aliteracy", "books", "classic", "humor"]},
{"author_name": "Marilyn Monroe", "text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "tags": ["be-yourself", "inspirational"]},
{"author_name": "Albert Einstein", "text": "“Try not to become a man of success. Rather become a man of value.”", "tags": ["adulthood", "success", "value"]},
{"author_name": "André Gide", "text": "“It is better to be hated for what you are than to be loved for what you are not.”", "tags": ["life", "love"]},
{"author_name": "Thomas A. Edison", "text": "“I have not failed. I've just found 10,000 ways that won't work.”", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"author_name": "Eleanor Roosevelt", "text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "tags": ["misattributed-eleanor-roosevelt"]},
{"author_name": "Steve Martin", "text": "“A day without sunshine is like, you know, night.”", "tags": ["humor", "obvious", "simile"]}
]
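Once the items are in a JSON file, they can be loaded back with the standard json module for a quick sanity check. A minimal sketch, using three of the records above as an inline string instead of reading the file; the per-author count and the tag list are only illustrative checks.

```python
import json
from collections import Counter

# A few of the exported records shown above, inlined so the example needs no file.
exported = """
[
{"author_name": "Albert Einstein", "text": "“Try not to become a man of success. Rather become a man of value.”", "tags": ["adulthood", "success", "value"]},
{"author_name": "J.K. Rowling", "text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "tags": ["abilities", "choices"]},
{"author_name": "Steve Martin", "text": "“A day without sunshine is like, you know, night.”", "tags": ["humor", "obvious", "simile"]}
]
"""

quotes = json.loads(exported)

# Count quotes per author and collect every distinct tag.
per_author = Counter(q["author_name"] for q in quotes)
all_tags = sorted({tag for q in quotes for tag in q["tags"]})

print(per_author)
print(all_tags)
```

For the real export, json.load(open("OrumcekAdam.json", encoding="utf-8")) would take the place of json.loads(exported).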