
Proxies work with vanilla requests, but Crawlee gets a 403 error #1683

@DoctorEvil92

Description


I'm trying to scrape a site that needs geotargeting to Colombian IP addresses. With vanilla requests and no custom headers, it returns a full-sized response, but with Crawlee it always returns a 403 error. Here is the code. I left proxy_url blank so as not to expose my credentials; you would need to put in a proxy URL that targets Colombian IPs.

import asyncio
from datetime import timedelta

import requests

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.router import Router
from crawlee.sessions import SessionPool


async def main():
    # Define the target and proxy URLs.
    target_url = "https://www.exito.com/"  # target site
    proxy_url = ""  # left blank to not expose my credentials

    # First, fetch the page with vanilla requests through the same proxy.
    r = requests.get(target_url, timeout=30, proxies={"http": proxy_url, "https": proxy_url})
    print("Content length with vanilla requests:", len(r.text))


    # Define the router and a handler that prints the first bytes of the body.
    router = Router[BeautifulSoupCrawlingContext]()

    @router.handler("MAIN")
    async def main_handler(context: BeautifulSoupCrawlingContext) -> None:
        print("inside handler")
        response = await context.http_response.read()
        print(response[:50])
    # Then make the same request with Crawlee through the same proxy.
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        concurrency_settings=ConcurrencySettings(desired_concurrency=1, max_concurrency=1),
        max_request_retries=15,
        session_pool=SessionPool(
            max_pool_size=1,
            create_session_settings={
                "max_usage_count": 999999999,
                "max_age": timedelta(hours=999999),
                "max_error_score": 100000,
            },
        ),
        proxy_configuration=ProxyConfiguration(proxy_urls=[proxy_url]),
    )

    # Run the crawler on the labeled start request.
    await crawler.run([Request.from_url(target_url, label="MAIN")])

if __name__ == '__main__':
    asyncio.run(main())
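
Not part of my repro above, but one way to narrow this down might be to compare the default headers each client sends, since as far as I know Crawlee's default HTTP client is built on httpx and sites often reject non-browser header sets with a 403. A minimal diagnostic sketch, assuming httpbin.org is reachable:

import requests
import httpx

# httpbin.org/headers echoes the request headers back as JSON, so we can
# see exactly what each client sends by default.
ECHO_URL = "https://httpbin.org/headers"

print("requests sends:", requests.get(ECHO_URL, timeout=30).json()["headers"])
print("httpx sends:   ", httpx.get(ECHO_URL, timeout=30).json()["headers"])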

I understand this may be tricky to reproduce, so I will also post an image of what it prints when run. You can see that it works with vanilla requests; this is what happens every time I run it.

[Image: console output from a run, showing the vanilla request returning content]
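
In case it helps, a possible workaround direction I have not verified: passing browser-like headers into Crawlee via HttpxHttpClient. I'm assuming extra keyword arguments are forwarded to the underlying httpx.AsyncClient, and the header values (User-Agent, Accept-Language) are only illustrative:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import HttpxHttpClient

# Hedged sketch, not a confirmed fix. Assumption: extra keyword arguments
# to HttpxHttpClient are forwarded to the underlying httpx.AsyncClient;
# verify against your installed Crawlee version.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "es-CO,es;q=0.9",  # illustrative, matches the geotargeting
}


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        http_client=HttpxHttpClient(headers=BROWSER_HEADERS),
        # plus the same proxy_configuration / session_pool settings as above
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Print the URL and the first bytes of the body, as in my repro.
        print(context.request.url, "->", (await context.http_response.read())[:50])

    await crawler.run(["https://www.exito.com/"])


if __name__ == "__main__":
    asyncio.run(main())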
