Open
Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)
Description
I'm trying to scrape a site that requires geotargeting to Colombian IP addresses. With vanilla `requests` and no extra headers it returns a large response, but with Crawlee it always returns a 403 error. Here is the code. I left `proxy_url` blank so as not to expose my credentials; you would need to supply a proxy URL there that targets Colombian IPs.
```python
import asyncio
from datetime import timedelta

import requests

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.router import Router
from crawlee.sessions import SessionPool


async def main():
    # Define links.
    target_url = "https://www.exito.com/"  # target site
    proxy_url = ""  # left blank to not expose my credentials

    # First do a vanilla request.
    r = requests.get(target_url, timeout=30, proxies={"http": proxy_url, "https": proxy_url})
    print("Content length with vanilla requests:", len(r.text))

    # Define the router.
    router = Router[BeautifulSoupCrawlingContext]()

    @router.handler("MAIN")
    async def main_handler(context: BeautifulSoupCrawlingContext) -> None:
        print("inside handler")
        response = await context.http_response.read()
        print(response[0:50])

    # Then try the same request through Crawlee with the same proxy.
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        concurrency_settings=ConcurrencySettings(desired_concurrency=1, max_concurrency=1),
        max_request_retries=15,
        session_pool=SessionPool(
            max_pool_size=1,
            create_session_settings={
                'max_usage_count': 999999999,
                'max_age': timedelta(hours=999999),
                'max_error_score': 100000,
            },
        ),
        proxy_configuration=ProxyConfiguration(proxy_urls=[proxy_url]),
    )

    # Run it.
    await crawler.run([Request.from_url(target_url, label='MAIN')])


if __name__ == '__main__':
    asyncio.run(main())
```
I understand this may be tricky to reproduce, so I will also post an image of what it prints when run. You can see that the vanilla request succeeds; this is what happens every time I run it.
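Since the same proxy works with vanilla `requests`, one possible explanation is that the site is keying its 403 off the default HTTP client's headers or fingerprint rather than the IP. A minimal sketch of one thing to try, assuming Crawlee's Python API (`Request.from_url` accepting a `headers` mapping and `crawler.router.default_handler`); the header values are hypothetical browser-like examples, and `proxy_url` is left blank as in the report:

```python
import asyncio

# Headers a Colombian desktop browser might plausibly send (hypothetical
# example values, not taken from the original report).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "es-CO,es;q=0.9,en;q=0.8",
}


async def main() -> None:
    # Crawlee imports are kept inside main() so the header dict above can be
    # inspected without crawlee installed.
    from crawlee import Request
    from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
    from crawlee.proxy_configuration import ProxyConfiguration

    proxy_url = ""  # same placeholder as in the report

    crawler = BeautifulSoupCrawler(
        proxy_configuration=ProxyConfiguration(proxy_urls=[proxy_url]),
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        body = await context.http_response.read()
        print(context.request.url, len(body))

    # Attach the browser-like headers to the request itself, so Crawlee's
    # HTTP client sends them instead of (only) its defaults.
    await crawler.run([
        Request.from_url("https://www.exito.com/", headers=BROWSER_HEADERS),
    ])


if __name__ == "__main__":
    asyncio.run(main())
```

If the headers alone don't help, the difference may be at the TLS level, in which case swapping Crawlee's HTTP client for an impersonating one would be the next thing to investigate.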