Web scraping Crawl only articles/content

I want a crawler to be able to identify which pages on, for example, a news site, are actual content (i.e. articles), as opposed to About, Contact, category listings, etc. I've found no elegant way about this so far, as the criteria for content seem to vary by site (no common tags/layouts/protocols, etc.). Can anyone direct me either to libraries or methods that can identify with some level of certainty whether a website is a piece of content? It's perfectly acceptable to make this distinctio

Web scraping iMacro - Setting Variable + SaveAs CSV

I am looking for help with 2 parts of my iMacro script... Part 1 - Variable I am clicking on the following line of a page in order to access the page I need to extract from. 1st Link TAG POS=**8** TYPE=A FORM=NAME:xxyy ATTR=HREF:https://aaa.aaaa.com/en/administration/xxxx.jsp?reqID=h* 2nd Link TAG POS=**9** TYPE=A FORM=NAME:xxyy ATTR=HREF:https://aaa.aaaa.com/en/administration/xxxx.jsp?reqID=h* The tag pos is the variable, how can I get this so that when running on loop, the macro will

Web scraping How to get rid of bad scrapers by CAPTCHA in Laravel?

If someone wants to load my page on their server, I want to block it, but I cannot figure out a way. For example, someone can load a page and get its components with file_get_contents($url). This is a very basic example. Let me give you an example too. Open a new PHP file on your local server and try this: $file = file_get_contents('http://www.sahibinden.com/ilan/alisveris-bilgisayar-notebook-dizustu-hp-compaq-cq61-3gb-ram-320-hdd-180232745/detay'); echo $file; There are many libraries on

Web scraping Finding related keywords

I want to find some data sources from where I can scrape many keywords related to a specified technology. Suppose, I want to search for java, I want to get technology keywords related to java. e.g. spring, maven, hibernate etc. Can anybody help me with that?

Web scraping python scrapy shell exception: address “'https:” not found: [Errno 11004] getaddrinfo failed

I got the error below: twisted.internet.error.DNSLookupError: DNS lookup failed: address "'https:" not found: [Errno 11004] getaddrinfo failed. This is my Scrapy version: C:\>scrapy version -v Scrapy : 1.1.0 lxml : libxml2 : 2.9.0 Twisted : 16.2.0 Python : 2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)] pyOpenSSL : 16.0.0 (OpenSSL 1.0.2h 3 May 2016) Platform : Windo

Web scraping Delay between requests in Apify

Apify's legacy Crawler had a randomWaitBetweenRequests option: This option forces the crawler to ensure a minimum time interval between opening two web pages, in order to prevent it from overloading the target server. Do Apify Actors have a similar setting? If so, how does it impact the Actor Units computation?

Web scraping YQL "The current table has been blocked"

I'm trying to query my self-written YQL-table. If I run the table from the YQL Console, everything works fine. But if I call the table by URL via browser or application, the following error appears: The current table 'yahoo.finance.quant' has been blocked. It exceeded the allotted quotas of either time or instructions The documentation says, 1.000 queries an hour are allowed. I definitely didn't exceed that limit. Does anyone have an idea how to resolve that? And if not, is there any good alt

Web scraping Scraping invisible data from a website

I want to parse data with jsoup from a website whose data is in tables inside a "div" tag. When I inspect the table I can see the data in the Inspect Element window, but when I view the page source the data is not present in the div tag. (They have used a class called "invisible" for the entire div tag.) How can I get that invisible data from the div tag?

Web scraping Grabbed data from a given URL and put it into a file using scrapy

I am trying to deeply scrape a given website and grab text from all of its pages. I am using Scrapy; here is how I am running the spider: scrapy crawl stack_crawler -o items.json The items.json file comes out empty. Here is the spider code: # -*- coding: utf-8 -*- import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule #from tutorial.items import TutorialItem from tutorial.items import DmozItem class StackCrawlerSpider(CrawlSpider):

Web scraping Can Watir instruct the browser to skip loading images entirely?

I'm using watir to scrape a slow Web site for simple pieces of textual data. I'd like to save my time and the server's bandwidth by entirely skipping the (multiple, slow) images each page needlessly sends my Watir-controlled browser. I realize it's up to the browser and not Watir, but I wonder if there's a convenient way to get Watir to reconfigure the browser to not load images that session. I could not readily find an answer googling.

Web scraping How to extract specific text from a webpage

I am interested in finding patterns of constellations. I am using the 'Sky map' Android app for visual inspection; now I want to build an app to find similar constellation structures. A sub-problem of that is to find the coordinates for specific celestial objects. Example: How can I obtain the coordinates of the 'Moon' at a given time, date and location? https://theskylive.com/planetarium provides this information on their webpage in the following manner. Object: Moon [info|live][less] Right Asc: 04h 15

Web scraping End chain of deferred computation with normal value

I want the function below to take a URL and return a soup node using lambdasoup; in other words, I want the type to be: val do_get : string -> soup node = <fun>. With the bind operator (>>=) I can wait for a computation to finish, but this always returns another deferred computation. I'd like to be able to end the chain by turning a deferred computation into a normal value. What can I do? let do_get url = let uri = Uri.of_string url in Cohttp_async.Client.get uri >>= f

Web scraping Python web scraping Google

How can I handle the information card when web scraping Google? such as this result and here is the html tags of information card <div class="g obcontainer mod NFQFxe" data-md="279" data-ved="2ahUKEwik2Y2vgaDtAhUFK6YKHdezAVQQkCkwAHoECAQQAA"> <!--m--> <span data-hveid="CAQQAQ"> </span> <div jscontroller="OClNZ"> <g-card> <div>

Web scraping How to fix a 429 (Too Many Requests) error

I am getting 429 status code for even a single get request. {'Date': 'Wed, 28 Apr 2021 13:03:27 GMT', 'Server': 'Apache', 'Upgrade': 'h2', 'Connection': 'Upgrade, Keep-Alive', 'X-Powered-By': 'PHP/7.4.13', 'Retry-After': '36000', 'Content-Security-Policy': "frame-ancestors 'self' *.gsmarena.com;", 'Content-Length': '113', 'Keep-Alive': 'timeout=15, max=100', 'Content-Type': 'text/html;charset=UTF-8'} I got the above with r.headers. It says retry after 36000. Tried using chrome and moz
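
A hedged sketch of how the Retry-After header quoted above could be read before retrying. It uses only the Python standard library; the function name retry_after_seconds is hypothetical, and the second form it handles (an HTTP date instead of a number of seconds) is allowed by the HTTP specification but does not appear in the question.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None):
    """Return how many seconds the server asked us to wait, or None.

    `headers` is any mapping with a .get() method, such as r.headers.
    """
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "36000"
        return int(value)
    try:
        # HTTP-date form, e.g. "Wed, 28 Apr 2021 23:03:27 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller decide
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

With the headers shown in the question, this returns 36000, which is ten hours: the server is asking the client to stay away for that long, whichever browser or library sends the request.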

Web scraping How to iterate to scrape each item no matter the position

I'm using Scrapy and I'm trying to scrape technical descriptions from products, but I can't find any tutorial for what I'm looking for. I'm using this website: Air Conditioner 1 For example, I need to extract the model of that product: Modelo ---> KCIN32HA3AN . It's in the 5th position: (//span[@class='gb-tech-spec-module-list-description'])[5] But if I go to this other product: Air Conditioner 2 The model is: Modelo ---> ALS35-WCCR And it's in the 6th position. And I only get this 60 m3 since is t

Web scraping How to use Xpath in PHP Simple HTML DOM Parser

I am learning scraping using the PHP Simple HTML DOM Parser and XPath. According to the changelog given here, http://sourceforge.net/news/?group_id=218559, the PHP Simple HTML DOM Parser supports XPath generated from Firebug, but I am not able to figure out how to use it. Can anyone show me an example?

Web scraping Historical weather data from NOAA

I am working on a data mining project and I would like to gather historical weather data. I am able to get historical data through the web interface that they provide at http://www.ncdc.noaa.gov/cdo-web/search. But I would like to access this data programmatically through an API. From what I have been reading on StackOverflow this data is supposed to be public domain, but the only place I have been able to find it is on non-free services like Wunderground. How can I access this data for free?

Web scraping Webscraping with Julia?

We are having a difficult time to find a good (actually, any) web scraping library or modules for the Julia language. What we need is to have some kind of facility to make it easier to parse or find html elements and strings. Maybe even some kind of crawling on demand. Update: I'm looking for something like BeautifulSoup or pyquery (both are for Python).

Web scraping Web scraping client abstraction - compatibility with future web API

I'm creating a client for a website, which will scrape this website for data. I would like to design the API of this client so that it could be used without modifications if a web API were created in the future. Currently the website does not provide any web API. It does use AJAX, so parts of its functionality can be easily reused within the client. The biggest issue I'm dealing with now is that some data is not identified by integers. Instead a string is used, which desc

Web scraping how to use the example of scrapy-redis

I have read the example of scrapy-redis but still don't quite understand how to use it. I have run the spider named dmoz and it works well. But when I start another spider named mycrawler_redis, it gets nothing. Besides, I'm quite confused about how the request queue is set. I didn't find any piece of code in the example project which illustrates the request queue setting. And if the spiders on different machines want to share the same request queue, how can I get it done? It seems that I sh

Web scraping Power BI (Power Query) Web request results in "CR must be followed by LF" Error

When you use the Web.Page(Web.Contents('url')) function to read a table from a web page, some sites will cause an error due to inconsistent linefeeds. DataSource.Error: The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF There doesn't appear to be any option you can pass to the Web functions to ignore those errors. This method works for a short while, but doesn't survive a save/refresh: let BufferedBinary = Binary.Buffer(Web.Contents("htt

Web scraping Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifiers (DOIs) found on the internet. By manually searching the CommonCrawl Index Server I have obtained some promising results. However, I wish to develop a programmatic solution. This may result in my process only needing to read the index files and not the underlying WARC data files. The manual steps I wish to automate are these: 1). for each CommonCrawl currently available index collection(s): 2). I search ... "Search

Web scraping Can't scrape all Instagram comments due to 'hidden' comments

I'm stuck with a problem. I wrote a script which should scrape Instagram comments. Everything works nicely, except that the script scrapes only visible comments, while the majority of comments are hidden behind a button ("show more" in English). So, how can I modify my script in order to scrape all comments? def get_comment_inst(post_link_list): index = 0 comment_frame = pd.DataFrame(columns=['text','user','time', "node_text",'node_name','no

Web scraping Scraping location using beautiful soup

I am super new to beautifulsoup, I have done tons of online videos and now I am adventuring to my first project. Anyway, my goal is to scrape the location of https://www.mastermindtoys.com/apps/store/find-a-store. All the locations are under one class "clearfix large-container". I am wondering how do I pull out the information of the address from all "address-sec". "address-sec" being the class that is under "clearfix large-container". If anyone has a vide

Web scraping Screen scraping in server side

I am new to screen scraping. When I use a proxy server and track the HTTP transactions, my POST data is revealed to me. So my questions are: 1) Will it get stored on the server side, or will it be revealed only on the client side? 2) Do we have an option of encrypting the POST data in screen scraping? 3) Is it advisable to use screen scraping for banking applications? I am using a screen scraper tool which I downloaded from http://www.screen-scraper.com/dow

Web scraping How to tag your pages to find your scraped content?

With several bots scraping pages on our site, I wanted to know how I could tag the content, to later search for it - find out where the scraped content ended up? I set a unique HTML comment on the pages, but that probably won't get scraped. All the links on our pages are JavaScript links, that route through a JS function - that may help the rest of our content from getting scraped. Is there a way to tag the links on the site for this purpose?

Web scraping A problem with detecting data

The task is to download the table with the names of bookmakers and odds (here). I cannot find the part of the source code which corresponds to these data. I tried to use a Chrome extension named SelectorGadget, unsuccessfully. Similarly, when I want to open matches (matches) I meet the same problem. Thank you for any advice.

Web scraping Scraping a table

I am trying to extract the properties of a house and the corresponding values. I am interested in getting {key:{Property type: Commercial property, Purchase price: CHF 475,000, etc. } I was able to extract the values one by one but not as a loop that is updating my dictionary. <dl class="row xsmall-up-2 medium-up-3 large-up-4 attributes-grid"> <div class="column"> <dt class="label-text"> Property type </dt> <dd> Commercial p
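
A minimal sketch of the loop being asked for, assuming each property is a dt label followed by its dd value, as in the fragment above. It swaps in the standard library's html.parser rather than whatever parser the question uses; DefinitionListParser, parse_attributes and SAMPLE are hypothetical names, and SAMPLE is filled in from the two values named in the question, not copied from the real page.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect <dt>/<dd> pairs into a dict, in document order."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._current_tag = None  # "dt" or "dd" while inside one
        self._buffer = []
        self._pending_key = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        # Collapse the whitespace that surrounds the text in the markup
        text = " ".join("".join(self._buffer).split())
        if tag == "dt":
            self._pending_key = text
        elif self._pending_key is not None:
            self.pairs[self._pending_key] = text
            self._pending_key = None
        self._current_tag = None


def parse_attributes(html):
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Assumed sample markup, modelled on the fragment in the question
SAMPLE = """
<dl class="row xsmall-up-2 medium-up-3 large-up-4 attributes-grid">
  <div class="column">
    <dt class="label-text"> Property type </dt>
    <dd> Commercial property </dd>
  </div>
  <div class="column">
    <dt class="label-text"> Purchase price </dt>
    <dd> CHF 475,000 </dd>
  </div>
</dl>
"""

print(parse_attributes(SAMPLE))
```

Pairing each dd with the dt that precedes it means the dictionary is built in one pass, without indexing the values one by one.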

Web scraping Scrapy: export parsed data into multiple files

I'd like to parse pages and then export certain items to one CSV file and others to another file. Using feed exports here I managed to do it for one file as follows: settings FEED_EXPORT_FIELDS = ( 'url', 'group_url', 'name', 'streetAddress', 'city', 'addressCountry', ) FEED_FORMAT = 'csv' FEED_URI = 'output/%(name)s_%(time)s.csv' But as I said, the above exports to only one CSV file. I'd like to be able to scrape other fields to another file: FEED_EXPORT_FIELDS = (

Web scraping How to create JOBDIR settings in Scrapy Spider dynamically?

I want to create the JOBDIR setting from the spider's __init__, or dynamically when I call that spider. I want to create a different JOBDIR for different spiders, like FEED_URI in the example below: class QtsSpider(scrapy.Spider): name = 'qts' custom_settings = { 'FEED_URI': 'data_files/' + '%(site_name)s.csv', 'FEED_FORMAT': "csv", #'JOBDIR': 'resume/' + '%(site_name2)s' } allowed_domains = ['quotes.toscrape.com'] start_urls = ['http://quotes.toscrape.com'

Web scraping Retrieve openid bearer token using headless browser setup

Using OkHttp3 I was happily scraping a website for quite some time. However, some components of the website have been upgraded and are now using an additional OpenID bearer authentication. I am 99.9% positive my requests are failing due to this bearer token, because when I check with Chrome dev tools, I see the bearer token popping up only for these parts. Moreover, a couple of requests are going to links that end with ".well-known/openid-configuration". In addition, when I hardcode

Web scraping Question about web scraping (as a beginner)

I have a hobby of reading news. The problem is, there are quite a lot of websites I often go to, and this gives me an idea: building my own database of news. The idea is similar to newspaper clippings. For example, if I read something interesting about German economic news, I can use this software to save all the text and images from the said site (onto my computer), and I can add tags such as "Germany", "econ" so I can find it and read it later. I shared this id

Web scraping Easiest way to scrape webpages to save to .csv

There is a page I want to scrape, you can pass it variables in the URL and it generates specific content. All the content is in a giant HTML table. I am looking for a way to write a script that can go through 180 of these different pages, extract specific information from certain columns in the table, do some math, and then write them to a .csv file. That way I can do further analysis myself on the data. What is the easiest way to scrape webpages, parse HTML and then store the data to a .csv
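
One possible shape for such a script, sketched with the Python standard library only. The column names (name, price, qty), the computed total column, and the helper names TableRowParser and table_to_csv are all hypothetical; fetching the 180 pages is left out, so the sketch starts from HTML that has already been downloaded.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html, out):
    """Parse one table, add a computed column, write CSV rows to `out`."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    writer = csv.writer(out)
    writer.writerow(header + ["total"])
    for name, price, qty in body:
        # The "math" step: an assumed price * quantity column
        writer.writerow([name, price, qty, float(price) * int(qty)])


# Assumed sample table standing in for one downloaded page
SAMPLE = (
    "<table><tr><th>name</th><th>price</th><th>qty</th></tr>"
    "<tr><td>widget</td><td>2.5</td><td>4</td></tr></table>"
)

buffer = io.StringIO()
table_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To cover many pages, the same function could be called once per downloaded page with a file opened in append mode instead of the in-memory buffer.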

Web scraping Search bot detection

Is it possible to prevent a site from being scraped by any scrapers, but at the same time allow search engines to parse your content? Just checking the User-Agent is not the best option, because it's very easy to fake. JavaScript checks could be an option (Google executes JS), but a good parser can do that too. Any ideas?
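
One check that is harder to fake than the User-Agent is a reverse DNS lookup on the client IP followed by a forward lookup to confirm it. The sketch below assumes that approach; the domain suffixes in CRAWLER_DOMAINS are an assumption to verify against each search engine's own documentation, and both function names are hypothetical.

```python
import socket

# Assumption: hostname suffixes used by the crawler you want to allow
CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def hostname_is_crawler(hostname, domains=CRAWLER_DOMAINS):
    """True if the hostname ends with one of the allowed suffixes."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(domain) for domain in domains)


def is_verified_crawler(ip, domains=CRAWLER_DOMAINS):
    """Reverse-resolve the IP, check the domain, then forward-confirm.

    Needs working DNS. gethostbyname_ex only resolves IPv4 addresses.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not hostname_is_crawler(hostname, domains):
            return False
        # Forward lookup must map the hostname back to the same IP
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in addresses
```

The leading dot in each suffix matters: it keeps a hostname such as fakegooglebot.com from matching. Since DNS lookups are slow, results would normally be cached per IP rather than computed on every request.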

Web scraping Find and fill an input field with AutoHotKey

A challenge to all you AutoHotKey masters: Give us a function that will Find and Move the Cursor to an Input Field (E.g. LoginName) and, alternatively send input text. For the old, lazy hackers like myself just fiddling with AHK, it would look like this: FindFillField(*elementid*,*sendtext*,*alt-text*) Where elementid is the HTML id for the field, e.g. USERNAME, where sendtext is the text to fill and where alt-text could be additional, specific text to help identify the fields. Additional,

Web scraping Can't figure how phone number reveal works

I am pretty new to web scraping and recently I have been trying to automatically scrape phone numbers from pages like this. I am not supposed to use Selenium or headless browser libraries, and I am trying to find a way to actually request the phone number using, let's say, a web service or any other possible solution that could give me the phone number, hopefully directly, without having to go through the actual button press with Selenium. I totally understand that it may not even be possible to automatical

Web scraping How do I scrape a table from the provided website using CasperJS?

The final goal is to retrieve stock data in table form from provided broker website and save it to some text file. Here is the code, that I managed to compile so far by reading few tutorials: var casper = require("casper").create(); var url = 'https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=60&date=2016-12-19-21-59'; var terminate = function() { this.echo("Exiting ...").exit(); }; var processPage = function() { var rows = document.querySelectorAll('#

Web scraping How can I grab data from this web site?

There is a site here (http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=46741025610365786#) where each field of this table (marked by yellow squares) shows information about one specific day. What I need to do is read only the حجم row of each field (what I marked with red squares in the following photos; you should go to the tab I mentioned in the first photo to see the second photo): And write them (store them on my computer) in a text file like this: 6.832 M (14%) , 40.475 M (85%), 24

Web scraping Cannot make chromium to push button (python)

Basically, here is my code: driver = webdriver.Chrome() url='https://www.gpsies.com/map.do;jsessionid=9B6652B60485A9F1C92C333F683807D7.fe3?fileId=iovpsivunvmipazp' driver.get(url) driver.find_element_by_class_name('btn btn-default').click() On that website there is a "download" button, and here is the error I'm getting: Traceback (most recent call last): File "<input>", line 1, in <module> File "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\_pydev_bundle\pydev_umd.
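
The traceback is cut off, so the actual cause is not confirmed here. One likely candidate with this call: a class-name locator expects a single class, while 'btn btn-default' is two classes separated by a space. A hedged sketch of turning such a class attribute into a CSS selector follows; classes_to_css is a hypothetical helper, and the Selenium call in the comment is not executed by the block.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a class attribute like 'btn btn-default' into a CSS selector.

    Every class becomes a '.name' part, so the selector only matches
    elements that carry all of the listed classes.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    return tag + "".join("." + name for name in names)


selector = classes_to_css("btn btn-default")
print(selector)

# With Selenium 4 the selector would be used along these lines:
#   from selenium.webdriver.common.by import By
#   driver.find_element(By.CSS_SELECTOR, selector).click()
```

If several elements on the page share those two classes, the selector would need an extra qualifier (an id, a parent element, or an attribute) to pick out the download button.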

Web scraping HtmlAgilityPack with .NET Core 3.1: UTF-8, text/html' is not a supported encoding name

I'm using HtmlAgilityPack v1.11.21 and since upgrading to .NET Core 3.1, I started to receive the following error while trying to load up a web page via URL: 'UTF-8, text/html' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name') I found this post 'UTF8' is not a supported encoding name, but I'm not sure where or how I'm supposed to implement: System.Text.EncodingProvider

Web scraping How to change this formula to do cell referencing

I'm trying to pull some financial earnings data from the MarketWatch website. No matter how I try, I can't seem to use a cell reference (say cell B2) in this formula. Desperately need some help, thank you very much!! =IMPORTHTML("https://www.marketwatch.com/investing/stock/AIG/financials","table",1)

Web scraping Is there any way to scrape Google Search results without getting blocked by Captcha?

Say I wanted to scrape results from searching "hi google" (just an example). I'm using Puppeteer with Node.js to scrape. I use the following code: const puppeteer = require('puppeteer'); scrape = async function () { const browser = await puppeteer.launch({headless: false}); const page = await browser.newPage(); await page.goto("https://www.google.com/search?q=hi+google&rlz=1C1CHBF_enUS879US879&oq=hi+google&aqs=chrome..69i57j0l3j46j69i60l3.1667j0j7&sourc
