Scraping Images with Python and BeautifulSoup

Scraping images with Python and BeautifulSoup tutorial: using Scrapy, we can get images from the internet and feed them as input to PyTesseract. Scraping data from websites is not as complicated as it used to be. Scraped data serves many purposes, and many tools are available for the job; Scrapy together with PyTesseract is one of the best combinations to work with. PyTesseract returns the text content of an image as plain text data, which we can then use directly.
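For example, once an image has been scraped, PyTesseract can read the text out of it. A minimal sketch, assuming the pytesseract and Pillow packages are installed and that downloaded.jpg is a previously scraped image:

import pytesseract
from PIL import Image

# Extract the text content of a scraped image file.
text = pytesseract.image_to_string(Image.open('downloaded.jpg'))
print(text)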

Scraping a bunch of data from a website is a somewhat difficult job to do manually.
Web scraping is the process of extracting information from websites, and it involves downloading web pages.

Submitting a form goes like this (a sketch with the requests library follows the list):

  1. Go to the page
  2. View the source
  3. Find the form
  4. Note the action URL
  5. Make a note of the field names
  6. Make sure any honeypot fields will be handled properly
  7. Write a few lines of code to prepare the data for submission
  8. Submit to the correct URL
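As a sketch of steps 6–8, here is how the prepared data might be submitted with the requests library. The URL, the field names, and the honeypot field name are placeholders; a urllib version appears in the example further below:

import requests

# Field names and the action URL come from reading the page source.
payload = {
    'field1': 'value',
    'field2': 'value',
    'website': '',  # assumed honeypot field: left empty so the submission is not rejected
}
response = requests.post('http://example.com/form/submit/',
                         data=payload,
                         headers={'User-Agent': 'Mozilla something'})
print(response.status_code)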

BeautifulSoup is a parsing library that enables us to extract data from HTML and XML documents.
It gracefully detects the encoding and handles documents that contain special characters.
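A small sketch of that encoding detection; the byte string here is just an illustrative example:

from bs4 import BeautifulSoup

# Bytes containing a Latin-1 encoded special character; BeautifulSoup
# detects a workable encoding instead of failing on it.
html_bytes = '<html><body><p>café</p></body></html>'.encode('latin-1')
soup = BeautifulSoup(html_bytes, 'html.parser')
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(soup.p.text)             # café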

Scraping images with Python example:-

import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'field1': 'value', 'field2': 'value', 'field3': 'value'}).encode()
req = urllib.request.Request('http://example.com/form/submit/', data=data,
                             headers={'User-Agent': 'Mozilla something',
                                      'Cookie': 'name=value; name2=value2'})
response = urllib.request.urlopen(req)
Python can be used to download images, video, text, and audio from the web. The two functions in the code below do this for image search results:

download_baidu(word)
download_google(word)

Code to scrape images:-

import re
import os
import requests
from bs4 import BeautifulSoup

def download_baidu(keyword):
    # Search Baidu Images for the keyword and pull the image URLs out
    # of the raw page source with a regular expression.
    url = 'https://image.baidu.com/search?tn=baiduimage&ie=utf-8&word=' + keyword + '&ct=201326592&v=flip'
    result = requests.get(url)
    html = result.text
    pic_url = re.findall('"objURL":"(.*?)"', html, re.S)
    i = 0
    for each in pic_url:
        print(each)
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.ConnectionError:
            print('exception')
            continue
        string = 'pictures' + keyword + str(i) + '.jpg'
        fp = open(string, 'wb')
        fp.write(pic.content)
        fp.close()
        i += 1

def download_google(word):
    # Fetch a Google Images results page, parse it with BeautifulSoup,
    # and hand each img src to wget.
    url = 'https://www.google.com/search?q=' + word + '&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=x&ved=0ahUKEwig3Lox4PZKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    for raw_img in soup.find_all('img'):
        link = raw_img.get('src')
        os.system('wget ' + link)

if __name__ == '__main__':
    word = input('Input key word: ')
    download_baidu(word)

The following notes walk line by line through a separate CSV-based image downloader script (a sketch of the whole script follows the notes).

Lines 1–6:-

Import the libraries needed to run the code, giving BeautifulSoup the alias bs.
The requests library is used to fetch content from a given link. urllib.request is another package that helps in opening and reading URLs.
argparse allows parsing the arguments passed when the file is executed.
os provides functions for interacting with the file system. All of these packages except BeautifulSoup and requests are part of the Python standard library.
Lines 8–12:-

Initialize the argument parser and parse the filename argument.
Lines 14–21:-

os.getcwd() returns the path to the current working directory.
Split the .csv extension off the file name and join the result with the current working directory to form the output directory in which the images will be saved.
Lines 23–25:-

Use open to read the CSV file and split its contents on the delimiter; links then holds a list of links to image display pages.
Lines 27–28:-

Find the length of links and print it; that is the number of images that will be downloaded.
Lines 30–34:-

Create a function that accepts an image URL and downloads it.

Lines 36–39:-

Loop over each hyperlink href in the list of image display links and fetch each one using the get method of the requests library.
Lines 40–41:-

soup.find_all('meta', attrs={"name": "twitter:image"}) looks for all meta tags with that name attribute.
Line 41:-

Use the string replace method to modify the image link, then call the download_image function to download the image.
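The script these notes describe is not reproduced in this article, so here is a minimal sketch that matches the description. The alias bs, the comma-delimited CSV layout, and the variable names are assumptions:

import os
import argparse
import requests
from bs4 import BeautifulSoup as bs

# Lines 8-12: parse the CSV file name passed on the command line.
parser = argparse.ArgumentParser()
parser.add_argument('filename', help='CSV file of links to image display pages')
args = parser.parse_args()

# Lines 14-21: join the current working directory with the file name
# minus its .csv extension to form the output directory.
output_dir = os.path.join(os.getcwd(), os.path.splitext(args.filename)[0])
os.makedirs(output_dir, exist_ok=True)

# Lines 23-25: read the file and split it on the delimiter.
with open(args.filename) as f:
    links = f.read().split(',')

# Lines 27-28: report how many images will be downloaded.
print(len(links), 'images will be downloaded')

# Lines 30-34: accept an image URL and download it.
def download_image(image_url, save_path):
    pic = requests.get(image_url)
    with open(save_path, 'wb') as fp:
        fp.write(pic.content)

# Lines 36-41: fetch each display page, find the twitter:image meta
# tags, and download the image they point to.
for i, href in enumerate(links):
    page = requests.get(href).text
    soup = bs(page, 'html.parser')
    for meta in soup.find_all('meta', attrs={'name': 'twitter:image'}):
        image_link = meta.get('content')
        # The notes mention a str.replace() cleanup of the link here;
        # the exact substrings are not shown, so it is left out.
        download_image(image_link, os.path.join(output_dir, str(i) + '.jpg'))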

The image-scraper package:-

ImageScraper depends on requests, setproctitle, and pythreadpool, which can be downloaded and installed.

import image_scraper
image_scraper.scrape_images(URL)
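Assuming the package is published on PyPI under the name ImageScraper, it and its dependencies can be installed with pip:

$ pip install ImageScraper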

Options:-

-h, --help                 Print help
-s, --save-dir <path>      Name of the folder to save the images in
-m, --max-images <number>  Maximum number of images to be scraped
--formats <formats>        Image formats to be scraped
--dump-urls                Print the URLs of the images

Examples:-

Scrape all the images:
$ image-scraper ananth.co.in/test.html
Scrape a maximum of 2 images:
$ image-scraper -m 2 ananth.co.in/test.html
Scrape GIFs and download them to the folder ./mygifs:
$ image-scraper -s mygifs ananth.co.in/test.html --formats gif
