{"id":65125,"date":"2022-06-16T09:00:36","date_gmt":"2022-06-16T09:00:36","guid":{"rendered":"https:\/\/www.cryptocabaret.com\/?p=65125"},"modified":"2022-06-16T09:00:36","modified_gmt":"2022-06-16T09:00:36","slug":"analyze-web-pages-with-python-requests-and-beautiful-soup","status":"publish","type":"post","link":"https:\/\/www.cryptocabaret.com\/?p=65125","title":{"rendered":"Analyze web pages with Python requests and Beautiful Soup"},"content":{"rendered":"<p><span class=\"field field--name-title field--type-string field--label-hidden\">Analyze web pages with Python requests and Beautiful Soup<\/span><br \/>\n<span class=\"field field--name-uid field--type-entity-reference field--label-hidden\"><a title=\"View user profile.\" href=\"https:\/\/opensource.com\/users\/seth\" class=\"username\">Seth Kenlon<\/a><\/span><br \/>\n<span class=\"field field--name-created field--type-created field--label-hidden\">Thu, 06\/16\/2022 &#8211; 03:00<\/span><\/p>\n<div data-drupal-selector=\"rate-node-70074\" class=\"rate-widget-thumbs-up\" title=\"Register or Login to like.\">\n<div class=\"rate-thumbs-up-btn-up vote-pending\">1 reader likes this<\/div>\n<div class=\"rate-score\">1 reader likes this<\/div>\n<\/div>\n<div class=\"clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item\">\n<p>Browsing the web probably accounts for much of your day. But it&#8217;s an awfully manual process, isn&#8217;t it? You have to open a browser. Go to a website. Click buttons, move a mouse. It&#8217;s a lot of work. 
Wouldn't it be nicer to interact with the Internet through code?

You can get data from the Internet using Python with the help of the Python module `requests`:

```python
import requests

DATA = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"
PAGE = requests.get(DATA)

print(PAGE.text)
```

In this code sample, you first import the module `requests`. Then you create two variables: one called `DATA`, which holds the URL you want to download. In a later version of this code, you'll be able to provide a different URL each time you run the application. For now, though, it's easiest to "hard code" a test URL for demonstration purposes.

The other variable is `PAGE`, which you set to the response of the `requests.get` function when it reads the URL stored in `DATA`. The `requests` module and its `.get` function are pre-programmed to "read" an Internet address (a URL), access the Internet, and download whatever is located at that address.

That's a lot of steps you don't have to figure out on your own, and that's exactly why Python modules exist.
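The sample above assumes the download always succeeds, but in practice `requests.get` can time out, hit a broken URL, or receive an error page. A minimal defensive sketch (the `fetch` helper and its name are my own addition, not part of the article's code):

```python
import requests

def fetch(url):
    """Download a page and return its text, or None if anything goes wrong."""
    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException:
        return None
    return page.text

# A malformed URL yields None instead of a crash:
print(fetch("not a valid url"))  # None
```

`requests.RequestException` is the base class for the library's errors, so one `except` clause covers connection failures, timeouts, and bad status codes alike.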
Finally, you tell Python to `print` everything that `requests.get` has stored in the `.text` attribute of the `PAGE` variable.

## Beautiful Soup

If you run the sample code above, the contents of the example URL are dumped indiscriminately into your terminal, because the only thing your code does with the data `requests` has gathered is print it. It's more interesting to parse the text.

Python can "read" text with its most basic functions, but parsing text allows you to search for patterns, specific words, HTML tags, and so on. You could parse the text returned by `requests` yourself, but using a specialized module is much easier. For HTML and XML, there's the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library.

This code accomplishes the same thing, but it uses Beautiful Soup to parse the downloaded text.
Because Beautiful Soup recognizes HTML entities, you can use some of its built-in features to make the output a little easier for the human eye to parse.

For instance, instead of printing raw text at the end of your program, you can run the text through the `.prettify` function of Beautiful Soup:

```python
from bs4 import BeautifulSoup
import requests

PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')

if __name__ == '__main__':
    print(SOUP.prettify())
```

The output of this version of your program ensures that every opening HTML tag starts on its own line, with indentation to help show which tag is a parent of another tag. Beautiful Soup is aware of HTML tags in more ways than just how it prints them.

Instead of printing the whole page, you can single out a specific kind of tag.
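You don't need a live download to see what `prettify` does. Feeding Beautiful Soup a small inline string (invented here purely for illustration) shows the same one-tag-per-line indentation:

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# Each tag lands on its own line, indented under its parent tag.
print(soup.prettify())
```

Experimenting on tiny hand-written snippets like this is a quick way to learn how the parser sees a document before pointing it at a real page.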
For instance, try changing the print selector from `print(SOUP.prettify())` to this:

```python
print(SOUP.p)
```

This prints just a `<p>` tag. Specifically, it prints just the first `<p>` tag encountered. To print all `<p>` tags, you need a loop.

## Looping

Create a `for` loop to cycle over the entire webpage contained in the `SOUP` variable, using the `find_all` function of Beautiful Soup.
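The difference between the single-tag shortcut and `find_all` is easy to see on a tiny inline document (again, invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.p)                   # <p>one</p> -- only the first match
print(len(soup.find_all("p")))  # 2 -- find_all returns every match
```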
It's not unreasonable to want to use your loop for other tags besides just the `<p>` tag, so build it as a custom function, designated by the `def` keyword (for "define") in Python:

```python
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG)
```

The temporary variable `TAG` is arbitrary. You can use any term, such as `ITEM` or `i` or whatever you want. Each time the loop runs, `TAG` contains one result from the `find_all` function. In this code, the `<p>` tag is being searched for.

A function doesn't run unless it's explicitly called.
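Since `find_all` already takes the tag name as an argument, one way to generalize the function is to pass that name in (a sketch; the `tag_name` parameter is my own addition, not the article's code):

```python
from bs4 import BeautifulSoup

SOUP = BeautifulSoup("<p>text</p><a href='#'>link</a>", "html.parser")

def loopit(tag_name):
    # Print every tag of the requested kind.
    for TAG in SOUP.find_all(tag_name):
        print(TAG)

loopit("p")  # prints <p>text</p>
loopit("a")  # prints <a href="#">link</a>
```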
You can call your function at the end of your code:

```python
if __name__ == '__main__':
    loopit()
```

Run your code to see all `<p>` tags and each one's contents.

## Getting just the content

You can exclude tags from being printed by specifying that you want just the "string" (programming lingo for "words"):

```python
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG.string)
```

Of course, once you have the text of a webpage, you can parse it further with the standard Python string libraries.
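One caveat worth knowing before the next step: `.string` is `None` whenever a tag has more than one child, while `get_text()` gathers all the nested text. A small offline sketch (the sample HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>plain</p><p>has <b>bold</b> text</p>", "html.parser")
first, second = soup.find_all("p")

print(first.string)       # plain
print(second.string)      # None -- the tag has several children
print(second.get_text())  # has bold text
```

This is why the word-counting examples below guard against `None` before calling string methods.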
For instance, you can get a word count using `len` and `split`:

```python
def loopit():
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            print(len(TAG.string.split()))
```

This prints the number of words within each paragraph element, omitting those paragraphs that don't have a string.
To get a grand total, use a variable and some basic math:

```python
def loopit():
    NUM = 0
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            NUM = NUM + len(TAG.string.split())
    print("Grand total is", NUM)
```

## Python homework

There's a lot more information you can extract with Beautiful Soup and Python.
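If you prefer, the running-total loop above can be collapsed into a single `sum` over a generator expression. This is an equivalent alternative, not the article's version, shown here on an invented inline document:

```python
from bs4 import BeautifulSoup

SOUP = BeautifulSoup("<p>one two</p><p>three</p><p><b>no</b> string</p>", "html.parser")

total = sum(
    len(tag.string.split())
    for tag in SOUP.find_all('p')
    if tag.string is not None
)
print("Grand total is", total)  # Grand total is 3
```

The `if` clause plays the same role as the `is not None` guard in the loop: the third paragraph has several children, so its `.string` is `None` and it contributes nothing to the total.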
Here are some ideas on how to improve your application:

- [Accept input](https://opensource.com/article/17/3/python-tricks-artists-interactivity-Python-scripts) so you can specify what URL to download and analyze when you launch your application.
- Count the number of images (`<img>` tags) on a page.
- Count the number of images (`<img>` tags) within another tag (for instance, only images that appear in the `<main>` div, or only images following a certain tag).

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.