Extract URLs in JavaScript, Python, and Bash: Quick Recipes


JavaScript (browser or Node)

  • Use a simple regex to find http/https URLs:

javascript

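const text = "Visit https://example.com/page?x=1 and http://sub.example.org.";
// matchAll returns an iterator, so spread it into an array before mapping.
// Note: the sentence-ending "." is captured as part of the second URL.
const urls = [...text.matchAll(/https?:\/\/[^\s"'<>]+/gi)].map(m => m[0]);
console.log(urls);

  • To extract hrefs from HTML in the browser: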

javascript

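// Select only anchors that actually carry an href attribute.
const anchors = Array.from(document.querySelectorAll('a[href]'));
// a.href returns the resolved absolute URL, not the raw attribute text.
const hrefs = anchors.map(a => a.href);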

Python

  • Simple regex to extract full URLs:

python

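import re

text = "See https://example.com and http://sub.example.org/page"
# Match http/https, then everything up to whitespace, a quote, or an angle bracket.
pattern = r'https?://[^\s"\'<>]+'
urls = re.findall(pattern, text)
print(urls)

  • Use BeautifulSoup (optionally with urllib.parse to resolve relative links) for HTML-safe extraction: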

python

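from bs4 import BeautifulSoup

# Sample markup; the original string was lost, so these hrefs are illustrative.
html = '<a href="https://example.com">link</a> <a href="http://sub.example.org/page">ext</a>'
soup = BeautifulSoup(html, "html.parser")
# href=True skips anchors that have no href attribute.
urls = [a.get('href') for a in soup.find_all('a', href=True)]
print(urls)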
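When BeautifulSoup is not installed, or when the hrefs are relative, roughly the same result can be sketched with the standard library alone. The names `LinkCollector` and `extract_links`, the sample markup, and the base URL are assumptions made for illustration; `html.parser.HTMLParser` and `urllib.parse.urljoin` are the only library pieces used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs untouched and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return the hrefs found in html, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input: one relative link, one absolute link, one anchor without href.
sample = '<a href="/page">link</a> <a href="https://sub.example.org/x">ext</a> <a>no href</a>'
print(extract_links(sample, base_url="https://example.com/docs/"))
# ['https://example.com/page', 'https://sub.example.org/x']
```

Resolving against a base URL matters mostly for scraped pages, where hrefs such as `/page` are meaningless without the address of the page they came from.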

Bash

  • Using grep (PCRE) to extract http/https links from a file or stdin:

bash

# GNU grep with -P (PCRE) and -o to print only matches
grep -oP 'https?://[^\s"'\''<>]+' file.txt

  • Using awk (more portable than grep -P):

bash

awk '{
  # Repeatedly find the next URL, print it, then continue after the match.
  while (match($0, /https?:\/\/[^ \t"'\''<>]+/)) {
    print substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
}' file.txt

Notes

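  • The regexes above work for common cases but mishandle some edge cases: trailing punctuation (such as a sentence-ending period or comma) is captured as part of the URL, and a URL wrapped in parentheses picks up the closing parenthesis.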
  • For HTML, prefer HTML parsers (DOM in JS, BeautifulSoup in Python) over regex.
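The trailing-punctuation and parenthesis cases mentioned in the notes can be handled with a small post-processing step. The following is a minimal sketch, not a complete URL grammar: `clean_url`, `extract_urls`, and the set of characters in `TRAILING` are assumptions chosen for illustration, and the regex is the same one used in the Python recipe.

```python
import re

# Same pattern as the Python recipe above.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')
# Punctuation treated as sentence punctuation when it ends a match (an assumption).
TRAILING = ".,;:!?"


def clean_url(url):
    """Strip trailing punctuation and unbalanced closing parentheses."""
    while url:
        if url[-1] in TRAILING:
            url = url[:-1]
        elif url[-1] == ")" and url.count(")") > url.count("("):
            # More ")" than "(" means the last one belongs to the surrounding text.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text):
    """Find http/https URLs in text and tidy up each match."""
    return [clean_url(m) for m in URL_RE.findall(text)]


text = "See (https://example.com/a_(b)), then http://sub.example.org."
print(extract_urls(text))
# ['https://example.com/a_(b)', 'http://sub.example.org']
```

The trade-off is that a URL which genuinely ends in one of the stripped characters loses it, so this heuristic suits prose input better than machine-generated text.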
