Problem description:

I need a regex that will pull stock symbols out of a list of words. More specifically, I need only the stock symbol itself (no prices, and no stray characters around the symbol such as @, ...., or #), and it should treat AMZN and amzn as the same symbol. Is this possible with a single regex?

Code:

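def read_file(fileName):
    # Read the whole file and split it into whitespace-separated words.
    with open(fileName) as f:
        return f.read().split()

def get_frequency(words):
    # Count every word containing "$" (this also counts prices such as $42).
    freq = {}
    for w in words:
        if "$" in w:
            freq[w] = freq.get(w, 0) + 1
    return freq

def print_frequency(words):
    for word, frequency in words.items():
        print('%s : %s' % (word, frequency))

def main():
    stringText = read_file('stock.txt')  # placeholder path to the tweet file
    print_frequency(get_frequency(stringText))

main()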

sample input:

@jimcramer @Taarriqq @AnthonyTanpoco A bit more specific on $GPRO . Thx

Stock Contest!! Pick $GLD and WIN a FREE Tablet!! Enter NOW! Click here for details:

http://t.co/gW8Rohq8TT $BBRD $TLT $MRK $GPRO ~

Stock Contest!! Pick $GDXJ and WIN a FREE Tablet!! Enter NOW! Click here for details:

http://t.co/ekGDWveFh2 $GRCU $LOCO $XIV $TSO ~

The Closing Bell is out! http://t.co/rQE910SvNL $EURUSD $GBPUSD $USDJPY $AUDUSD $SPY $TWTR $GPRO $YHOO

$LNKD $FB $AAPL $BRD $CAT $WLT $LNKD

@8DVolition great I will do the same. $GPRO full retard. Shorts have been prison raped. http://t.co/zOnc7WgjX0

@8DVolition I staking LOCO and $GPRO for a short but the infection point when you know the boyzzz have bailed hasn't happened yet.

@8DVolition and fucking $LOCO

@8DVolition I mean look at $GPRO that pos http://t.co/ULQtdLAyvZ Verified $181.97 loss in $RADA Cut my losses too quickly.

Stock ran the next day after I exited.todays winners against the bearishness - $AMZN $LULU $KORS $WSM $POT $AGN $LOCO $SCTY $FSLR $EBAY $UVXY $RUBI & all financials

Multimillionaire trader SUPERTRADES and his EASY to replicate strategy can make you money! http://t.co/Ho9ydXHTWl $ARWR $TWTR $LOCO

Just checked $GPRO ..$25 to $70 wow!

http://t.co/CLn9obslnu Verified $2,503.58 profit in $GPRO Week of 9/8

http://t.co/KaXsGKaX5v Verified $1,495.17 profit in $GPRO Week of 9/8

http://t.co/hQoJG9hjpq Verified $398.90 loss in $GPRO Week of 9/8

http://t.co/5lbEFFZOTl Verified $585.66 profit in $GPRO Week of 9/8

@JPelletier22 damn right bro, need you for that. We should've bought you some $LOCO!

http://t.co/rcHjORcFpf Verified $4,293.01 loss in $RADA Week of 9/8

@8DVolition have to tread lightly though...a la $GPRO widowmaker

$GPRO will only continue to grow over the next 10 years. This will be a $100 stock within 6 months from now.

Stock Contest!! Pick $EBAY and WIN a FREE Tablet!! Enter NOW! Click here for details: http://t.co/Zhx90b0JuP $PG $CENX $BRK/B $GPRO ~

Watch This FREE VIDEO On How We Made $100,000+ http://t.co/D4ZEMzcp6W on $NETE $OTIV $ISNS $RADA learn $TWTR $STUDY

Multimillionaire trader SUPERTRADES and his EASY to replicate strategy can make you money! http://t.co/Ho9ydXHTWl $ARWR $TWTR $LOCO

Stock Contest!! Pick $DIS and WIN a FREE Tablet!! Enter NOW! Click here for details: http://t.co/CjtKbbArjo $VGZ $PLUG $AA $LOCO ~

$BABA so much buying will go into this will trend trade it for 10 points and scalp it every day for 3 points = 15 points a week $GPRO style

Another $20,000ish profit week for me thanks to $ISNS $NETE $OTIV $RADA $EKSO now time for some weekend beach/fun! http://t.co/bP2IhYIu79

For $147/month you get @super_trades' nightly watchlist & LIVE chatroom access: $AAPL $TWTR $FB $BABA $YHOO $LOCO $GPRO $TSLA $GOOG $LNKD

sample output:

$BIIB : 2

$THRX : 1

$CNE.TO : 1

$nflx : 4

$THRM : 2

$GPRO,...Fully : 1

$EFOI : 17

$4. : 2

$ILMN : 1

$0.10 : 1

$XLY : 7

$EXC : 2

$XLE : 3

$XLF : 11

$48 : 1

$XLB : 3

$1,000,000 : 1

$42 : 4

$40 : 3

$47 : 1

$XLI : 1

$45 : 4

$XLK : 2

$SCOK; : 1

$EXEL... : 1

$VALE : 7

$IVDN : 2

$Gpro : 2

$AEO : 1

$AEM : 2

$SCOK. : 3

$SCOK, : 14

$blue, : 1

$GIG : 1

$UNH : 1

$UNG : 2

Answer:

From what I see, all your stock symbols start with $ and appear as independent words. That makes regex unnecessary. By avoiding regex, this solution should be faster:

from collections import Counter

with open('input') as f:
    words = f.read().upper().split()
# Keep only words of the form "$" followed by letters.
symbols = [word for word in words if len(word) > 1 and word[0]=="$" and word[1:].isalpha()]
freqs = Counter(symbols)
for key in sorted(freqs):
    print('%-8s : %3i' % (key, freqs[key]))

The frequency data is obtained here with Counter from the collections module, which was added in Python 2.7 and 3.1. If you are using an earlier version, build the counts with a plain dictionary instead:

with open('input') as f:
    words = f.read().upper().split()
symbols = [word for word in words if len(word) > 1 and word[0]=="$" and word[1:].isalpha()]
# Plain-dict replacement for Counter.
freqs = {}
for sym in symbols:
    freqs[sym] = freqs.get(sym, 0) + 1
for key in sorted(freqs):
    print('%-8s : %3i' % (key, freqs[key]))

The first several lines of the output look like:

$AA      :   1
$AAPL    :   2
$AGN     :   1
$AMZN    :   1
$ARWR    :   2
$AUDUSD  :   1
$BABA    :   2
$BBRD    :   1
$BRD     :   1

Notes:

  • with open('input') as f:

    This construct ensures that the file is closed as soon as the with block exits.

  • words = f.read().upper().split()

    This reads the file, converts all alphabetical characters to upper case, and then splits the text into words.

  • symbols = [word for word in words if len(word) > 1 and word[0]=="$" and word[1:].isalpha()]

    This selects the symbols from the words by requiring that (1) they are at least two characters long, counting the dollar sign, (2) they start with a dollar sign, and (3) everything after the dollar sign is alphabetic. This test eliminates the need for regex. Note that a word with punctuation attached, such as $SCOK followed directly by a comma, fails test (3) and is not counted.
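
The three tests in the notes can be wrapped in a single function. What follows is a minimal sketch written for Python 3, not the answer's exact code: the name count_symbols and the optional strip_punctuation step are assumptions added for illustration. The stripping step treats trailing punctuation (as in "$SCOK," or "$EXEL...") as noise, which may or may not suit your data.

```python
from collections import Counter
import string

def count_symbols(text, strip_punctuation=True):
    """Count $SYMBOL words in text, case-insensitively, without regex."""
    counts = Counter()
    for word in text.upper().split():
        if strip_punctuation:
            # Assumption: trailing punctuation such as "," or "..." is noise.
            word = word.rstrip(string.punctuation)
        # Same three tests as in the notes: length, leading "$", letters only.
        if len(word) > 1 and word[0] == "$" and word[1:].isalpha():
            counts[word] += 1
    return counts

print(count_symbols("$GPRO $gpro, $70"))  # Counter({'$GPRO': 2})
```

With strip_punctuation=False the function behaves like the list comprehension above, so "$gpro," would be dropped rather than counted.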

Answer:

Use Counter to keep track of the stock counts, and pass re.I to make the regular expression case-insensitive:

>>> import re
>>> from collections import Counter
>>> exp = r'\$([A-Z]{4})'
>>> stocks = Counter()
>>> with open('stock.txt') as f:
...     for line in f:
...         stocks.update(re.findall(exp, line, re.I))
...
>>> stocks.most_common()
[('GPRO', 16), ('LOCO', 8), ('TWTR', 5), ('RADA', 4), ('LNKD', 3), ('AAPL', 2), ('ISNS', 2), ('OTIV', 2), ('BABA', 2), ('NETE', 2), ('YHOO', 2), ('EBAY', 2), ('ARWR', 2), ('GOOG', 1), ('AUDU', 1), ('TSLA', 1), ('AMZN', 1), ('KORS', 1), ('PLUG', 1), ('CENX', 1), ('GBPU', 1), ('STUD', 1), ('FSLR', 1), ('EURU', 1), ('RUBI', 1), ('LULU', 1), ('USDJ', 1), ('GDXJ', 1), ('GRCU', 1), ('EKSO', 1), ('BBRD', 1), ('SCTY', 1), ('UVXY', 1)]

John had some good points in the comments, and with his suggestion, here is the update (which also picked up $FB):

>>> exp = r'\$([A-Z]{1,4})'
>>> stocks = Counter()
>>> with open('stock.txt') as f:
...     for line in f:
...         stocks.update(list(map(str.upper, re.findall(exp, line, re.I))))
...
>>> stocks.most_common()
[('GPRO', 16), ('LOCO', 8), ('TWTR', 5), ('RADA', 4), ('LNKD', 3), ('AAPL', 2), ('FB', 2), ('ISNS', 2), ('NETE', 2), ('YHOO', 2), ('OTIV', 2), ('EBAY', 2), ('BABA', 2), ('ARWR', 2), ('TSO', 1), ('AUDU', 1), ('VGZ', 1), ('TSLA', 1), ('AGN', 1), ('GLD', 1), ('CAT', 1), ('DIS', 1), ('WSM', 1), ('AMZN', 1), ('PLUG', 1), ('SPY', 1), ('CENX', 1), ('POT', 1), ('GBPU', 1), ('GOOG', 1), ('PG', 1), ('STUD', 1), ('RUBI', 1), ('BRK', 1), ('KORS', 1), ('AA', 1), ('EURU', 1), ('TLT', 1), ('WLT', 1), ('LULU', 1), ('USDJ', 1), ('GDXJ', 1), ('GRCU', 1), ('XIV', 1), ('MRK', 1), ('BBRD', 1), ('FSLR', 1), ('EKSO', 1), ('SCTY', 1), ('UVXY', 1), ('BRD', 1)]
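
In the output above, symbols longer than four letters are cut short: $AUDUSD is counted as AUDU and $STUDY as STUD. One possible variation, sketched here for Python 3 under the assumption that a symbol is any run of letters after the dollar sign, drops the length limit and requires a word boundary instead. The names SYMBOL_RE and count_tickers are made up for this example. Symbols containing a separator, such as $BRK/B or $CNE.TO, would still be reduced to the part before the separator.

```python
import re
from collections import Counter

# Assumed pattern: "$", then letters only, ending at a word boundary.
# Prices such as $181.97 do not match because a letter must follow "$".
SYMBOL_RE = re.compile(r'\$([A-Z]+)\b', re.IGNORECASE)

def count_tickers(lines):
    """Count ticker symbols over an iterable of lines, ignoring case."""
    stocks = Counter()
    for line in lines:
        # Upper-case each match so that $gpro and $GPRO share one key.
        stocks.update(m.upper() for m in SYMBOL_RE.findall(line))
    return stocks
```

Because a file object iterates over its lines, count_tickers(f) can be called inside a with open(...) as f: block in the same way as the loop above.
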
Answer:

A RegExp that grabs the pattern you've requested:

re.findall("[A-Z]{4}|[a-z]{4}", string)

It matches runs of 4 uppercase letters or 4 lowercase letters. (re.match would only test the beginning of the string, so re.findall is used to scan all of it.) However, if your string contains ordinary text, it will also match parts of longer words.

If that's the case, you should collect the names of all the stock symbols you're trying to grab and check each candidate word against that collection.
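
Checking words against a known list could look like the following Python 3 sketch. Everything named here is hypothetical: KNOWN_SYMBOLS is a tiny made-up whitelist (a real one would come from an exchange listing), and count_known is an illustrative helper. Note that this approach also counts a bare LOCO without a dollar sign, and that ordinary words that happen to be tickers would be counted too.

```python
from collections import Counter

# Hypothetical whitelist; replace with the symbols you actually track.
KNOWN_SYMBOLS = {"GPRO", "LOCO", "TWTR", "AMZN"}

def count_known(text, known=KNOWN_SYMBOLS):
    """Count words that, once cleaned up, appear in the whitelist."""
    counts = Counter()
    for word in text.split():
        # Drop "$" and common surrounding punctuation, then upper-case.
        candidate = word.strip("$.,;:!?()").upper()
        if candidate in known:
            counts[candidate] += 1
    return counts
```

A set is used for the whitelist so that each membership test takes roughly constant time regardless of how many symbols are tracked.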

Answer:
(\$[\w,]+\b)

You can try this. It will return a list of all $words, and you can count from there.

See demo.

http://regex101.com/r/jT3pG3/42
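
A minimal Python 3 sketch of "counting from there" is shown below. The pattern is the one from this answer; the function name count_dollar_words and the extra letters-only filter are assumptions added here, since the pattern on its own also returns prices such as $1,495.

```python
import re
from collections import Counter

# Pattern from the answer: "$" followed by word characters or commas.
DOLLAR_WORD_RE = re.compile(r'(\$[\w,]+\b)')

def count_dollar_words(text):
    """Return the raw matches and a case-insensitive count of symbols."""
    matches = DOLLAR_WORD_RE.findall(text)
    # Assumed extra filter: keep only tokens that are all letters after "$".
    symbols = [m.upper() for m in matches if m[1:].isalpha()]
    return matches, Counter(symbols)

matches, counts = count_dollar_words("Verified $1,495.17 profit in $GPRO $gpro, $70")
print(matches)  # ['$1,495', '$GPRO', '$gpro', '$70']
print(counts)   # Counter({'$GPRO': 2})
```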
